The case for machine learning in network security analysis

Security tends to scale badly with complexity. As information, applications and systems become more sophisticated, so too do the challenges of assuring Confidentiality, Integrity and Availability. Machine assistance is emerging as one of the most important areas of data science, and much of it is underpinned by techniques from machine learning and statistics. Using these techniques we are developing characterisation engines that provide baseline and anomaly detection capabilities, alleviating the need for explicit signatures to be written. Such techniques will become increasingly important as the range and capability of networked devices increases, as evidenced by the recent exploitation of IoT devices to mount large-scale Distributed Denial of Service (DDoS) campaigns. Because it encompasses all areas of systems design and engineering, information security presents a unique problem space in which these techniques can be developed and tested.

In the context of information security, the possible applications of machine learning are many and varied. Security analysis of network data is complex, involving datasets that are large in terms of both their volume and density. We have been working on how machine learning can be used to support human analysts in making sense of the data produced by the systems they monitor, and in helping them to make decisions based on the insights they draw from this analysis. We have developed a set of bespoke classification, clustering and anomaly detection techniques in Python, SPL and C to characterise networks based on the events taking place within them. These include:

  • Applying multiple learning models, including semi-supervised, unsupervised and reinforcement learning;
  • A classifier to detect interactive webshell traffic using only low-resolution data sources, such as NetFlow or IPFIX;
  • Domain-specific distance functions to express
  • A heuristic anomaly detection algorithm built using probabilistic techniques and operating at medium dimensionality; and
  • A set of clustering techniques built purely for network datasets, capable of processing both summary and full-capture (such as PCAP) data.
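To make the idea of baseline and anomaly detection without explicit signatures more concrete, the following is a minimal sketch only. It is not the heuristic algorithm listed above: it swaps in a generic robust statistic (the modified z-score, built on the median and the median absolute deviation) and applies it to a single feature. The flow records, the field names `src` and `bytes`, and the threshold of 3.5 are all assumptions made for illustration.

```python
from statistics import median


def robust_zscores(values):
    """Return modified z-scores based on the median and the MAD.

    The median and median absolute deviation (MAD) act as the baseline;
    unlike the mean and standard deviation, they are not dragged towards
    the very outliers we are trying to detect.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical, so this statistic
        # cannot rank deviations; a fuller tool would need a fallback.
        return [0.0 for _ in values]
    # 0.6745 rescales the MAD so scores are comparable to ordinary
    # z-scores when the data is roughly normally distributed.
    return [0.6745 * (v - med) / mad for v in values]


def flag_anomalies(flows, threshold=3.5):
    """Return the flows whose byte count deviates strongly from the baseline."""
    scores = robust_zscores([flow["bytes"] for flow in flows])
    return [flow for flow, score in zip(flows, scores) if abs(score) > threshold]


# Hypothetical flow summaries, loosely in the spirit of NetFlow records.
flows = [
    {"src": "10.0.0.1", "bytes": 500},
    {"src": "10.0.0.2", "bytes": 520},
    {"src": "10.0.0.3", "bytes": 480},
    {"src": "10.0.0.4", "bytes": 510},
    {"src": "10.0.0.5", "bytes": 495},
    {"src": "10.0.0.6", "bytes": 505},
    {"src": "10.0.0.7", "bytes": 50000},
]

anomalous = flag_anomalies(flows)
```

Here only the 50,000-byte flow is flagged, and no signature describing it had to be written in advance. A practical detector would work across many features at once and maintain separate baselines per host, service or time window.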

The overriding purpose of these techniques is to develop re-usable but domain-specific tools that help analysts make greater use of their monitoring data. An important aspect of this work is coping with environments where the data can be incomplete and multivariate, and can represent very different types of infrastructure.

The Curse of Dimensionality

In common with many other ‘Big Data’-like problems, security analysis can quickly fall foul of the Curse of Dimensionality: the problem escalates into (potentially very) high-dimensional spaces, bringing with it significant computability and scalability challenges. Analysis of network datasets often leads to a comparatively large number of inferred dimensions being built atop native data points such as IP addresses, directionality, byte counts and protocols used. In Emerging Technology we have developed a corresponding set of dimension reduction techniques specifically for use within this area, notably by domain-tuning Principal Component Analysis and the non-linear manifold learning technique t-distributed Stochastic Neighbour Embedding. These techniques allow the user to better manage their data, and also open up the problem space to techniques that would otherwise be infeasible.
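As a rough illustration of dimension reduction on flow-derived features, the sketch below implements plain Principal Component Analysis with power iteration, using only the Python standard library. It is not the domain-tuned variant described above, and it finds only the single leading component. The three feature columns (packet count, byte count and a small noise term) are hypothetical values chosen so that two of the dimensions are almost perfectly correlated.

```python
import math
import random


def centre(rows):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that have already been centred."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200, seed=0):
    """Approximate the leading eigenvector of cov by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() for _ in cov]
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, component):
    """Collapse each centred row onto a single principal axis."""
    return [sum(x * c for x, c in zip(row, component)) for row in rows]


# Hypothetical per-flow features: packets, bytes (about 2x packets), noise.
features = [
    [1.0, 2.0, 0.1],
    [2.0, 4.0, -0.1],
    [3.0, 6.0, 0.1],
    [4.0, 8.0, -0.1],
]

centred = centre(features)
component = top_component(covariance(centred))
reduced = project(centred, component)  # one value per flow instead of three
```

Because the first two columns move together, the leading component points almost exactly along that shared direction, and a single coordinate per flow retains nearly all of the variance. In practice a numerical library would be used in place of hand-written linear algebra, and a non-linear method such as t-SNE may be preferable when the structure does not lie along straight axes.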

Next steps…

An important component of this work is the continued exploration of techniques that support human analysis. We are now actively researching and developing techniques based on Cellular Automata (CA) to model packet properties and movements through a network. The purpose is to determine the value of evolutionary computing techniques, and to explore how CA may be used as the basis of a new type of visualisation engine. Finally, we are beginning to explore the use of the IBM 5Q in developing quantum algorithms for large-scale network event monitoring, which is in many respects the many-body problem of network security.