16 January 2018

MLAD: Machine Learning for Anomaly Detection

Modern industrial control systems (ICS) are cyber-physical systems that include both IT infrastructure and operational technology (OT) infrastructure. Attacks on OT pose the greatest danger and are very difficult to detect. Machine Learning for Anomaly Detection (MLAD) technology is designed to protect OT.

The nature of attacks on ICS

The main purpose of an industrial control system (ICS) is to ensure the continuity of OT processes. Any OT failure can very quickly cause irreversible damage to equipment and result in enormous financial losses. That’s why, in addition to stepping up protection of the digital environment (IT), it’s essential to protect OT – the main functionality of an industrial control system.

Attacks on the IT infrastructure of an ICS may result in unusual behavior by programs, communications or even equipment. Some attacks on the digital environment may actually target the OT infrastructure (spoofing of digital sensor data and commands, modification of the digital control logic, denial of service, etc.). An attacker may also find a vulnerability in the digital environment of a cyber-physical system and exploit it to attack OT processes.

Attacks that target the OT infrastructure may cause errors in sensor data, control commands or the control logic. Some attacks on OT may not be launched from the digital environment but have a purely physical nature (a blocked valve, a disconnected sensor, or the connection of a fake sensor).

While attacks on OT pose the greatest danger to an industrial facility, detecting them is an extremely complicated task. Our MLAD technology helps to improve the detection of attacks on OT using machine learning. The technology provides an important additional layer in the protection of cyber-physical systems and is designed to protect OT processes – regardless of the nature of an attack.

Protection based on analyzing telemetry

As technologies evolve, the number of threats to industrial systems keeps growing. At the same time, protection of OT in industrial control systems is still based on expert rules – essentially, signatures. However, providing reliable protection based on rules alone is not feasible. Relying on industrial systems being isolated from other networks is also a mistake: to work efficiently, an industrial enterprise requires interaction between its own facilities and with its partners. This interaction often involves transferring error-sensitive industrial process data.

Machine learning can be effectively used to protect OT, because the operating conditions, input materials, and production objectives often change – and machine learning can keep up with the changes. By contrast, reconfiguring an expert system just as quickly is difficult and expensive.

Machine learning-based cybersecurity products available today analyze executable code and network communication data. There are virtually no solutions that analyze data at the OT process application level.

Unlike most existing technologies, MLAD operates at the level of industrial system signals.

ICS telemetry signals have the following key features:

  • Very large number of telemetry signals (typically, tens of thousands of different tags);
  • High update frequency (typically, 10 times/sec);
  • Extensive history (can be accumulated for years);
  • ‘Noisy’ data (measurement errors, missing data, data received at irregular intervals, different types of spikes, etc.);
  • Interconnection between different signals (based on control logic and physical laws).
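The "noisy data" point usually means telemetry has to be put onto a regular time grid before a model can use it. The article does not describe MLAD's preprocessing, so the following is only a minimal sketch of one common approach, forward-filling irregular samples onto a fixed step. The function name `resample_forward_fill` and the sample readings are hypothetical.

```python
from bisect import bisect_right


def resample_forward_fill(samples, start, stop, step):
    """Resample irregular (timestamp, value) pairs onto a regular grid.

    Each grid point takes the most recent value observed at or before it
    (forward fill); grid points before the first sample get None.
    """
    samples = sorted(samples)
    times = [t for t, _ in samples]
    n_steps = int(round((stop - start) / step))
    grid = []
    for i in range(n_steps + 1):
        t = start + i * step
        idx = bisect_right(times, t) - 1  # last sample at or before t
        grid.append((t, samples[idx][1] if idx >= 0 else None))
    return grid


# Hypothetical readings of one tag, arriving at irregular times (seconds).
raw = [(0.00, 10.0), (0.13, 10.2), (0.31, 10.1), (0.58, 10.4)]
regular = resample_forward_fill(raw, start=0.0, stop=0.5, step=0.1)
```

Forward fill is only one option; interpolation or explicit missing-value markers may suit some signals better.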

The last point is key to the approach we have developed. The signals (the values of sensors, commands, and control logic parameters) are closely interconnected and this interconnection is determined by physical laws and the logic of industrial processes.

At a large industrial facility, the number of such interconnections between signals is enormous. Even a very experienced engineer may not be aware of all of them. They were established when designing the ICS control logic and are defined by the operating conditions, input parameters and other factors.

Such interconnections between signals mean that an attack on any part of the signals or on ICS components inevitably affects other industrial signals. Machine learning can ‘learn’ these correlations and subsequently detect changes caused by attacks.

MLAD collects and analyzes the values of sensors, actuators and setpoints, and detects deviations from the normal behavior of industrial processes. This is done irrespective of who introduced changes to a process, how they did it, and whether it was done intentionally or accidentally. Even if an attacker’s activity doesn’t manifest itself at the network level, MLAD detects the attack’s manifestations at the industrial process level and visualizes the deviations in telemetry terms that are easy for the operator to understand.

MLAD: anomaly detector

Using correlations between industrial signals, MLAD trains a recurrent neural network to recognize signal behavior under normal operating conditions. The MLAD algorithm is based on a Long Short-Term Memory (LSTM) recurrent neural network. The data is presented as a multivariate time series.
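To make "presented as a multivariate time series" concrete, here is a minimal sketch of how such a series is commonly cut into training pairs for a predictive recurrent model: a window of past steps as input and the following steps as the target. The window sizes, the helper name `make_windows` and the sample values are illustrative assumptions, not details of MLAD.

```python
def make_windows(series, history, horizon):
    """Split a multivariate series into (input window, target window) pairs.

    series  : list of time steps, each a list with one value per signal
    history : number of past steps the model sees
    horizon : number of future steps the model must predict
    """
    pairs = []
    last_start = len(series) - history - horizon
    for start in range(last_start + 1):
        past = series[start:start + history]
        future = series[start + history:start + history + horizon]
        pairs.append((past, future))
    return pairs


# Hypothetical telemetry: 6 time steps, 2 signals (e.g. a flow and a pressure).
telemetry = [[1.0, 20.0], [1.1, 20.5], [1.2, 21.0],
             [1.3, 21.5], [1.4, 22.0], [1.5, 22.5]]
pairs = make_windows(telemetry, history=3, horizon=2)
```

Each pair keeps all signals together at every time step, which is what lets a model learn the relationships between signals rather than each signal in isolation.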

After being trained, MLAD predicts the values of all signals in real time for a certain future time interval and compares them with observed values. If the prediction error is greater than a statistically matched threshold defined at the training stage, MLAD detects an anomaly and sends an alert.
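The comparison step can be sketched as follows. The article does not say how the threshold is matched statistically, so this example assumes a simple rule (mean of the training errors plus `k` standard deviations) and a mean squared error across signals. Both choices, and all names and numbers, are assumptions for illustration.

```python
from statistics import mean, pstdev


def fit_threshold(training_errors, k=3.0):
    """Assumed rule: mean training error plus k standard deviations."""
    return mean(training_errors) + k * pstdev(training_errors)


def prediction_error(predicted, observed):
    """Mean squared error across all signals at one time step."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def detect(predicted_steps, observed_steps, threshold):
    """Return the indices of time steps whose error exceeds the threshold."""
    return [
        i
        for i, (p, o) in enumerate(zip(predicted_steps, observed_steps))
        if prediction_error(p, o) > threshold
    ]


# Hypothetical errors seen on normal data during training.
threshold = fit_threshold([0.010, 0.012, 0.009, 0.011, 0.008])

# Hypothetical predictions and observations for two signals over three steps.
predicted = [[1.0, 20.0], [1.0, 20.0], [1.0, 20.0]]
observed = [[1.05, 20.1], [0.95, 19.9], [1.0, 21.0]]
alerts = detect(predicted, observed, threshold)
```

In this toy data only the last step, where the second signal jumps by a full unit, exceeds the threshold.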

Anomalies can include changes to a signal’s amplitude, period, or synchronization phase between different signals.

MLAD can detect anomalies caused by:

  • Spoofed sensor values and commands;
  • Changes to the control logic (setpoints, PLC parameters);
  • Physical attacks on the equipment;
  • Equipment failure;
  • Changes to the external environment conditions;
  • Unusual material and process input parameters.

The first three points may be linked to attacks on the facility, in which case MLAD plays the role of an attack detector.

If anomalies related to the last three points are detected, MLAD is essentially used as a predictive monitoring tool.

Today, although there is much talk about the interpretability of machine learning algorithms, few solutions can provide interpretation. The MLAD technology not only detects but also interprets anomalies, i.e., provides specific information on things that have gone wrong. MLAD localizes the signal whose behavior showed the greatest deviation from normal behavior. Special emphasis is placed on early detection of anomalies: significant deviations are usually detected much earlier than the emergency shutdown (ESD) system is triggered.
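Localization of this kind can be sketched by ranking signals by their individual prediction error. The example assumes the signals have already been normalized to a comparable scale, which the article does not specify. The tag names and the helper `localize` are hypothetical.

```python
def localize(predicted, observed, tags, top=3):
    """Rank signals by squared prediction error, largest first.

    Assumes all signals are normalized to a comparable scale, so that
    errors on different tags can be compared directly.
    """
    errors = {tag: (p - o) ** 2 for tag, p, o in zip(tags, predicted, observed)}
    ranked = sorted(errors.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


# Hypothetical normalized values for three tags at the moment of an alert.
tags = ["feed_A_flow", "reactor_pressure", "reactor_temp"]
predicted = [0.50, 0.40, 0.60]
observed = [0.05, 0.45, 0.61]
culprits = localize(predicted, observed, tags, top=2)
```

Here the feed flow tag deviates most from its predicted value, so it is reported first.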

Example of MLAD operation

The MLAD technology is currently a pilot implementation. It is available on request, in test mode, to Kaspersky Industrial CyberSecurity customers who would like to try this functionality.

The MLAD module provides:

  • Reliable anomaly detection and localization;
  • Processing of thousands of different signals per second;
  • Storage and visualization of several years of history;
  • Online visualization for dozens of signals;
  • Re-training when standard conditions change;
  • GPU computing support.

The pilot version of MLAD is integrated with Kaspersky Industrial CyberSecurity for Networks – the Kaspersky Lab industrial network security solution. Kaspersky Industrial CyberSecurity for Networks performs comprehensive analysis of industrial protocols using Deep Packet Inspection (DPI) and provides MLAD with the industrial signal values it identifies.

The operation of MLAD can be demonstrated on the well-known chemical process model called the Tennessee Eastman Process (TEP).

The Tennessee Eastman process

In the 1990s, engineers at the Tennessee Eastman plant made a detailed mathematical model of a chemical industrial process available on the Internet. This was done primarily to refine various industrial process control models (including PLC logic, etc.).

The Tennessee Eastman Process (TEP) model includes four main units. The gaseous reactants react exothermically in the reactor. The products leave the reactor as vapors and are fed into the condenser and then into the vapor-liquid separator. Condensed components move to a product stripping column to remove remaining reactants. The process produces two products.

This is a chemical manufacturing process. However, such units are typical of many industrial environments. And, as we analyze anomalies in TEP, we can see many similarities with other production processes.

The demonstration stand

Based on the TEP model, we developed a Python mathematical model to simulate the physical processes in the system and a PLC program to implement the control logic for the physical model. To visualize the processes being simulated, we implemented a 3D TEP model and linked it with the generated physical model and PLC telemetry. To control the stand, we developed a dedicated iPad console that can be used to simulate a variety of cyberattack scenarios and perform comprehensive testing of MLAD algorithms.

The stand is deployed on one laptop computer and includes the Tennessee Eastman Process mathematical model, its 3D visualization, Kaspersky Industrial CyberSecurity and MLAD. A Schneider controller is used as a PLC. A switch is used to mirror traffic between the PLC and the mathematical model and send it to Kaspersky Industrial CyberSecurity. Kaspersky Industrial CyberSecurity interacts with MLAD.

The model has numerous parameters that we can track, including sensors and commands – a total of about 60 tags. In addition, business parameters are defined that enable the enterprise’s operating costs to be calculated (on an hourly basis). This helps to provide a comprehensive assessment of the damage from a hacker attack – the enterprise may suffer financial losses even if an attack doesn’t result in a serious incident (such as an explosion/disaster).

The video below, recorded on the stand, shows a simple data spoofing attack scenario. There are three reactant gases, and sensors show the flow of each gas into the reactor. The scenario involves spoofing the value of one tag, the reading for the flow of gas A, so that the controller receives information that the gas is not flowing at all. Consequently, the controller opens the valve to increase the gas flow and, because it keeps acting on spoofed sensor readings, eventually opens the valve completely. Three hours after the attack begins, the pressure in the reactor exceeds the threshold level.
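The mechanism behind this scenario can be illustrated with a toy control loop. This is not the TEP model or the stand's PLC logic, only a minimal sketch under simplifying assumptions: the flow equals the valve opening, and the controller adjusts the valve in proportion to the gap between the setpoint and the reported flow. All names and constants are hypothetical.

```python
def run_loop(steps, spoof_from=None, setpoint=0.5, gain=0.2):
    """Toy flow-control loop; returns the valve position after each step.

    The controller nudges the valve (0.0 closed .. 1.0 fully open) in
    proportion to the gap between the setpoint and the flow it is told about.
    From step `spoof_from` onward the reported flow is forced to zero.
    """
    valve = 0.5
    history = []
    for step in range(steps):
        true_flow = valve  # toy plant: flow equals valve opening
        spoofed = spoof_from is not None and step >= spoof_from
        reported_flow = 0.0 if spoofed else true_flow
        valve += gain * (setpoint - reported_flow)
        valve = min(1.0, max(0.0, valve))  # a valve cannot open past 100%
        history.append(valve)
    return history


normal = run_loop(steps=20)
attacked = run_loop(steps=20, spoof_from=5)
```

Without spoofing the valve stays at its operating point. Once the reported flow is forced to zero, the controller keeps opening the valve until it saturates, even though the real flow is rising the whole time.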

According to the scenario, the emergency shutdown (ESD) system is disabled for some reason. Had the ESD been enabled, it would have shut off the gas flow only at this stage – this marks the ESD time.

At the same time, it can be seen on the MLAD monitor that the readings began to grow and exceeded the threshold (with MLAD detecting an anomaly and sending an alert) very early in the simulation. According to the simulation, the difference between the anomaly being detected by MLAD and the ESD being triggered is 3 hours and 8 minutes. This is sufficient time for the operator to take the necessary action and to prevent an accident. This means that MLAD provides early anomaly detection.

When MLAD detects an anomaly, detailed information on the signals where the error is greatest is displayed on the monitor. The information provided enables the operator to conclude what is wrong and in which part of the system. In other words, MLAD provides anomaly interpretation.

Conclusion

Approaches to ICS protection based on analyzing anomalies began to emerge as machine learning technologies became widespread. These technologies have now evolved to a level where they can be used to develop telemetry-based protection of operational technology in industrial systems.

Machine learning algorithms can cover a much broader range of connections between industrial signals than a traditional rule-based expert protection system. In an expert system, rules are commonly generalized (desensitized) in order to make them applicable to a broad range of conditions. This results in late triggering of emergency shutdown (ESD) systems. A more precisely tuned system based on machine learning can respond to anomalous process changes much earlier.

When applying machine learning to telemetry under an industrial facility’s normal operating conditions, an equivalent of whitelisting can be developed – the result is a machine learning (ML) model that can recognize the ‘white’ behavior of industrial processes.

Approaches based on machine learning do not eliminate the need for expert systems but rather complement them – just as in the security industry, where signature-based detection methods are still used alongside heuristic analysis, machine learning and whitelisting.

Useful information:

  1. RNN-based Early Cyber-Attack Detection for the Tennessee Eastman Process. ICML 2017 Time Series Workshop, Sydney, Australia, 2017.
  2. Multivariate Industrial Time Series with Cyber-Attack Simulation: Fault Detection Using an LSTM-based Predictive Data Model. NIPS 2016 Time Series Workshop, Barcelona, Spain, 2016.
  3. ICS Anomaly Detection Panel

For any questions contact us at mlad[a]kaspersky.com

Authors
  • Head of Technology Research Department, Future Technologies