Machine learning and service assurance: How to focus on the disease instead of the symptoms
13 SEPTEMBER 2018
Are you feeling dizzy, weak in the arms, or having trouble seeing, speaking, or moving? A quick Google search of these symptoms reveals you may just have a common migraine, or you may be suffering from something more serious, such as multiple sclerosis or even a stroke. So, how do you determine the right course of treatment? The first step should be to trade in your Google search for a trained professional. The same applies to next-generation networks. Network symptoms – or alarms – don’t always tell the full story.
Service assurance of virtual and hybrid networks and service chains has become incredibly complex, to the point that automation is the only way forward. With the rise of virtualized, software-driven networks, service assurance is more decentralized, distributed across multiple NFV management layers. This decentralization creates more network alarms, making it harder to determine where the real problems reside. But all these disparate, embedded assurance functions still need to work together. Carriers are finding it increasingly difficult to understand which issues need to be prioritized and which are just ‘noise.’ To take advantage of the scalability, self-healing, and flexibility of NFV/SDN-based networks, communication service providers (CSPs) will need to redesign their legacy topology discovery technology. Integrating legacy technology with advanced machine learning capabilities will help service assurance teams proactively plan and maintain a healthy network.
For instance, service assurance teams can use machine learning capabilities to improve anomaly detection within the network. In this case, the aim is to identify problems at their onset, before they impact service. Service assurance teams can base outcomes on data from network performance or from alarms. For threshold management, teams can use another family of algorithms, which focus on the ‘value’ of certain metrics and can indicate when to apply exceptions to thresholds. For example, at the simplest level, if network bandwidth drops too low to support a certain service, the carrier can set a threshold that raises an alarm when this happens. But we can go a step further by using statistical techniques, such as calculating the standard deviation of a metric, for trend analysis, forecasting, and detecting abnormalities in the network. We can even pre-empt a situation that may result in poor service a few hours down the road. Some service providers have implemented these techniques for reporting purposes, but their value is limited when they are not part of an automated process. Today’s networks require not just real-time intelligence, but automated, real-time actions that ‘close the loop.’
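To make the difference between a fixed threshold and a statistical check concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any particular product: the function names, the bandwidth figures, and the choice of three standard deviations as the cut-off are all hypothetical.

```python
from statistics import mean, stdev


def static_threshold_alarm(bandwidth_mbps, minimum_mbps):
    """Raise an alarm when bandwidth falls below a fixed service minimum."""
    return bandwidth_mbps < minimum_mbps


def is_abnormal(history, value, num_stdevs=3.0):
    """Flag a value lying more than num_stdevs standard deviations
    away from the mean of recent history (cut-off is an assumption)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > num_stdevs * sigma


# Hypothetical recent bandwidth samples in Mbps (mean 100, stdev 2).
history = [100, 98, 102, 101, 99, 100, 97, 103]

fixed_alarm = static_threshold_alarm(45, minimum_mbps=50)  # below the floor
normal_reading = is_abnormal(history, 99)   # within normal variation
early_warning = is_abnormal(history, 70)    # far from the learned norm
```

Note that a reading of 70 Mbps would not trip the fixed 50 Mbps alarm, yet the statistical check flags it as unusual, which is the kind of early signal the paragraph above describes.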
Machine learning-based self-learning algorithms take us a step closer to closed-loop service assurance nirvana by allowing the network to autonomously maintain a healthy state. Self-learning algorithms let service providers create a baseline profile that automatically identifies when exceptions occur. This allows us to create adaptive thresholds. An example of where an adaptive threshold helps is a growing residential neighborhood that doubles in size in a relatively short amount of time, creating a lot more demand for wireless data. The network behavior changes to reflect the growing demand. A hard-coded threshold that limits bandwidth on a rules-based system would need to be reconfigured. But if a network is self-learning, it would become evident that the pattern of behavior has changed over time, and the system would automatically modify the current thresholds based on this ‘learned’ information. All the work is done for you. With machine learning algorithms, the thresholds adjust automatically to changes in the network, reducing the need for manual intervention and making the network more agile. And when millions of these decisions need to be made daily, automation becomes the only path forward.
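The neighborhood example can be sketched in a few lines of Python. This is one simple way to build an adaptive threshold, assuming an exponentially weighted moving average as the 'learned' baseline; the article does not say which algorithm is used, and the class name, smoothing factor, warm-up length, and traffic figures below are all hypothetical.

```python
class AdaptiveThreshold:
    """Learns a baseline with an exponentially weighted mean and variance,
    and flags values above mean + num_stdevs * stdev (all assumptions)."""

    def __init__(self, alpha=0.1, num_stdevs=3.0, warmup=20):
        self.alpha = alpha            # how quickly the baseline adapts
        self.num_stdevs = num_stdevs  # width of the tolerated band
        self.warmup = warmup          # samples to learn before flagging
        self.mean = None
        self.var = 0.0
        self.count = 0

    def upper_limit(self):
        return self.mean + self.num_stdevs * self.var ** 0.5

    def observe(self, value):
        """Return True if value exceeds the current limit, then learn from it."""
        if self.mean is None:
            self.mean = float(value)
            self.count = 1
            return False
        exceeded = self.count >= self.warmup and value > self.upper_limit()
        diff = value - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        self.count += 1
        return exceeded


monitor = AdaptiveThreshold()

# Stable demand around 100 units, with small swings.
for i in range(50):
    monitor.observe(105 if i % 2 else 95)
limit_before = monitor.upper_limit()

# The neighborhood grows: demand ramps from about 100 to about 200.
for i in range(200):
    monitor.observe(100 + 0.5 * i + (5 if i % 2 else -5))

# Demand settles at the new level of about 200.
for i in range(100):
    monitor.observe(205 if i % 2 else 195)
limit_after = monitor.upper_limit()

typical_flagged = monitor.observe(202)  # ordinary traffic at the new level
spike_flagged = monitor.observe(400)    # a genuine spike
```

A fixed limit set while demand was around 100 would fire constantly once demand reached 200, whereas the learned limit moves with the baseline. One design caveat: this sketch also learns from the spikes it flags, so a production system would likely want to exclude or down-weight confirmed anomalies.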
Machine learning algorithms not only help networks become more agile but can also help the Network Operations Center (NOC) to resolve issues more quickly. While the technology is complex, the goal is straightforward: to produce high-value predictions and recommendations resulting in better decisions and actions in real time. The approach is to quickly and automatically produce models that can analyze large, complex data sets and deliver faster, more accurate results. One way the technology does this is by learning from how the operations team in the NOC makes its decisions. The algorithms begin to understand which alarms are relevant, and which ones can be deprioritized. Alarms are automatically tagged to indicate how they were resolved in the past, enabling NOC operators to quickly see what requires their immediate attention.
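As a rough illustration of tagging alarms by how they were resolved in the past, the Python sketch below simply counts operator resolutions per alarm type. It is a deliberately simple stand-in for a trained model, not the method any vendor uses; the class, the alarm type names, and the resolution labels are all hypothetical.

```python
from collections import Counter, defaultdict


class AlarmHistory:
    """Remembers how operators resolved each alarm type and uses the
    counts to tag new alarms (a frequency count, not a trained model)."""

    def __init__(self):
        self._resolutions = defaultdict(Counter)

    def record(self, alarm_type, resolution):
        """Store one operator decision for an alarm type."""
        self._resolutions[alarm_type][resolution] += 1

    def tag(self, alarm_type):
        """Return the most frequent past resolution, or 'unknown'."""
        counts = self._resolutions.get(alarm_type)
        if not counts:
            return "unknown"
        return counts.most_common(1)[0][0]

    def actioned_ratio(self, alarm_type):
        """Fraction of past occurrences that operators did not ignore."""
        counts = self._resolutions.get(alarm_type)
        if not counts:
            return None
        total = sum(counts.values())
        return (total - counts["ignored"]) / total


noc = AlarmHistory()

# Hypothetical operator decisions gathered from past tickets.
for _ in range(3):
    noc.record("LINK_DOWN", "dispatched technician")
noc.record("LINK_DOWN", "ignored")
for _ in range(5):
    noc.record("FAN_SPEED_WARN", "ignored")
noc.record("FAN_SPEED_WARN", "restarted unit")

link_tag = noc.tag("LINK_DOWN")
fan_tag = noc.tag("FAN_SPEED_WARN")
new_tag = noc.tag("NEVER_SEEN_BEFORE")
link_priority = noc.actioned_ratio("LINK_DOWN")
```

Sorting incoming alarms by `actioned_ratio` would push alarm types that operators usually act on to the top of the queue and let habitually ignored ones sink, which is the prioritization effect described above.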
The concept of machine learning has been around since the early-to-mid 1950s, with early work by pioneers such as Alan Turing. It has quietly experienced a transformation in the past few years. The addition of machine learning algorithms will help service assurance teams to maintain a healthy network by predicting and neutralizing potential issues and by self-adjusting when changes in the network require it.