The Automated Operations Journey: The Practical Benefits of Machine Learning in Root Cause Analysis
AI and machine learning (ML) are a natural fit for optimizing and automating some of today’s most complex telecom network management challenges. But are we witnessing yet another hype cycle, or is the industry really seeing tangible results? Somewhat surprisingly, we are doing better than one might expect. According to Forbes Magazine, the telecom industry leads the way in return on investment when it comes to machine learning, and our findings have been just as positive.
In the past, the growing volume of data was an asset for telcos, but slow information processing and limited data availability presented a challenge. Recent advancements in AI and applied ML are a boon to service providers because they can be used to automatically identify hidden patterns and trends in network data. Now, with AI and ML, telcos can analyze bigger, more complex datasets and deliver faster, more accurate results, even on a massive scale. This has significant ramifications for network optimization and service assurance.
Machine Learning and Fault Management: Lessons Learned
The use of AI and ML is essential on the path towards network transformation, especially in the lead-up to 5G. The ability to automate and optimize network fault management is a critical requirement for technology advancements like NFV-SDN, SON, and 5G network slicing. When we look at the example of root cause analysis (RCA), a very common telecom operations use case, it is clear that networks are becoming too complex and dynamic to continue manually analyzing and tracing the root cause of each of the hundreds of thousands of alarms that typically hit a large network operations center every single day.
Anil Rao, Principal Analyst with Analysys Mason, states that “The ability to allow software programs and algorithms to learn insights and relationships by applying ML techniques means analytics can be applied to human-intensive operational use cases such as complex root-cause analysis routines.”
In the long term, it’s important to evolve towards a vision where most fault analysis is fully automated, and manual processes are either eliminated entirely or reduced to only truly exceptional cases. Meanwhile, the industry is taking steps to significantly improve current manual processes by augmenting them with AI and ML-driven capabilities. For the past few years, TEOCO has been enhancing its existing fault management solutions with new AI and machine learning capabilities designed to improve the detection and management of network faults and improve efficiencies in both the network operations center (NOC) and the service operations center (SOC). Now, with several live deployments and tangible results under our belts, we’re ready to share some early findings and lessons learned.
Keeping it in the ‘Family’
One of the key areas of research in our work on ML-based root cause analysis (ML-RCA) is the ability to automatically detect parent/child relationships among network alarms. When a fault occurs in a telecom network (let’s call it the primary or parent fault), it almost always sets off a cascade of other faults from various network layers, creating a domino effect of ‘child’ faults that can quickly multiply and escalate to the point where it becomes almost impossible to manually determine where the problem originated. The effort to untangle these parent/child fault clusters is called root cause analysis, or RCA. In the absence of automated correlation engines, NOC engineers typically spend several minutes, or even hours, correlating which faults and alarms are related, and then analyzing them to determine the root cause. This is typically done manually or through traditional rules-based analysis. This approach is time-consuming and resource-intensive, and in many instances it leads to the wrong root cause and redundant trouble tickets. Today’s fast-paced networks required a new approach, so we began a feasibility study that incorporated machine learning techniques into root cause analysis.
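To make the parent/child idea concrete, here is a minimal sketch of one way such relationships could be learned from alarm history. It is not TEOCO's ML-RCA algorithm, which the article does not describe. It uses a plain co-occurrence heuristic and assumes that a parent alarm arrives before its children within a short time window. The `Alarm` class, the alarm type names, the 60-second window, and the `min_support` threshold are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    timestamp: float  # seconds since an arbitrary reference point
    alarm_type: str   # hypothetical alarm type label


def learn_parent_child_pairs(history, window_s=60.0, min_support=2):
    """Return (parent_type, child_type) pairs seen at least min_support times.

    A pair is counted when an alarm of child_type follows an alarm of
    parent_type within window_s seconds. This is a simple co-occurrence
    count, used only to illustrate learning relationships from history.
    """
    alarms = sorted(history, key=lambda a: a.timestamp)
    counts = Counter()
    for i, parent in enumerate(alarms):
        seen = set()  # count each child type once per parent occurrence
        for child in alarms[i + 1:]:
            if child.timestamp - parent.timestamp > window_s:
                break
            if child.alarm_type != parent.alarm_type and child.alarm_type not in seen:
                counts[(parent.alarm_type, child.alarm_type)] += 1
                seen.add(child.alarm_type)
    return {pair for pair, n in counts.items() if n >= min_support}


def correlate(live_alarms, pairs, window_s=60.0):
    """Link each alarm to the most recent earlier alarm that is a learned parent.

    Returns the time-ordered alarms and a dict mapping each alarm index to
    its parent index, or None when the alarm is treated as a root.
    """
    alarms = sorted(live_alarms, key=lambda a: a.timestamp)
    parent_of = {}
    for i, alarm in enumerate(alarms):
        parent_of[i] = None
        for j in range(i - 1, -1, -1):
            if alarm.timestamp - alarms[j].timestamp > window_s:
                break
            if (alarms[j].alarm_type, alarm.alarm_type) in pairs:
                parent_of[i] = j
                break
    return alarms, parent_of


# Hypothetical history: the same cascade occurs twice, plus one unrelated alarm.
history = [
    Alarm(0, "LINK_DOWN"), Alarm(5, "BGP_DOWN"), Alarm(10, "SERVICE_DEGRADED"),
    Alarm(500, "FAN_FAIL"),
    Alarm(1000, "LINK_DOWN"), Alarm(1004, "BGP_DOWN"), Alarm(1012, "SERVICE_DEGRADED"),
]
pairs = learn_parent_child_pairs(history)

live = [
    Alarm(2000, "LINK_DOWN"), Alarm(2003, "BGP_DOWN"),
    Alarm(2008, "SERVICE_DEGRADED"), Alarm(2020, "FAN_FAIL"),
]
ordered, parent_of = correlate(live, pairs)
roots = [ordered[i].alarm_type for i, p in parent_of.items() if p is None]
print("Alarms shown as roots:", roots)
```

In this toy data, the three cascade alarms collapse under a single root and the unrelated alarm stays separate, so an engineer would see two items instead of four. A production system would also need to handle out-of-order arrival, network topology, and far noisier data.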
NOC engineers need the tools to manage RCA in a more responsive, automated way.
There’s an old saying that goes, “when everything is urgent, nothing is urgent.” That is the challenge when it comes to managing network faults. There are so many alarms that they can become white noise to the NOC engineer. TEOCO’s product team looked at specific advancements in machine learning that could be applied to prioritizing alarms, reducing the ‘noise’ and improving root cause analysis. When TEOCO implemented ML-based RCA techniques in our HELIX Fault Management solution, we saw significant improvements across all four target objectives:
The Goal: Improve alarm suppression rates. In today’s complex networks, the ability to analyze and prioritize alarms in a fault management system is critical for NOC engineers to effectively do their job.
The Results: The number of active alarms presented to the NOC was reduced by automatically creating parent/child relationships. Before activating ML-RCA, just under 5 percent of all alarms were correlated as child alarms (this included both manual and rules-based correlations). After activating ML-RCA, there was a 10x improvement in the identification of child alarms. This metric is expected to increase even further over time as the algorithm “learns” to detect more network failures.
The Goal: Reduce the need for manual alarm correlations. The correlation process, which means deciding which alarms are part of the same ‘parent’ fault and linking them to a single trouble ticket, is a time-consuming, manual process that takes up valuable resources.
The Results: Initial results showed a ~75 percent reduction in the number of manual correlations performed by NOC engineers. Both the number of manual correlations and the time spent making them decreased significantly after ML-RCA was activated.
The Goal: Reduce time required to correlate alarms. Too much time and effort are required to determine which alarms are interrelated and stem from the initial ‘parent’ fault. The average time spent manually correlating alarms can be nearly 90 minutes per issue.
The Results: After ML-RCA was activated, the average alarm correlation time decreased by nearly 95 percent, making correlation close to instantaneous. While manual correlations can be performed relatively quickly for simple, known and repetitive cases, they still take longer than the automatic correlation done by ML-RCA. The most significant reduction, however, is in complex cases that demand more time for evaluation and assessment.
The Goal: Reduce time for trouble ticket creation. Before a trouble ticket can even be issued, each fault needs to be identified, then correlated or dismissed. For carriers, it typically takes 60 to 90 minutes from the moment a fault occurs in the system until a trouble ticket is created and entered so the fault can be repaired. Our goal was to significantly reduce this time so that the repair effort can start much sooner and be performed more efficiently.
The Results: With ML-RCA, we were able to reduce the time it took to create trouble tickets by over 86 percent, and this was achieved just by improving fault correlations. By reducing alarm analysis time, trouble tickets can be opened substantially faster. Engineers and technicians received the issue sooner and were able to minimize the service impact on subscribers.
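As a rough illustration of what these percentages could mean in practice, the sketch below applies them to the baselines quoted above. Several inputs are assumptions rather than reported figures: the daily volume of 100,000 alarms is hypothetical, the "10x improvement" is read as the child-alarm share rising from about 5 percent to about 50 percent, and the 95 and 86 percent reductions are assumed to apply to the 90-minute and 60 to 90 minute baselines.

```python
def active_alarms(total_alarms, child_share_pct):
    """Alarms still shown to the NOC once child alarms are grouped under parents."""
    return total_alarms * (100 - child_share_pct) / 100


def remaining_minutes(baseline_minutes, reduction_pct):
    """Time left after cutting a baseline duration by reduction_pct percent."""
    return baseline_minutes * (100 - reduction_pct) / 100


# Hypothetical daily volume; the article only says "hundreds of thousands".
TOTAL_ALARMS = 100_000

before = active_alarms(TOTAL_ALARMS, 5)   # about 5% child alarms before ML-RCA
after = active_alarms(TOTAL_ALARMS, 50)   # assumed reading of "10x": about 50%

print(f"Active alarms before: {before:,.0f}, after: {after:,.0f}")
print(f"Correlation time: 90 min -> {remaining_minutes(90, 95):.1f} min")
print(
    "Ticket creation: 60-90 min -> "
    f"{remaining_minutes(60, 86):.1f}-{remaining_minutes(90, 86):.1f} min"
)
```

Under these assumptions, the active alarm count would drop from roughly 95,000 to roughly 50,000, average correlation time would fall to about four and a half minutes, and ticket creation would take somewhere between about 8 and 13 minutes. Actual figures depend on each network and on how the reported percentages were measured.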
Machine Learning Shows Drastic Improvements in RCA
In addition to the impressive results achieved by applying the model’s “knowledge” to real-time alarm streams, our customers also find a lot of value in exploring the model’s findings to identify trends, nontrivial network behavior and dependencies, and intermittent or recurring failures, or ‘flash faults’. We discovered that these ‘flash faults’ typically go unseen or are ignored, yet they still cause havoc in the network and create customer experience issues that frequently go unnoticed. In these instances, it is only with ML-RCA that these parent faults can be detected and repaired to reduce future failures in other areas of the network. This is proof that what we don’t see can hurt us, and that there is real value in using machine learning in areas like root cause analysis.
Helix 10.2 is the latest version of TEOCO’s flagship service assurance suite of solutions. It leverages our recent investments in machine learning algorithms and artificial intelligence to improve network failure predictions and root cause analysis, drive proactive assurance, and enable advanced automation.