Service Assurance Automation: Advanced Network Operations in the 5G Era
Will 2018 be remembered as the year the telecom industry got serious about automation – or is it just more talk? Some say there’s good reason for caution, believing “to err is human; to propagate errors massively at scale is automation.” But naysayers aside, almost everyone agrees that today’s networks are becoming too complex to maintain the status quo.
What will 2019 bring? As part of TEOCO’s Expert Voices series, we get a better understanding of what’s ahead – and what’s at stake – when we talk with TEOCO’s service assurance expert.
Q: When it comes to service assurance, what problems does automation help to resolve – and how?
A: Automation in service assurance helps resolve several problems. As you mentioned, networks are becoming increasingly complex from a technology perspective, with more network elements at play, and cross-influences between these elements. Over time this has led to inefficient operational processes – with too many situations that require human intervention, collaboration, and management approval.
Additionally, customer needs and expectations are changing, which means that changes and rollouts of new services that rely on network resources are more frequent than ever before. The dynamic nature and complexity of the network make it especially difficult to trace the root cause of complex issues. A problem in one place in the network can quickly create a ripple effect, with more problems piling up elsewhere in the network and obscuring the source. This is especially true of one-off problems, or those that don’t occur very often.
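To make the ripple effect concrete, here is a minimal sketch of how a tool might trace a burst of alarms back to a probable source. It assumes a simple dependency map in which each element has one upstream element; the `UPSTREAM` map, the element names, and the `probable_root_causes` function are all hypothetical and are not taken from any TEOCO product.

```python
from typing import Dict, Optional, Set

# Hypothetical topology: each element maps to the upstream element it depends on.
UPSTREAM: Dict[str, Optional[str]] = {
    "cell-1": "agg-1",
    "cell-2": "agg-1",
    "agg-1": "core-1",
    "core-1": None,
}


def probable_root_causes(alarming: Set[str]) -> Set[str]:
    """Return the highest alarming ancestor for each alarming element.

    An element whose upstream chain contains no other alarming element
    is treated as its own probable root cause.
    """
    roots = set()
    for element in alarming:
        top = element
        parent = UPSTREAM.get(element)
        # Walk upstream, remembering the highest element that is also alarming.
        while parent is not None:
            if parent in alarming:
                top = parent
            parent = UPSTREAM.get(parent)
        roots.add(top)
    return roots


# Two cells and their shared aggregation node are alarming at the same time.
print(probable_root_causes({"cell-1", "cell-2", "agg-1"}))
```

A real network rarely has such a tidy tree, which is one reason learned correlation is attractive: it does not depend on a hand-maintained map like this one.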
As network faults begin to escalate, there are only a limited number of network operations center (NOC) and service operations center (SOC) experts on hand. They become the bottlenecks, and their hands are always full. This can quickly lead to alert overload and fatigue. Everyone has their limits, and teams can quickly find themselves with too many alarms to manage if they don’t have the proper analytics and automation tools to sort them out and prioritize.
Q: Automation is attractive from an operational perspective, but what are the key applications that should be automated?
A: At TEOCO we’ve identified quite a few use cases that can be automated, especially with regard to how faults and alarms are managed. In general, we’ve broken this down into four application categories.
The first is fault monitoring. In this area carriers can use automation to help with alarm normalization, enrichment, and escalation.
The second is ticketing. As soon as an alarm is triggered and escalated, the ticketing system can automatically open a trouble ticket and start orchestrating the corresponding flow of actions, while providing the means to fully document the problem and the steps to solve it.
The third is correlation, which is all about connecting the dots. For this, we use machine learning techniques to analyze how faults impact services and customers by automatically filtering and correlating alarms across different domains. We can then use machine learning for root-cause analysis, to automatically identify the source of the problem. And, in the future, we’re looking at being able to automatically predict and detect future faults and network failures, to proactively fix problems before they even happen.
The fourth and final category is event resolution, which involves automatically correcting faults and restoring function. This is the real ‘Nirvana’ of service assurance – it’s what we call closing the loop. Not only are we automatically troubleshooting and diagnosing the problems, but we’re also automatically creating alert notifications and trouble tickets, and then restoring function. In effect, we’re creating an automated NOC.
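As a rough illustration of how these categories could chain together, the sketch below passes a raw alarm through normalization, enrichment, escalation, ticketing, and rule-based resolution. It leaves out correlation for brevity. Everything in it is an assumption made for illustration: the severity map, the inventory, the resolution rules, and all function names are hypothetical and do not describe TEOCO’s implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical vendor-specific severity labels mapped to one common scale.
SEVERITY_MAP = {"crit": "critical", "critical": "critical",
                "maj": "major", "major": "major",
                "min": "minor", "minor": "minor"}

# Hypothetical inventory used for enrichment: element -> site and services.
INVENTORY = {
    "cell-017": {"site": "Site-A", "services": ["mobile-data"]},
    "rtr-003": {"site": "Core-1", "services": ["mobile-data", "voice"]},
}

# Hypothetical automation rules: alarm type -> corrective action name.
RESOLUTION_RULES = {"LINK_DOWN": "reset_port"}


@dataclass
class Alarm:
    element: str
    alarm_type: str
    severity: str
    site: str = "unknown"
    services: List[str] = field(default_factory=list)


@dataclass
class Ticket:
    ticket_id: int
    alarm: Alarm
    log: List[str] = field(default_factory=list)
    status: str = "open"


def normalize(raw: dict) -> Alarm:
    """Fault monitoring: map a raw alarm onto a common format."""
    severity = SEVERITY_MAP.get(raw["sev"].strip().lower(), "warning")
    return Alarm(raw["ne"], raw["type"].strip().upper(), severity)


def enrich(alarm: Alarm) -> Alarm:
    """Fault monitoring: attach site and service context from inventory."""
    info = INVENTORY.get(alarm.element, {})
    alarm.site = info.get("site", "unknown")
    alarm.services = list(info.get("services", []))
    return alarm


def should_escalate(alarm: Alarm) -> bool:
    """Escalate only severe alarms that affect at least one service."""
    return alarm.severity in {"critical", "major"} and bool(alarm.services)


def open_ticket(alarm: Alarm, ticket_id: int) -> Ticket:
    """Ticketing: open a trouble ticket and start documenting the problem."""
    ticket = Ticket(ticket_id, alarm)
    ticket.log.append(f"opened for {alarm.alarm_type} on {alarm.element}")
    return ticket


def resolve(ticket: Ticket) -> Ticket:
    """Event resolution: apply a rule if one exists, else hand to a human."""
    action = RESOLUTION_RULES.get(ticket.alarm.alarm_type)
    if action is None:
        ticket.log.append("no automation rule; assigned to NOC engineer")
        ticket.status = "needs_human"
    else:
        ticket.log.append(f"executed {action}")
        ticket.status = "resolved"
    return ticket


def process(raw: dict, ticket_id: int) -> Optional[Ticket]:
    """Run one raw alarm through the loop; return None if not escalated."""
    alarm = enrich(normalize(raw))
    if not should_escalate(alarm):
        return None
    return resolve(open_ticket(alarm, ticket_id))


ticket = process({"ne": "rtr-003", "sev": "CRIT", "type": "link_down"}, 1)
print(ticket.status, ticket.log)
```

Alarms with no matching rule end up in a `needs_human` state rather than being acted on, which leaves room for engineers to keep manual control over the flows they choose.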
Q: In your opinion, are there any areas that should NOT be automated?
A: This is a highly personal decision and is dependent on customer requirements – and comfort level. In some cases, NOC personnel prefer manually controlling certain alarm flows. For example, they may want to more closely examine issues that could impact critical network resources or specific VIP customers, before acting.
Q: Why has it taken so long for automation to be more widely adopted?
A: One reason has been the lack of available automation tools. Only in the last couple of years have we seen the types of advanced analytics tools that could facilitate full automation of many NOC and SOC processes.
Another inhibitor has been concern about job cuts among the carriers’ network and service operations employees. On the one hand, these folks are expected to embrace automation; on the other hand, some of them realize they might lose their jobs as a result. Also, automation involves changes to existing processes and requires setting up detailed automation rules – both of which take time to adjust and deploy.
Given that most communications service providers have been focused on rolling out new services and other revenue-generating activities, while trying to manage massive technology shifts such as Network Functions Virtualization (NFV) and 5G, automation hasn’t always been the top priority. But this is gradually changing. In 5G networks, for example, “closed-loop” automation is required between service assurance and the network orchestrator. There’s simply too much activity for humans to manage without it.
The good news is that TEOCO provides a service assurance solution with all necessary capabilities and standard APIs that enable such closed-loop automation.
Q: Does automation impact the human skill set needed for a NOC?
A: It really comes down to this: automation enables NOC engineers to allocate more of their time to advanced tasks and issues that require deeper expertise. It facilitates “smarter” work. Instead of focusing on executing routine manual operational processes, the NOC engineers can automate the common processes, investigate issues that are exceptional, and prevent additional problems. In other words, with automation, NOC teams can do more and provide better service with fewer people than without automation.
Q: How is TEOCO investing in helping its customers automate their networks?
A: TEOCO has been investing in automation tools for several years. Our Helix Service Assurance solution is an integrated collection of products with automated alarm processes. For example, our machine learning root cause analysis automatically identifies the root cause of complex problems, regardless of network topology. Another example is the Helix Screener module, a tool that automatically prioritizes and filters the stream of alarms. We also have the FaultPro module, which provides automatic and semi-automatic fault corrections with flexible, pre-defined rules. In addition, we can automatically predict failures, so that NOC experts can focus their attention on emerging problems before they escalate and become significant.
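For readers who want a feel for what prioritizing and filtering an alarm stream means in practice, here is a generic sketch. It is not the Helix Screener logic, which the interview does not describe. The `screen` function, the severity ranking, and the alarm fields are hypothetical choices made for this example.

```python
from collections import Counter

# Hypothetical ranking: a lower number means a more urgent alarm.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "warning": 3}


def screen(alarms, min_severity="major"):
    """Drop low-severity alarms, collapse repeats, and sort by urgency."""
    cutoff = SEVERITY_RANK[min_severity]
    # Count how often each (element, type) pair appears in the stream.
    counts = Counter((a["element"], a["type"]) for a in alarms)
    seen = set()
    kept = []
    for alarm in alarms:
        key = (alarm["element"], alarm["type"])
        if key in seen or SEVERITY_RANK[alarm["severity"]] > cutoff:
            continue  # a repeat already kept, or below the severity cutoff
        seen.add(key)
        kept.append({**alarm, "repeats": counts[key]})
    # Most severe first; within a severity, most frequently repeated first.
    return sorted(kept, key=lambda a: (SEVERITY_RANK[a["severity"]], -a["repeats"]))


stream = [
    {"element": "rtr-1", "type": "LINK_DOWN", "severity": "major"},
    {"element": "rtr-1", "type": "LINK_DOWN", "severity": "major"},
    {"element": "cell-9", "type": "HIGH_TEMP", "severity": "minor"},
    {"element": "core-2", "type": "POWER_FAIL", "severity": "critical"},
]
for alarm in screen(stream):
    print(alarm["element"], alarm["severity"], alarm["repeats"])
```

Even a simple filter like this turns four raw alarms into two actionable items, which hints at how screening can reduce alert fatigue at scale.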
Q: Clearly there are many benefits of automation – can carriers survive without it?
A: Most carriers already view automation as a strategic goal that delivers, or will deliver, significant cost savings. For instance, Germany’s Deutsche Telekom is targeting $1.8 billion USD in annual indirect cost savings by 2021, thanks to automation.
But in fact, it is about much more than cost savings. Automation is about the ability to provide better service to customers by predicting and preventing potential issues in advance and resolving those issues more quickly.