Part 2: Service Assurance and Automation – Defining and Delivering ROI – Key Operator Concerns
In this interview, Dima Alkin, TEOCO’s Vice President of Service Assurance, discusses ways in which service providers are achieving an ROI from service assurance automation efforts.
Q) When it comes to using ML and AI in service assurance, what are operators most concerned about? What are their goals and objectives?
Dima: Operational efficiency, cost savings and network automation are the three key areas of focus when it comes to leveraging ML and AI for network management and assurance. But the way to address them differs depending on each operator’s starting point and desired outcomes. The big separation that we see again and again is between those that are building entirely new “greenfield” networks and operators with existing networks.
Most service providers globally are trying to optimize whatever network and processes they already have in place. Some have networks that are more software driven, some are less software driven. But they are all looking to increase their level of OSS/BSS automation – including network management and service assurance.
At the same time, we are also seeing a new generation of operators coming into the market, like Rakuten, Dish and others. They don’t have a legacy network to manage, so they can focus on automating as much as possible from day one. This also means that they can experiment much more and push the envelope in terms of ML and AI. Instead of building manual or semi-automated processes and then moving towards automation, greenfield operators can design their networks and OSS layers this way from day one.
These really are two different discussions – depending on whether the network is new or existing. Even though the goals may be the same, the path operators need to take to get there will be vastly different. We are helping to automate service assurance processes for customers of all types. Even when the project has nothing to do with 5G and is focused on automating existing processes, we can still achieve great results.
Q) The issue of network and service orchestration is a major focus within our industry right now. How do you see analytics playing a role?
Dima: Today, there is a lack of mature standards on this issue. There is no agreed framework for how analytics should be used in network management as part of this value chain. While there are many industry initiatives that we actively follow and support, like TM Forum’s maturity model and other blueprints, there is no industry-wide framework for how analytics and ML/AI should be applied to the various business and operational processes. Most of these processes already have some flavor of analytics built in, but it remains unclear how these analytical capabilities complement each other in a meaningful way.
Q) What amount of subject matter expertise is required to develop the algorithms that drive real value?
Dima: From our experience, in a carrier environment, you cannot just put a generic analytics layer on top of the data and expect it to really drive value. It requires deep subject matter expertise and the detailed knowledge of the use cases that telecom operators require, as well as practical understanding of the harsh reality of the relevant data availability and quality.
Like with any other ML application, it requires a lot of trial and error. Over time we’ve identified the right set of algorithms that allow us to leverage our insights across the board in different carrier environments. At the same time, these algorithms and their fine-tuning are constantly evolving because we always keep learning.
At the end of the day, we can replicate the same success stories in more than one environment. The operator use cases are similar and the desired outcome is often the same, but each customer is unique, so optimal performance comes from fine-tuning the algorithms for each environment, and that’s what we are focused on.
Q) What does it take to achieve the most value?
Dima: The goal of any analytics initiative should always be more than a nice dashboard and a few reports. Operators must first understand what they are trying to achieve. It’s particularly important to align internally so that everyone agrees on the intended outcomes, especially as the industry moves towards closed loops where decisions will be made automatically. The real work isn’t the analytics; it comes when you need to decide what to do with the analytics, how to operationalize the outcomes and make them part of your ongoing processes. It’s not merely a one-time exercise.
Q) Have there been improvements in what you’ve been able to deliver, in terms of value, over the last year and a half or so?
Dima: We have developed capabilities that are ahead of the market. My observation is that the generic analytics platforms that aim to do everything for everyone are not at the same level of maturity when it comes to the telecom industry. The best outcomes are achieved use case by use case, department by department, and our results support this observation.
Helix is TEOCO’s service assurance platform, and Helix Analytics is an analytics layer that we’ve been developing over the past few years. It uses machine learning to fully automate the root cause analysis of network faults, and it has achieved amazing results. What is especially exciting is that we can use this analytics layer on top of other service assurance solutions. It doesn’t have to be ours.
Q) Why is this an important differentiator? Why wouldn’t operators just switch to a more modern service assurance solution?
Dima: The big difference is that we are not forcing operators to rip and replace the fault and performance management solutions they already have in place. We have invested heavily in developing an analytics layer that is decoupled from our own application. Operators can keep their existing solution in place, and Helix Analytics becomes an analytical layer on top. By taking this approach, operators have a path to modernize their service assurance solution by transitioning to our full Helix platform if and when they desire, but they are not forced to do it in order to enjoy the value of our analytics capabilities. By contrast, an operator that chooses a more generic analytics layer has no real path to modernizing its service assurance on that same platform.
Q) From a technical standpoint, how is this enabled?
Dima: We’ve been able to do this through the evolution of our software architecture, which has moved towards more decoupled components, a focused investment in APIs, and the use of open source Apache software; specifically, Kafka in this case.
Kafka is an open source message bus. It has become the de facto standard in the industry for exchanging live streaming data, including network events, alarms, and telemetry. With Kafka, integration timelines, and the cost and effort associated with them, go down dramatically.
Our Helix Analytics solution can run on premises or over the cloud. If it is the latter, this means we do not need to deploy a single piece of hardware or software in the operator’s environment. They can simply send us their alarms in real time, we provide the machine learning-driven root cause analysis and alarm correlations, then we send back clusters of alarms which are then used for automated trouble-ticketing.
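To make the flow described above more concrete (alarms in, clusters of related alarms out, one trouble ticket per cluster), here is a minimal Python sketch. It is not the Helix Analytics implementation, which the interview describes as machine learning-driven. It substitutes a simple stand-in heuristic: alarms are grouped when they occur close together in time on the same or directly linked devices, and the earliest alarm in each group is treated as the root-cause candidate. The `Alarm` fields, the `links` topology, the 60-second window and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    # Hypothetical alarm record; real alarm feeds carry many more fields.
    alarm_id: str
    device: str
    timestamp: float  # seconds since an arbitrary reference point


def cluster_alarms(alarms, links, window_s=60.0):
    """Group alarms raised close in time on the same or linked devices.

    A simple heuristic stand-in for ML-driven correlation (assumption).
    """
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Union-find over alarm indices, so related alarms merge transitively.
    parent = list(range(len(alarms)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i, a in enumerate(alarms):
        for j in range(i + 1, len(alarms)):
            b = alarms[j]
            close_in_time = abs(a.timestamp - b.timestamp) <= window_s
            related = a.device == b.device or b.device in neighbours[a.device]
            if close_in_time and related:
                union(i, j)

    clusters = defaultdict(list)
    for i, alarm in enumerate(alarms):
        clusters[find(i)].append(alarm)
    return list(clusters.values())


def open_tickets(clusters):
    """Create one ticket per cluster; the earliest alarm is a naive root-cause guess."""
    tickets = []
    for cluster in clusters:
        first = min(cluster, key=lambda alarm: alarm.timestamp)
        tickets.append({"root_cause": first.device, "alarm_count": len(cluster)})
    return tickets


# Illustrative data only: one router failure cascading to two switches,
# plus an unrelated alarm several minutes later.
demo_alarms = [
    Alarm("a1", "router-1", 0.0),
    Alarm("a2", "switch-1", 5.0),
    Alarm("a3", "switch-2", 8.0),
    Alarm("a4", "cell-9", 500.0),
]
demo_links = [("router-1", "switch-1"), ("router-1", "switch-2")]

for ticket in open_tickets(cluster_alarms(demo_alarms, demo_links)):
    print(ticket)
```

With this sample data, the three alarms caused by the router failure collapse into a single ticket and the unrelated alarm gets its own, so four alarms become two tickets. The pairwise comparison is quadratic in the number of alarms, which is fine for a sketch but would need a windowed or indexed approach at real alarm volumes.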
Q) Why would an operator go through this complex exercise?
Dima: One possible outcome of Helix Analytics is dramatically fewer trouble tickets for the operator. Fewer trouble tickets mean lower mean-time-to-resolution (MTTR) because you get to the root cause much faster.
A lot of redundant work goes into investigating network alarms, and this is what our analytics layer helps to mitigate. For example, let’s say there is a failure in the network and this failure has generated thousands of alarms. If an operator performs a traditional rules-based alarm correlation to detect the problem, they may end up with a dozen trouble tickets describing the same problem. At first glance this looks great, since thousands of alarms have been reduced to a dozen tickets, but in our opinion that’s not good enough. It may take another several hours of manual investigation work to get to the right root cause and start fixing the problem.
Q) What is the ROI of machine learning in relation to service assurance?
Dima: There are many examples of measurable outcomes from using machine learning for service assurance – both in financial and operational terms. But if we build on my previous example of automated root-cause analysis, with fewer redundant trouble tickets you’re working on fewer problems.
A significant reduction in trouble tickets can equate to a reduction of hundreds of thousands of man-hours, which translates into millions of dollars in annual OPEX savings. This results in immediate savings through eliminated waste and improved efficiency. It also directly translates into lower MTTR, decreased network downtime and better service quality.
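The arithmetic behind a claim like this can be sketched in a few lines of Python. Every input below is a hypothetical placeholder chosen only to show how the calculation works: the annual ticket volume, the reduction rate, the handling time per ticket and the hourly cost are assumptions, not TEOCO or operator figures.

```python
def estimate_annual_savings(tickets_per_year, reduction, hours_per_ticket, cost_per_hour):
    """Return (hours_saved, dollars_saved) for a given drop in ticket volume."""
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be a fraction between 0 and 1")
    tickets_avoided = tickets_per_year * reduction
    hours_saved = tickets_avoided * hours_per_ticket
    dollars_saved = hours_saved * cost_per_hour
    return hours_saved, dollars_saved


# Hypothetical inputs, for illustration only.
hours, dollars = estimate_annual_savings(
    tickets_per_year=400_000,  # assumed annual ticket volume
    reduction=0.5,             # assumed 50% fewer tickets
    hours_per_ticket=1.5,      # assumed average handling time
    cost_per_hour=40.0,        # assumed fully loaded labor cost in dollars
)
print(f"Hours saved per year: {hours:,.0f}")
print(f"OPEX saved per year: ${dollars:,.0f}")
```

Under these assumed inputs the result is 300,000 hours and $12 million per year. The model counts only handling time, so it leaves out the downtime and service-quality effects mentioned above.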
It’s important to mention that we are not just talking about marginal improvements – we have cases where we have helped to deliver more than a 50% reduction in trouble ticket volumes. We have also achieved a comparable increase in trouble ticket accuracy, which has led to tens of millions of dollars in annual savings.
Q) Are there any advantages from a more forward-looking perspective? Will this help to prepare an operator’s network for the future?
Dima: The notion of decoupling the analytics layer from our service assurance solution is exactly along those lines. It enables service providers to deploy their solutions in a more decoupled manner with each component serving different needs and delivering different outcomes.
In a 5G-driven environment, we will be surrounded by many new network and service delivery chain components that will need to be monitored and managed. Some parts of service assurance will need to expand to the edge. Maybe today, collectively as an industry, we cannot define exactly what that is going to look like, but we are putting the infrastructure in place to be able to support those use cases in a microservices-oriented environment through our service assurance application.
Q) Is there a key takeaway you’d like to share as we wrap up our discussion?
Dima: At the end of the day, we are focused on improving the way that operators look at their network and enhancing their productivity so they can do more with less. We are not trying to be everything for everyone when it comes to machine learning, AI and automation. We are focusing on specific ROI-driven use cases that we know well and have proven to be successful.
CSPs will not be able to monetize the 5G promise and meet high customer expectations if they don’t have advanced analytics and modern service assurance in place.
Dima Alkin, VP Service Assurance
Join Dima Alkin at the Sept. 23rd FutureNet Virtual Summit for the Day 2 panel discussion: How will orchestration work across all layers: transport, core, access? Or visit the TEOCO website for more information on the Helix Service Assurance platform and Helix Analytics.