In this interview, Dima Alkin, TEOCO’s Vice President of Service Assurance, discusses new technologies for discovering and managing network faults and the future impact on the industry.

Q) The way network faults are detected and managed has evolved over the years. What has this evolution looked like, and what has been the impact of machine learning?

Dima: Initially, network fault management was a time-consuming, manual process.

Each service provider’s NOC (network operations center) employed teams of subject matter experts who examined each issue as it appeared and studied the network topology, maintenance schedule, external factors, and so on until each problem was understood and could be resolved. In effect, the data was manually correlated until the root cause of the problem could be identified. Operators did it this way for decades, and for the most part it worked well. But it was costly, time-consuming, and of course highly dependent upon the expertise of network engineers.

The next step was a rules-based approach, which has been the industry standard for several years now. With rules-based service assurance, methods for responding to different situations in the network are documented using pre-determined, expert logic. This automates many of the more common or repeatable tasks and processes needed for understanding network failures. Data can be analyzed automatically and, for the most part, reasonable results are achieved.
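To make the idea of pre-determined expert logic concrete, here is a minimal sketch of a single correlation rule. It is an illustration only, not TEOCO's implementation: the `Alarm` type, the `PARENT` topology map, and the node names are all hypothetical. The rule treats an alarm as a symptom when the node it depends on is also alarming, and reports the rest as likely root causes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    node: str  # network element that raised the alarm
    kind: str  # alarm type, e.g. "LINK_DOWN" or "POWER_FAIL"


# Hypothetical topology: each node maps to the node it depends on.
PARENT = {"cell-1": "router-A", "cell-2": "router-A", "router-A": "core-1"}


def root_causes(alarms):
    """Expert rule: an alarm is a symptom if its parent node is also alarming."""
    alarming = {a.node for a in alarms}
    return [a for a in alarms if PARENT.get(a.node) not in alarming]


alarms = [
    Alarm("cell-1", "LINK_DOWN"),
    Alarm("cell-2", "LINK_DOWN"),
    Alarm("router-A", "POWER_FAIL"),
]
roots = root_causes(alarms)
print(roots)  # only the router-A alarm remains
```

Every new failure pattern needs another hand-written rule like this one, which is where the approach starts to strain.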

This has been a good approach, but there are only so many rules you can write, and much of the work still requires manual intervention. Scalability has become the issue. Often, by the time the rules are created, they are outdated. We had to look for a better solution, and that’s where machine learning comes into play.

Q) Is the transition to machine learning challenging for most operators?

Dima: The challenge in transitioning from these earlier approaches towards machine learning and automation isn’t in the technology anymore. The technology, for the most part, is available and constantly evolving. The challenge is that human decision making is becoming less and less a part of the process, and this can be unnerving. ‘Believability’ in the results is a challenge with machine learning for any industry, but especially in telecom, where there is a lot at stake and where any potential ML/AI malfunction can directly impact the customer. No one wants their network to go down and millions of customers to lose service because of a wrongly executed change driven by ML-based decisions. But, due to the high complexity of today’s networks and the need to scale, operators must now put their trust in software solutions that automate decisions that used to be made by people with years of experience. If I’m someone who has been managing networks for twenty years, this can be a hard pill to swallow, but there is simply no viable alternative. Even if you don’t care about the cost, and most carriers do, there are only so many people you can hire to manage all the alarms in the NOC, or to manually process an endless stream of network performance reports and KPIs.

At TEOCO, we understand this, and that’s why we work extremely hard to help explain the outcomes of our solutions to our customers. We can’t just say that our algorithms help them operate better and faster and expect operators to believe us. First, we must prove ourselves and earn their trust. One way we do this is by providing visual indicators that help explain the results. Another way is by working with operators to develop and fine-tune the algorithms for their specific needs – we become partners in this effort.

Extracting real value from machine learning requires a lot of specialized subject matter expertise. We bring a lot of that to the table, but for the best results, a collaborative effort that involves the customer is critical. Developing the right feature sets happens by asking our customers the right questions. This becomes an exercise in trust building as well. Customer buy-in is a key part of the process.

Q) Why is subject matter expertise still an important part of machine learning?  Isn’t ML supposed to replace that?

Dima: I think this is something that is misunderstood in the industry.  Many people assume you can take a generic machine-learning algorithm, point it at some dataset, and walk away with all the answers to every question you could ever ask.  For machine learning to deliver real value, the right features and vectors need to be defined and developed before the ‘learning’ part of ML can begin.  In some areas, like language processing, these features have been well developed and the ML tools are widely available.  But for complex telecom networks, this isn’t the case.  That’s why operators need to steer clear of the big-name, generic machine learning solutions for specialized use cases like service assurance.  The area of telecom network management and monitoring is so highly specialized and bespoke that the results from generic ML products typically don’t deliver much real value.

Q) Are there any areas of ML that are more difficult to explain, or more challenging for an operator to synthesize, than others?

Dima: Yes, we see two types of use cases, at least conceptually. In the first, machine learning and AI are used to do the same things we’ve been doing for many years, either through a rules-driven approach or manually. In this type of use case, the outcomes are well understood; we are just trying to get to the same or better results by leveraging ML and AI. Root-cause analysis and alarm correlation are good examples of this.

The second type of use case is one that is much harder to digest. It is where the use of ML and AI provides new opportunities to do things that we have never attempted in the past. Predictive analytics is a good example.

Predictive analytics is completely new in the realm of service assurance. In the past, nobody really tried to make predictions manually, at least not at scale. It’s just not sustainable. You cannot analyze such a large amount of data manually and gain these types of insights. So, in a way, this is something that is quite different – both from a technology perspective and, even more importantly, from an organizational perspective.

Q) What is a use case example of predictive analytics in action?

Dima: For example, TEOCO’s HELIX fault management analytics layer can accurately predict network and equipment failures by applying unsupervised machine learning to historical data. To most operators, who are constantly trying to resolve issues in the network as they happen, this sounds almost too good to be true – but this is exactly what we can deliver today with our technology. We can often predict that a certain piece of equipment is going to fail within a specific period, with an extremely high level of precision and recall.
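For readers less familiar with the two metrics mentioned here: precision is the share of predicted failures that really happened, and recall is the share of real failures that were predicted. The sketch below computes both for one prediction window. The equipment IDs and numbers are invented for illustration and are not HELIX results.

```python
def precision_recall(predicted, actual):
    """Compare equipment flagged as 'will fail' with equipment that did fail.

    Both arguments are sets of equipment IDs for the same time window.
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical data for one prediction window.
predicted_failures = {"bts-07", "bts-12", "rtr-03", "rtr-09"}
actual_failures = {"bts-07", "bts-12", "rtr-03", "sw-21", "sw-40"}

precision, recall = precision_recall(predicted_failures, actual_failures)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

High precision means few wasted truck rolls on false alarms; high recall means few failures arrive unannounced. A useful predictor needs both.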

What we’ve learned is that while most CSPs are keenly interested in having these kinds of insights, they don’t yet have the processes in place to support them. They will have ways to manage existing alarms and faults and trouble tickets – but no way to manage a network failure that hasn’t even happened yet. The technology is there, but not the processes. This gap will need to be addressed for network management automation to be successful.

Q) What does the future of predictive service assurance look like, and do you have any recommendations on how service providers can get there?

Dima: Right now, predictive analytics in network fault management is so new that it’s still a bit of a novelty. It needs to become more than that – because predictive insights can help drive better decision making and improved operations.

I think there are four areas that operators need to focus on when implementing predictive analytics for network operations purposes:

1. Predict the things that matter – and don’t worry about the rest
We’ve come to realize that it’s not only about the ability to predict something, it’s about working with the customer and asking, ‘what is it that is most important for your operations and your customers – what do you want to predict?’ You don’t need to predict everything. The goal should be to focus on things where you can take preventative measures, but also where the cure will cost less than the damage of waiting and taking your chances. In some instances, it may make more sense to wait and let it fail, and then deal with it when that happens.

In other words, the cost associated with acting on a predicted failure must be significantly lower than waiting – but cost can mean many things: revenue, customer satisfaction, network security. This all needs to be factored into the decision process – which takes me to my second point.

2. Have the right resources in place
If the Network Operations Center is already dealing with hundreds of problems a day – managing things that are already broken – I’m not going to divert my efforts and resources to something that isn’t yet broken. That’s what this next generation of use cases will be about, and where machine learning and AI create opportunities. But again, the CSP as an organization needs to decide what to do with this information, and that brings us to the third issue.

3. Have the right processes in place
For most CSPs, nobody has the role of responding to predictions. We are all running around putting out fires of things that are broken now. When a network failure happens, there is someone in charge of fixing it. But when you’re capable of predicting failures, nobody is on point to deal with that. As I mentioned earlier, there is no process in place for managing predictions – but everyone still wants the information. When it comes to predictive analytics, I think the human side still needs to catch up with the technology, which brings me to my final point.

4. Build confidence
The way to respond to predictions is not to follow the same process used for existing faults. You should have a different process, probably a much more automated one. For this to happen, there needs to be an evolution in organizational maturity – in the processes of doing business – because everything is changing, not just the technology.
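The trade-off in the first point, acting only when the cure costs significantly less than waiting, can be sketched as a simple expected-cost check. This is a hypothetical illustration: the function name, the `margin` factor standing in for "significantly", and the figures are all assumptions, and the failure cost is assumed to already combine revenue, customer satisfaction, and security impact into one number.

```python
def should_act(p_failure, cost_of_failure, cost_of_action, margin=2.0):
    """Return True when acting on a prediction is clearly cheaper than waiting.

    p_failure       -- predicted probability that the equipment fails
    cost_of_failure -- total damage if it fails (assumed single unit)
    cost_of_action  -- cost of the preventive measure, in the same unit
    margin          -- hypothetical factor: how much cheaper acting must be
    """
    expected_loss_if_waiting = p_failure * cost_of_failure
    return expected_loss_if_waiting >= margin * cost_of_action


# A likely, expensive failure with a cheap fix: act now.
act_on_core_router = should_act(0.8, 50_000, 5_000)

# A likely but cheap failure with a comparatively costly fix: let it fail.
act_on_spare_unit = should_act(0.8, 2_000, 1_500)

print(act_on_core_router, act_on_spare_unit)  # True False
```

A check like this only helps if someone, or some automated process, owns the decision it produces, which is the point of items 2 through 4.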

Q) When will predictive analytics become mainstream?

Dima: As an industry, our goal should be to make managing the network more automated and seamless. It’s the only way we will be able to deliver on the promise of 5G and, more specifically, network slicing. People who are building new 5G networks from scratch see problems coming – and they want to have the information in place to fix faults before they happen.

Our belief is that CSPs will need to have this capability from day one. They will not be able to monetize a lot of the 5G promise, and their promise to their customers, if they don’t have advanced analytics and modern service assurance in place.

Most of the service assurance capabilities TEOCO develops today have two parts. One part is focused on improving existing processes – what is often called legacy automation. And the other part is focused on new use cases in areas like 5G, slicing and multi-edge computing. Maybe today, collectively as an industry, we cannot define exactly what all these new use cases will look like, but we are putting the infrastructure in place to be able to support them through our service assurance application.


Dima Alkin, VP Service Assurance