The Value of Fault Management Analytics, Part I

Putting smart analytical capabilities into the hands of the operational and engineering users to ‘turbo-charge’ modern service assurance.

For network engineers, the rapid introduction of new network technologies and services, the explosive growth of IoT ecosystems as shown in the graph below, and the criticality of maintaining the best customer experience in today’s highly competitive market all require close monitoring of an ever-growing number of network alarms.

Analytical algorithms help with the prompt resolution of network issues and the reduction of fault discovery times, and can extend as far as the prevention of outages. Paired with service assurance, advanced analytics empowers engineers and operational users, turning an essentially reactive activity into a proactive one.

Boosting network efficiency and flexibility is the goal, and it is expected to be achieved within the current Network and Service Operations Center (N/SOC) budgetary and head-count limits. Smarter infrastructure and new capabilities need to be introduced into the current operational framework so that advanced analytical algorithms and techniques can digest the extra information and automatically harvest actionable knowledge.

Smart analytics has multiple use cases and benefits across both operations and engineering departments, but it is in the N/SOC environment that we can start measuring improvements in operational efficiency and reductions in mean time to repair (MTTR). Because sophisticated analytical algorithms can be seamlessly integrated into N/SOC operatives’ daily tasks, they empower users to quickly identify network disruptions and even prevent them from happening.
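To make the MTTR measurement concrete, here is a minimal sketch of how it could be computed from alarm raise and clear times. The function name and the input shape (pairs of timestamps) are assumptions made for illustration, not part of any particular fault management product.

```python
from datetime import datetime


def mean_time_to_repair(incidents):
    """Return MTTR in hours.

    incidents: list of (raised_at, cleared_at) datetime pairs, one per
    resolved fault. This input shape is a hypothetical choice for the sketch.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total_seconds = sum(
        (cleared - raised).total_seconds() for raised, cleared in incidents
    )
    return total_seconds / len(incidents) / 3600


# Two example faults: one repaired in 2 hours, one in 4 hours.
incidents = [
    (datetime(2024, 1, 1, 8, 0), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 13, 0)),
]
mttr_hours = mean_time_to_repair(incidents)
```

Tracking this figure before and after analytics is introduced is one simple way to quantify the improvement.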

Two major themes give us the right lens to look at the impact of analytical algorithms on operations:

  • Automatic Root-Cause Analysis
  • Predictive Failure Identification

Analytics-enhanced RCA is the new rule
Root-cause analysis (RCA) is a critical component of any N/SOC user’s toolkit. By using pre-defined rules to correlate different alarms and determine the root cause of problems occurring in the network, the number of alarms to be managed is reduced, and the efforts of N/SOC users are focused on the real cause of the problem. Ultimately, RCA contributes to the reduction of repair times.
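As an illustration of rule-based correlation, the sketch below folds alarms that a pre-defined rule can explain under their parent alarm, so only the root causes are left to manage. The rule table, the alarm kinds, the `site` field and the 60-second window are all hypothetical; real rule sets are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    site: str         # location that raised the alarm (hypothetical field)
    kind: str         # alarm type, e.g. "LINK_DOWN"
    timestamp: float  # seconds since some reference point
    children: list = field(default_factory=list)


# Hypothetical rule table: parent alarm kind -> alarm kinds it explains.
RULES = {
    "POWER_FAILURE": {"LINK_DOWN", "CELL_UNAVAILABLE"},
    "LINK_DOWN": {"CELL_UNAVAILABLE"},
}


def correlate(alarms, rules=RULES, window=60.0):
    """Return root-cause alarms only; explained alarms become their children."""
    roots = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        # Look for an earlier root alarm at the same site whose rule covers this one.
        parent = next(
            (
                r
                for r in roots
                if r.site == alarm.site
                and alarm.kind in rules.get(r.kind, ())
                and alarm.timestamp - r.timestamp <= window
            ),
            None,
        )
        if parent is None:
            roots.append(alarm)
        else:
            parent.children.append(alarm)
    return roots


alarms = [
    Alarm("site-A", "POWER_FAILURE", 0.0),
    Alarm("site-A", "LINK_DOWN", 5.0),
    Alarm("site-B", "LINK_DOWN", 7.0),
    Alarm("site-A", "CELL_UNAVAILABLE", 10.0),
]
roots = correlate(alarms)
```

In this example four alarms collapse into two root causes. Note that the sketch only matches children that arrive after their parent, and it knows nothing about alarm kinds that are missing from the rule table.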

But isn’t there a risk that these pre-definitions create limitations? Even though rules are updated constantly to focus on the most common and known relationships between alarms, how can root-cause analysis offer a lasting solution?

With the number of new services that are introduced and quickly become part of our digital way of life, new network elements, end devices and “things” are constantly being added to the digital environment that operators need to manage. Regularly updating and defining new sets of rules that take into account the inter-relationships between all these new entities is a huge challenge.

Automatic RCA analytical algorithms add another level of automation to fault management, extending rule-based RCA with more dynamic and adaptive mechanisms. Our algorithms study and analyze the stream of alarms reaching the system, suggest relationships, and tag the potential root-cause alarms among them.
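One simple way to picture how relationships can be suggested from the alarm stream itself is co-occurrence counting: if one alarm kind is almost always followed by another within a short window, the first is a candidate parent of the second. The sketch below is an illustrative stand-in under that assumption, not the algorithm used in any specific product; the function name, thresholds and alarm kinds are hypothetical.

```python
from collections import Counter


def suggest_relationships(events, window=60.0, min_support=3, min_confidence=0.8):
    """Suggest parent -> child alarm relationships from an alarm history.

    events: iterable of (timestamp_seconds, alarm_kind) tuples.
    Returns {(parent_kind, child_kind): confidence}, where confidence is the
    share of parent occurrences that were followed by the child in the window.
    """
    events = sorted(events)
    occurrences = Counter(kind for _, kind in events)
    followed_by = Counter()
    for i, (t_i, kind_i) in enumerate(events):
        seen = set()  # count each following kind once per parent occurrence
        for t_j, kind_j in events[i + 1:]:
            if t_j - t_i > window:
                break
            if kind_j != kind_i and kind_j not in seen:
                seen.add(kind_j)
                followed_by[(kind_i, kind_j)] += 1
    return {
        pair: count / occurrences[pair[0]]
        for pair, count in followed_by.items()
        if count >= min_support and count / occurrences[pair[0]] >= min_confidence
    }


history = [
    (0.0, "FAN_FAILURE"), (10.0, "OVERHEAT"),
    (500.0, "LINK_FLAP"),
    (1000.0, "FAN_FAILURE"), (1010.0, "OVERHEAT"),
    (2000.0, "FAN_FAILURE"), (2010.0, "OVERHEAT"),
]
suggested = suggest_relationships(history)
```

Here the only suggestion is that FAN_FAILURE is a likely parent of OVERHEAT. No rule had to be written in advance, which is what lets this style of analysis keep up as new elements and alarm kinds appear.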

Such mechanisms can significantly improve the identification of parent alarms (i.e. the root cause) in scenarios that are not pre-defined and where new elements may have been introduced to the network. In terms of N/SOC efficiency, automatic RCA reduces the number of alarms that controllers need to manage, and assists in fixing the problems identified.

As a result, the impact of automatic RCA is much more lasting, because it does not depend on every scenario having been defined in advance.

Empowering N/SOC users with automation
The value of automatic RCA does not end with the identification of the root cause. Once the root cause is identified, the problem still needs to be repaired. By feeding the analysis into other applications, a set of automated corrective actions can be taken to solve the problem quickly and automatically. But can this work effectively? Are N/SOC staff empowered or disrupted?

From what we have witnessed over two decades of providing operational and analytical solutions to N/SOC and engineering teams, analytics reduces the number of alarms controllers need to manage, and assists in fixing the heart of the problems identified within the network. Overall, leveraging the latest tools and analytical algorithms helps service providers achieve their efficiency goals by automatically gathering all of the relevant information and providing the insight needed to better manage today’s evolving networks.

Look out for Part II of The Value of Fault Management Analytics, where we will share another smart analytical capability: Predictive Failure Identification.

Click here for more information on Helix, TEOCO’s Fault Management Solution.