This article was written by Dima Alkin, VP North America, Service Assurance at TEOCO, and originally appeared on SDxCentral.

Several major communication service providers (CSPs) are advancing toward network function virtualization (NFV)-enabled architectures, which will help cut costs and provide their customers with more agile, flexible, on-demand services, including entirely new "anything-as-a-service" offerings. But this major technological shift is creating a set of service assurance-related operational issues that also need to be addressed.

The NFV-enabled environment is changing the way we think about network management applications, as it is becoming evident that the days of proprietary mega-solutions that "do everything for everyone" are over. In this evolving environment, we will see operators deploying a mix of loosely coupled software components from various sources: some developed in-house, some commercial off-the-shelf, and some open source. The critical factor is that they all need to be able to communicate through industry-defined open application program interfaces (APIs) and exchange information in real time, to support network management and service assurance functions in a more agile and flexible way than ever before.

This new reality is pushing the innovators in the operations support systems (OSS)/business support systems (BSS) industry to rethink traditional approaches and come up with an entirely new set of tools and methods to address both old and new service assurance challenges. For example, some traditional fault management use cases include reducing mean-time-to-repair (MTTR) and improving network operations center (NOC) efficiency through advanced alarm filtering, secondary alarm suppression, and root-cause analysis.

Many of the traditional approaches to solving those issues, such as expert-system, rules-driven root cause analysis (RCA), or even the more advanced topology-based correlation, will simply not be enough in an NFV environment. An expert-system-driven approach cannot accurately anticipate network behavior or prevent alarm propagation issues via a rules-based engine. A topology-based approach will produce limited and unreliable results as well, due to the highly dynamic nature of an NFV-enabled network. It is also important to note that analytics, when applied to fault management, need to provide results with a very high level of confidence. Simply pointing to a possible root cause is not enough and can, in fact, be detrimental. Given the critical, real-time nature of the NOC's role in keeping the network up and running, it would actually be better to return to manual, time-consuming investigations than to try to fix a wrong root cause supplied by even the most advanced RCA engine. But clearly, that is not the answer either. To realize the true benefits of NFV, we need to move toward greater automation, and that will only happen with proven reliability.
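To make the limitation concrete, here is a minimal, hypothetical sketch of a rules-driven RCA engine of the kind described above. All rule patterns and alarm names are illustrative, not taken from any real product: a static rule maps a set of co-occurring alarms to a presumed root cause, and any pattern the rule author did not anticipate, such as one produced by a virtual function migrating between hosts, goes undiagnosed.

```python
# Hypothetical rules-driven RCA engine; rules and alarm names are invented
# for illustration. Each rule pairs a required set of alarm types with the
# root cause inferred when all of them are active.

RULES = [
    ({"LINK_DOWN", "BGP_PEER_LOST"}, "fiber cut on transport link"),
    ({"CPU_OVERLOAD", "PACKET_DROP"}, "congested forwarding plane"),
]

def rules_rca(active_alarms):
    """Return the root cause of the first rule fully matched, else None."""
    for pattern, cause in RULES:
        if pattern <= active_alarms:  # all required alarms are present
            return cause
    return None  # unanticipated alarm pattern: no diagnosis at all

print(rules_rca({"LINK_DOWN", "BGP_PEER_LOST", "LOS"}))
# → fiber cut on transport link

# An alarm pattern caused by VNF migration has no hand-written rule:
print(rules_rca({"VM_MIGRATED", "LATENCY_SPIKE"}))
# → None
```

The failure mode is the `None` case: in a static network an expert can enumerate the meaningful patterns up front, but in an NFV environment the patterns themselves shift as workloads move.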

It’s safe to say that all these challenges are creating the need for new approaches. Investing in new-generation analytical capabilities that are optimized for today’s hybrid NFV environments will help CSPs better realize the full value of their NFV investments. One example of such an advancement is using natural-language processing algorithms to eliminate the data normalization and clean-up requirements for alarm data, and applying machine-learning techniques to support advanced correlation and RCA without the need to augment alarm data with network topology and reference information. That augmentation requirement typically becomes an inhibitor to an analytics project’s success, as the data often isn’t readily available or requires a significant integration effort. Our recent efforts in this area have surpassed even our own expectations. By exclusively using alarm history data, we have been able to accurately cluster alarms and identify root causes with greater levels of certainty than ever before, and then feed those results in real time into the fault management systems, enabling faster, more accurate resolution of network failures.
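The idea of clustering alarms from history alone, without topology data, can be sketched in a few lines. The approach below is an illustrative simplification, not the method described in the article: alarms that repeatedly fire within the same short time window are grouped by co-occurrence, and any cluster larger than one alarm becomes a candidate for a shared root cause. All thresholds and alarm identifiers are assumptions.

```python
# Illustrative alarm clustering from alarm history alone (no topology).
# Alarms that co-occur within `window` seconds on at least two separate
# occasions are merged into the same cluster via union-find.

from collections import defaultdict

def cluster_alarms(history, window=60):
    """history: list of (timestamp_seconds, alarm_id) tuples."""
    events = sorted(history)
    cooccur = defaultdict(int)
    # Count how often each pair of distinct alarms fires close together.
    for i, (t1, a1) in enumerate(events):
        for t2, a2 in events[i + 1:]:
            if t2 - t1 > window:
                break  # events are sorted; nothing later is closer
            if a1 != a2:
                cooccur[frozenset((a1, a2))] += 1

    # Union-find: merge alarms whose co-occurrence is repeated, not a one-off.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for pair, count in cooccur.items():
        if count >= 2:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for _, alarm in events:
        clusters[find(alarm)].add(alarm)
    return [c for c in clusters.values() if len(c) > 1]

# Two bursts of A/B/C plus an unrelated alarm X:
history = [(0, "A"), (5, "B"), (10, "C"),
           (300, "A"), (305, "B"), (310, "C"), (1000, "X")]
print(cluster_alarms(history))
# → [{'A', 'B', 'C'}]
```

A production system would add statistical significance tests and confidence scoring on top of raw co-occurrence, since, as noted above, fault management analytics must deliver results the NOC can trust.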

Another unique requirement of this new reality is the need to manage and assure the performance of both physical and logical components in the hybrid “pre-NFV” and NFV environments. In addition to the widely accepted fact that transitioning networks to software-defined networking (SDN) and NFV is going to be gradual and will take time, many network components are not going anywhere at all – one can’t virtualize fiber or a radio antenna. This kind of hybrid network is likely to become the norm for at least the next several years, and carriers will need more than a temporary patch to manage and assure the performance of all their underlying network components – physical, virtual, and abstract alike. Service assurance solutions, therefore, need to be “network-aware” across all the domains to enable an end-to-end service performance view, combining traditional mature data models like TM Forum’s shared information/data model (SID) with more cloud- and NFV-focused models like OASIS’s topology and orchestration specification for cloud applications (TOSCA).

These are just a few practical examples of how the latest technology shift is driving innovation in today’s service assurance solutions, with the goal of providing better support for the operational aspects of network management and a smoother transition to an NFV infrastructure. We see many industry leaders making this a high priority and using it as an opportunity to entirely rethink their OSS environment and approach, but there are also carriers who are more cautious, given the complexity of their OSS environment and concern over their organization’s readiness to accept such aggressive changes. As carriers evaluate their service assurance strategy, whatever approach they decide to take, whether a home-grown system, an open source solution, or a commercial off-the-shelf product, we encourage them to take a close look at three key components: automation, analytics, and APIs. Without these, it will be essentially impossible to monitor and assure the performance of today’s hybrid networks.