RFC 9232: Network Telemetry Framework
- H. Song,
- F. Qin,
- P. Martinez-Julia,
- L. Ciavaglia,
- A. Wang
Abstract
Network telemetry is a technology for gaining network insight and facilitating efficient and automated network management. It encompasses various techniques for remote data generation, collection, correlation, and consumption. This document describes an architectural framework for network telemetry, motivated by challenges that are encountered as part of the operation of networks and by the requirements that ensue. This document clarifies the terminology and classifies the modules and components of a network telemetry system from different perspectives. The framework and taxonomy help to set a common ground for the collection of related work and provide guidance for related technique and standard developments.¶
Status of This Memo
This document is not an Internet Standards Track specification; it is published for informational purposes.¶
This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are candidates for any level of Internet Standard; see Section 2 of RFC 7841.¶
Information about the current status of this document, any
errata, and how to provide feedback on it may be obtained at
https://
Copyright Notice
Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://
1. Introduction
Network visibility is the ability of management tools to see the state and behavior of a network, which is essential for successful network operation. Network telemetry revolves around network data that 1) can help provide insights about the current state of the network, including network devices, forwarding, control, and management planes; 2) can be generated and obtained through a variety of techniques, including but not limited to network instrumentation and measurements; and 3) can be processed for purposes ranging from service assurance to network security using a wide variety of data analytics techniques. In this document, network telemetry refers to both the data itself (i.e., "Network Telemetry Data") and the techniques and processes used to generate, export, collect, and consume that data for use by potentially automated management applications. Network telemetry extends beyond the classical network Operations, Administration, and Maintenance (OAM) techniques and is expected to support better flexibility, scalability, accuracy, coverage, and performance.¶
However, the term "network telemetry" lacks an unambiguous definition. The scope and coverage of it cause confusion and misunderstandin
To fulfill such an undertaking, we first discuss some key characteristics of network telemetry that set a clear distinction from the conventional network OAM and show that some conventional OAM technologies can be considered a subset of the network telemetry technologies. We then provide an architectural framework for network telemetry that includes four modules, each associated with a different category of telemetry data and corresponding procedures. All the modules are internally structured in the same way, including components that allow the operator to configure data sources in regard to what data to generate and how to make that available to client applications, components that instrument the underlying data sources, and components that perform the actual rendering, encoding, and exporting of the generated data. We show how the network telemetry framework can benefit current and future network operations. Based on the distinction of modules and function components, we can map the existing and emerging techniques and protocols into the framework. The framework can also simplify designing, maintaining, and understanding a network telemetry system. In addition, we outline the evolution stages of the network telemetry system and discuss the potential security concerns.¶
The purpose of the framework and taxonomy is to set a common ground for the collection of related work and provide guidance for future technique and standard developments. To the best of our knowledge, this document is the first such effort for network telemetry in industry standards organizations. This document does not define specific technologies.¶
1.1. Applicability Statement
Large-scale network data collection is a major threat to user privacy and may be indistinguishab
1.2. Glossary
Before further discussion, we list some key terminology and abbreviations used in this document. There is an intended differentiation between the terms of network telemetry and OAM. However, it should be understood that there is not a hard-line distinction between the two concepts. Rather, network telemetry is considered an extension of OAM. It covers all the existing OAM protocols but puts more emphasis on the newer and emerging techniques and protocols concerning all aspects of network data from acquisition to consumption.¶
- AI:
- Artificial Intelligence. In the network domain, AI refers to machine-learning-based technologies for automated network operation and other tasks.¶
- AM:
- Alternate Marking. A flow performance measurement method, as specified in [RFC8321].¶
- BMP:
- BGP Monitoring Protocol. Specified in [RFC7854].¶
- DPI:
- Deep Packet Inspection. Refers to the techniques that examine packets beyond packet L3/L4 headers.¶
- gNMI:
- gRPC Network Management Interface. A network management protocol from the OpenConfig Operator Working Group, mainly contributed by Google. See [gnmi] for details.¶
- GPB:
- Google Protocol Buffer. An extensible mechanism for serializing structured data. See [gpb] for details.¶
- gRPC:
- gRPC Remote Procedure Call. An open-source high-performance RPC framework that gNMI is based on. See [grpc] for details.¶
- IPFIX:
- IP Flow Information Export Protocol. Specified in [RFC7011].¶
- IOAM:
- In situ OAM [RFC9197]. A data plane on-path telemetry technique.¶
- JSON:
- JavaScript Object Notation. An open standard file format and data interchange format that uses human-readable text to store and transmit data objects, as specified in [RFC8259].¶
- MIB:
- Management Information Base. A database used for managing the entities in a network.¶
- NETCONF:
- Network Configuration Protocol. Specified in [RFC6241].¶
- NetFlow:
- A Cisco protocol used for flow record collecting, as described in [RFC3954].¶
- Network Telemetry:
- The process and instrumentation for acquiring and utilizing network data remotely for network monitoring and operation. A general term for a large set of network visibility techniques and protocols, concerning aspects like data generation, collection, correlation, and consumption. Network telemetry addresses current network operation issues and enables smooth evolution toward future intent-driven autonomous networks.¶
- NMS:
- Network Management System. Refers to applications that allow network administrators to manage a network.¶
- OAM:
- Operations, Administration, and Maintenance. A group of network management functions that provide network fault indication, fault localization, performance information, and data and diagnosis functions. Most conventional network monitoring techniques and protocols belong to network OAM.¶
- PBT:
- Postcard-Based Telemetry. A data plane on-path telemetry technique. A representative technique is described in [IPPM-IOAM-DIRECT-EXPORT].¶
- RESTCONF:
- An HTTP-based protocol that provides a programmatic interface for accessing data defined in YANG, using the datastore concepts defined in NETCONF, as specified in [RFC8040].¶
- SMIv2:
- Structure of Management Information Version 2. Defines MIB objects, as specified in [RFC2578].¶
- SNMP:
- Simple Network Management Protocol. Versions 1, 2, and 3 are specified in [RFC1157], [RFC3416], and [RFC3411], respectively.¶
- XML:
- Extensible Markup Language. A markup language for data encoding that is both human readable and machine readable, as specified by W3C [W3C.REC-xml-20081126].¶
- YANG:
- YANG is a data modeling language for the definition of data sent over network management protocols such as NETCONF and RESTCONF. YANG is defined in [RFC6020] and [RFC7950].¶
- YANG ECA:
- A YANG model for Event-Condition-Action policies, as defined in [NETMOD-ECA-POLICY].¶
- YANG-Push:
- A mechanism that allows subscriber applications to request a stream of updates from a YANG datastore on a network device. Details are specified in [RFC8639] and [RFC8641].¶
2. Background
The term "big data" is used to describe the extremely large volume of data sets that can be analyzed computationally to reveal patterns, trends, and associations. Networks are undoubtedly a source of big data because of their scale and the volume of network traffic they forward. When a network's endpoints do not represent individual users (e.g., in industrial, data-center, and infrastructure contexts), network operations can often benefit from large-scale data collection without breaching user privacy.¶
Today, one can access advanced big data analytics capability through a plethora of commercial and open-source platforms (e.g., Apache Hadoop), tools (e.g., Apache Spark), and techniques (e.g., machine learning). Thanks to the advance of computing and storage technologies, network big data analytics gives network operators an opportunity to gain network insights and move towards network autonomy. Some operators have started to explore the application of Artificial Intelligence (AI) to make sense of network data. Software tools can use the network data to detect and react to network faults, anomalies, and policy violations, as well as predict future events. In turn, network policy updates for planning, intrusion prevention, optimization, and self-healing may be applied.¶
It is conceivable that an autonomic network [RFC7575] is the logical next step for network evolution following Software
However, while the data processing capability is improved and applications require more data to function better, the networks lag behind in extracting and translating network data into useful and actionable information in efficient ways. The system bottleneck is shifting from data consumption to data supply. Both the number of network nodes and the traffic bandwidth keep increasing at a fast pace. Network configuration and policy change at shorter intervals than before. More subtle events and fine-grained data through all network planes need to be captured and exported in real time. In a nutshell, it is a challenge to get enough high-quality data out of the network in a manner that is efficient, timely, and flexible. Therefore, we need to survey the existing technologies and protocols and identify any potential gaps.¶
In the remainder of this section, we first clarify the scope of network data (i.e., telemetry data) relevant in this document. Then, we discuss several key use cases for network operations of today and the future. Next, we show why the current network OAM techniques and protocols are insufficient for these use cases. The discussion underlines the need for new methods, techniques, and protocols, as well as the extensions of existing ones, which we assign under the umbrella term "Network Telemetry".¶
2.1. Telemetry Data Coverage
Any information that can be extracted from networks (including the data plane, control plane, and management plane) and used to gain visibility or as a basis for actions is considered telemetry data. It includes statistics, event records and logs, snapshots of state, configuration data, etc. It also covers the outputs of any active and passive measurements [RFC7799]. In some cases, raw data is processed in the network before being sent to a data consumer. Such processed data is also considered telemetry data. The value of telemetry data varies. In some cases, if the cost is acceptable, a smaller amount of higher-quality data is preferred over a large amount of low-quality data. A classification of telemetry data is provided in Section 3. To preserve the privacy of end users, no user packet content should be collected. Specifically, the data objects generated, exported, and collected by a network telemetry application should not include any packet payload from traffic associated with end-user systems.¶
2.2. Use Cases
The following set of use cases is essential for network operations. While the list is by no means exhaustive, it is enough to highlight the requirements for data velocity, variety, volume, and veracity, the attributes of big data, in networks.¶
2.3. Challenges
For a long time, network operators have relied upon SNMP [RFC3416], Command-Line Interface (CLI), or Syslog [RFC5424] to monitor the network. Some other OAM techniques as described in [RFC7276] are also used to facilitate network troubleshooting
These challenges were addressed by newer standards and techniques (e.g., IPFIX/NetFlow, Packet Sampling (PSAMP), IOAM, and YANG-Push), and more are emerging. These standards and techniques need to be recognized and accommodated in a new framework.¶
2.4. Network Telemetry
Network telemetry has emerged as a mainstream technical term to refer to the network data collection and consumption techniques. Several network telemetry techniques and protocols (e.g., IPFIX [RFC7011] and gRPC [grpc]) have been widely deployed. Network telemetry allows separate entities to acquire data from network devices so that data can be visualized and analyzed to support network monitoring and operation. Network telemetry covers the conventional network OAM and has a wider scope. For instance, it is expected that network telemetry can provide the necessary network insight for autonomous networks and address the shortcomings of conventional OAM techniques.¶
Network telemetry usually assumes machines as data consumers rather than human operators. Hence, network telemetry can directly trigger automated network operation, whereas some conventional OAM tools were designed and used to help human operators monitor and diagnose networks and to guide manual network operations. This difference in the intended consumer leads to very different techniques.¶
Although new network telemetry techniques are emerging and subject to continuous evolution, several characteristics of network telemetry have been well accepted. Note that network telemetry is intended to be an umbrella term covering a wide spectrum of techniques, so the following characteristics are not expected to be held by every specific technique.¶
In addition, an ideal network telemetry solution may also have the following features or properties:¶
It is worth noting that a network telemetry system should not be intrusive to normal network operations and should avoid the pitfall of the "observer effect". That is, it should neither change the network behavior nor affect the forwarding performance. Moreover, high-volume telemetry traffic may cause network congestion unless proper isolation or traffic engineering techniques are in place, or congestion control mechanisms ensure that telemetry traffic backs off if it exceeds the network capacity. [RFC8084] and [RFC8085] are relevant Best Current Practices (BCPs) in this space.¶
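The back-off behavior mentioned above can be pictured with a small sketch. The following Python fragment is only an illustration under stated assumptions: it uses an additive-increase/multiplicative-decrease (AIMD) rule, which is one possible choice and is not prescribed by this document, and the function name `next_export_rate`, its parameters, and the congestion signal are all hypothetical.

```python
from typing import List


def next_export_rate(rate: float, congested: bool,
                     floor: float = 1.0, ceiling: float = 1000.0,
                     step: float = 10.0) -> float:
    """Return the next telemetry export rate (records per second).

    Hypothetical AIMD rule: halve the rate when congestion is signaled,
    otherwise probe upward by a fixed step, staying within [floor, ceiling].
    """
    if congested:
        return max(floor, rate / 2)
    return min(ceiling, rate + step)


# Simulated congestion signals observed by the exporter (assumed input).
signals: List[bool] = [False, False, True, False]

rate = 100.0
history: List[float] = []
for congested in signals:
    rate = next_export_rate(rate, congested)
    history.append(rate)

print(history)
```

In this sketch the exporter yields capacity quickly when congestion is reported and recovers slowly afterwards, which keeps telemetry traffic from competing aggressively with user traffic.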
Although in many cases a system for network telemetry involves a remote data collecting and consuming entity, it is important to understand that there are no inherent assumptions about how a system should be architected. While a network architecture with a centralized controller (e.g., SDN) seems to be a natural fit for network telemetry, network telemetry can work in distributed fashions as well. For example, telemetry data producers and consumers can have a peer-to-peer relationship, in which a network node can be the direct consumer of telemetry data from other nodes.¶
2.5. The Necessity of a Network Telemetry Framework
Network data analytics (e.g., machine learning) is applied for network operation automation, relying on abundant and coherent data from networks. Data acquisition that is limited to a single source and static in nature will in many cases not be sufficient to meet an application's telemetry data needs. As a result, multiple data sources, involving a variety of techniques and standards, will need to be integrated. It is desirable to have a framework that classifies and organizes different telemetry data sources and types, defines different components of a network telemetry system and their interactions, and helps coordinate and integrate multiple telemetry approaches across layers. This allows flexible combinations of data for different applications, while normalizing and simplifying interfaces. In detail, such a framework would benefit the development of network operation applications for the following reasons:¶
A telemetry framework collects all the telemetry
3. Network Telemetry Framework
The top-level network telemetry framework partitions the network telemetry into four modules based on the telemetry data object source and represents their relationship. Once the network operation applications acquire the data from these modules, they can apply data analytics and take actions. At the next level, the framework decomposes each module into separate components. Each of these modules follows the same underlying structure, with one component dedicated to the configuration of data subscriptions and data sources, a second component dedicated to encoding and exporting data, and a third component instrumenting the generation of telemetry related to the underlying resources. Throughout the framework, the same set of abstract data-acquiring mechanisms and data types (Section 3.3) are applied. The two-level architecture with the uniform data abstraction helps accurately pinpoint a protocol or technique to its position in a network telemetry system or disaggregates a network telemetry system into manageable parts.¶
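As a rough illustration of the uniform module structure described above, the following Python sketch models each of the four modules with the same three components: configuration, instrumentation, and encoding/export. All class, method, and data item names (`TelemetryModule`, `Configurator`, `Instrumentation`, `Exporter`, `queue_depth`) are hypothetical and not defined by this document, and JSON is used only as an assumed encoding.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Configurator:
    """Records which data items the operator wants generated."""
    enabled_items: List[str] = field(default_factory=list)

    def enable(self, item: str) -> None:
        if item not in self.enabled_items:
            self.enabled_items.append(item)


@dataclass
class Instrumentation:
    """Maps a data item name to a probe that reads the underlying resource."""
    probes: Dict[str, Callable[[], object]] = field(default_factory=dict)

    def read(self, item: str) -> object:
        return self.probes[item]()


class Exporter:
    """Encodes generated data for export (JSON is an assumption)."""

    def encode(self, module: str, data: Dict[str, object]) -> str:
        return json.dumps({"module": module, "data": data}, sort_keys=True)


@dataclass
class TelemetryModule:
    """One top-level module; every module shares the same internal structure."""
    name: str
    config: Configurator = field(default_factory=Configurator)
    instruments: Instrumentation = field(default_factory=Instrumentation)
    exporter: Exporter = field(default_factory=Exporter)

    def export_once(self) -> str:
        data = {item: self.instruments.read(item)
                for item in self.config.enabled_items}
        return self.exporter.encode(self.name, data)


MODULE_NAMES = ["management-plane", "control-plane",
                "forwarding-plane", "external-data"]
modules = {name: TelemetryModule(name) for name in MODULE_NAMES}

fwd = modules["forwarding-plane"]
fwd.instruments.probes["queue_depth"] = lambda: 17  # stand-in for a hardware read
fwd.config.enable("queue_depth")
record = fwd.export_once()
print(record)
```

Because every module exposes the same three roles, a protocol or technique can be placed by asking which module it serves and which role it fills.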
3.1. Top-Level Modules
Telemetry can be applied on the forwarding plane, control plane, and management plane in a network, as well as on other sources out of the network, as shown in Figure 1. Therefore, we categorize the network telemetry into four distinct modules (management plane, control plane, forwarding plane, and external data and event telemetry) with each having its own interface to network operation applications.¶
The rationale of this partition lies in the different telemetry data objects that result in different data sources and export locations. Such differences have profound implications on in-network data programming and processing capability, data encoding and the transport protocol, and required data bandwidth and latency. Data can be sent directly or proxied via the control and management planes. There are advantages
Note that in some cases, the network controller itself may be the source of telemetry data that is unique to it or derived from the telemetry data collected from the network elements. Some of the principles and taxonomy specific to the control plane and management plane telemetry could also be applied to the controller when it is required to provide the telemetry data to network operation applications hosted outside. The scope of this document is focused on the network elements telemetry, and further details related to controllers are thus out of scope.¶
We summarize the major differences of the four modules in Table 1. They are compared from six angles:¶
Data Object is the target and source of each module. Because the data source varies, the location where data is most conveniently exported also varies. For example, forwarding plane data mainly originates as data exported from the forwarding Application
Note that the interaction with the applications that consume network telemetry data can be indirect. Some in-device data transfer is possible. For example, in the management plane telemetry, the management plane will need to acquire data from the data plane. Some operational states can only be derived from data plane data sources such as the interface status and statistics. As another example, obtaining control plane telemetry data may require the ability to access the Forwarding Information Base (FIB) of the data plane.¶
On the other hand, an application may involve more than one plane and interact with multiple planes simultaneously. For example, an SLA compliance application may require both the data plane telemetry and the control plane telemetry.¶
The requirements and challenges for each module are summarized as follows (note that the requirements may pertain across all telemetry modules; however, we emphasize those that are most pronounced for a particular plane).¶
3.1.1. Management Plane Telemetry
The management plane of network elements interacts with the Network Management System (NMS) and provides information such as performance data, network logging data, network warning and defects data, and network statistics and state data. The management plane includes many protocols, including the classical SNMP and syslog. Regardless of the protocol, management plane telemetry must address the following requirements:¶
3.1.2. Control Plane Telemetry
The control plane telemetry refers to the health condition monitoring of different network control protocols at all layers of the protocol stack. Keeping track of the operational status of these protocols is beneficial for detecting, localizing, and even predicting various network issues, as well as for network optimization, in real time and with fine granularity. Some particular challenges and issues faced by the control plane telemetry are as follows:¶
Note that the requirement and solutions for network congestion avoidance are also applicable to the control plane telemetry.¶
3.1.3. Forwarding Plane Telemetry
An effective forwarding plane telemetry system relies on the data that the network device can expose. The quality, quantity, and timeliness of data must meet some stringent requirements. This raises some challenges for the network data plane devices where the first-hand data originates.¶
Although not specific to the forwarding plane, these challenges are more difficult for the forwarding plane because of the limited resources and flexibility. Data plane programmability is essential to support network telemetry. Newer data plane forwarding chips are equipped with advanced telemetry features and provide flexibility to support customized telemetry functions.¶
Technique Taxonomy: This pertains to how one instruments the telemetry; there can be multiple possible dimensions to classify the forwarding plane telemetry techniques.¶
3.1.4. External Data Telemetry
Events that occur outside the boundaries of the network system are another important source of network telemetry. Correlating both internal telemetry data and external events with the requirements of network systems, as presented in [NMRG
As with other sources of telemetry information, the data and events must meet strict requirements, especially in terms of timeliness, which is essential to properly incorporate external event information into network management applications. The specific challenges are described as follows:¶
Organizing both internal and external telemetry information together will be key for the general exploitation of the management possibilities of current and future network systems, as reflected in the incorporation of cognitive capabilities to new hardware and software (virtual) elements.¶
3.2. Second-Level Function Components
The telemetry module at each plane can be further partitioned into five distinct conceptual components:¶
3.3. Data Acquisition Mechanism and Type Abstraction
Broadly speaking, network data can be acquired through subscription (push) and query (poll). A subscription is a contract between publisher and subscriber. After initial setup, the subscribed data is automatically delivered to registered subscribers until the subscription expires. There are two variations of subscription. The subscriptions can be predefined, or the subscribers are allowed to configure and tailor the published data to their specific needs.¶
In contrast, queries are used when a client expects immediate and one-off feedback from network devices. The queried data may be directly extracted from some specific data source or synthesized and processed from raw data. Queries work well for interactive network telemetry applications.¶
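The two acquisition mechanisms can be contrasted with a minimal Python sketch. It is an illustration only: the `DataSource` and `Subscription` names, the path string, and the expiry rule (a subscription expires after a fixed number of delivered updates) are assumptions made for this example and do not correspond to any specific protocol.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Subscription:
    path: str
    callback: Callable[[str, int], None]
    remaining: int  # assumed expiry rule: number of updates still to deliver


class DataSource:
    """Toy telemetry data source that supports both push and poll."""

    def __init__(self) -> None:
        self._values: Dict[str, int] = {}
        self._subs: List[Subscription] = []

    def subscribe(self, path: str, callback: Callable[[str, int], None],
                  max_updates: int) -> None:
        """Push model: set up once, then data is delivered automatically."""
        self._subs.append(Subscription(path, callback, max_updates))

    def query(self, path: str) -> Optional[int]:
        """Poll model: immediate, one-off read of the current value."""
        return self._values.get(path)

    def update(self, path: str, value: int) -> None:
        """Called when the underlying data changes; notifies live subscribers."""
        self._values[path] = value
        for sub in self._subs:
            if sub.path == path and sub.remaining > 0:
                sub.callback(path, value)
                sub.remaining -= 1
        # Drop subscriptions that have expired.
        self._subs = [s for s in self._subs if s.remaining > 0]


source = DataSource()
received: List[Tuple[str, int]] = []
source.subscribe("if/eth0/in-octets",
                 lambda path, value: received.append((path, value)),
                 max_updates=2)

source.update("if/eth0/in-octets", 100)
source.update("if/eth0/in-octets", 250)
source.update("if/eth0/in-octets", 400)  # subscription has expired by now

snapshot = source.query("if/eth0/in-octets")
print(received, snapshot)
```

The subscriber receives the first two changes without asking again, while the later query returns only the value at the moment it is issued.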
In general, data can be pulled (i.e., queried) whenever needed, but in many cases, pushing the data (i.e., subscription) is more efficient, and it can reduce the latency of a client detecting a change. From the data consumer point of view, there are four types of data from network devices that a telemetry data consumer can subscribe to or query:¶
The above telemetry data types are not mutually exclusive. Rather, they are often composite. Derived data is composed of simple data; event-triggered data can be simple or derived; and streaming data can be based on some recurring event. The relationships of these data types are illustrated in Figure 3.¶
Subscription usually deals with event-triggered data and streaming data, and query usually deals with simple data and derived data. However, the other combinations are also possible. Advanced network telemetry techniques are designed mainly for event-triggered or streaming data subscription and derived data query.¶
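The composite nature of the data types named above (simple, derived, event-triggered, and streaming) can be sketched in a few lines of Python. This is a toy model under stated assumptions: the counter names, the drop-ratio formula, the threshold, and all function names are hypothetical and serve only to show how one type builds on another.

```python
import math
from typing import Callable, Dict, Iterator, List, Optional

# Simulated device counters (assumed values).
counters: Dict[str, int] = {"rx_packets": 1000, "rx_drops": 25}


def simple(name: str) -> int:
    """Simple data: read directly from a data source."""
    return counters[name]


def drop_ratio() -> float:
    """Derived data: synthesized from simple data."""
    return simple("rx_drops") / simple("rx_packets")


def on_event(condition: Callable[[], bool],
             data: Callable[[], float]) -> Optional[float]:
    """Event-triggered data: reported only when the condition holds."""
    return data() if condition() else None


def stream(data: Callable[[], float], ticks: int) -> Iterator[float]:
    """Streaming data: one report per recurring event (here, a timer tick)."""
    for _ in range(ticks):
        yield data()


alarm = on_event(lambda: drop_ratio() > 0.01, drop_ratio)
quiet = on_event(lambda: drop_ratio() > 0.5, drop_ratio)
samples: List[float] = list(stream(drop_ratio, 3))
print(alarm, quiet, samples)
```

Here the event-triggered report carries derived data, and the stream is driven by a recurring event, matching the relationships described in the text.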
3.4. Mapping Existing Mechanisms into the Framework
The following table shows how the existing mechanisms (mainly published in IETF and with the emphasis on the latest new technologies) are positioned in the framework. Given the vast body of existing work, we cannot provide an exhaustive list, so the mechanisms in the tables should be considered as just examples. Also, some comprehensive protocols and techniques may cover multiple aspects or modules of the framework, so a name in a block only emphasizes one particular characteristic of it. More details about some listed mechanisms can be found in Appendix A.¶
Although the framework is generally suitable for any network environments, the multi-domain telemetry has some unique challenges that deserve further architectural consideration, which is out of the scope of this document.¶
4. Evolution of Network Telemetry Applications
Network telemetry is an evolving technical area. As networks move toward automated operation, network telemetry applications undergo several stages of evolution, each of which adds a new layer of requirements to the underlying network telemetry techniques. Each stage builds upon the techniques adopted by the previous stages and adds some new requirements.¶
- Stage 0 - Static Telemetry:
- The telemetry data source and type are determined at design time. The network operator can only configure how to use it with limited flexibility.¶
- Stage 1 - Dynamic Telemetry:
- The custom telemetry data can be dynamically programmed or configured at runtime without interrupting the network operation, allowing a trade-off among resource, performance, flexibility, and coverage.¶
- Stage 2 - Interactive Telemetry:
- The network operator can continuously customize and fine-tune the telemetry data in real time to reflect the network operation's visibility requirements. Compared with Stage 1, the changes are frequent and based on real-time feedback. At this stage, some tasks can be automated, but human operators still need to sit in the middle to make decisions.¶
- Stage 3 - Closed-Loop Telemetry:
- The telemetry is free from the interference of human operators, except for generating the reports. The intelligent network operation engine automatically issues the telemetry data requests, analyzes the data, and updates the network operations in closed control loops.¶
Existing technologies are ready for Stages 0 and 1. Individual applications for Stages 2 and 3 are also possible now. However, the future autonomic networks may need a comprehensive operation management system that works at Stages 2 and 3 to cover all the network operation tasks. A well-defined network telemetry framework is the first step towards this direction.¶
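To make the Stage 3 control loop more concrete, the following Python sketch shows one pass through the request, analyze, and update steps, repeated until the observed metric is acceptable. It is a toy model and not a description of any real operation engine: the `closed_loop` function, the `link_utilization` key, the threshold, and the corrective action (shifting 20% of the load away) are all assumptions made for illustration.

```python
from typing import Dict, List


def closed_loop(network: Dict[str, float], threshold: float,
                max_rounds: int) -> List[str]:
    """Run a toy closed control loop without human intervention."""
    log: List[str] = []
    for round_no in range(max_rounds):
        # 1) Issue the telemetry data request (here, a direct read).
        utilization = network["link_utilization"]
        # 2) Analyze the data.
        if utilization <= threshold:
            log.append(f"round {round_no}: ok ({utilization:.2f})")
            break
        # 3) Update the network operation (assumed action: move 20% of load).
        network["link_utilization"] = utilization * 0.8
        log.append(f"round {round_no}: rerouted ({utilization:.2f})")
    return log


network = {"link_utilization": 0.95}
report = closed_loop(network, threshold=0.7, max_rounds=10)
for line in report:
    print(line)
```

The only output intended for a human is the report at the end, which mirrors the description of Stage 3 as free from operator interference except for report generation.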
5. Security Considerations
The complexity of network telemetry raises significant security implications. For example, telemetry data can be manipulated to exhaust various network resources at each plane as well as the data consumer; falsified or tampered data can mislead the decision-making process and paralyze networks; and wrong configuration and programming for telemetry is equally harmful. The telemetry data is highly sensitive, which exposes a lot of information about the network and its configuration. Some of that information can make designing attacks against the network much easier (e.g., exact details of what software and patches have been installed) and allows an attacker to determine whether a device may be subject to unprotected security vulnerabilities
Given that this document has proposed a framework for network telemetry and the telemetry mechanisms discussed are more extensive (in both message frequency and traffic amount) than the conventional network OAM concepts, we must also anticipate that new security considerations may arise. A number of techniques already exist for securing the forwarding plane, control plane, and management plane in a network, but it is important to consider whether any new threat vectors are now being enabled via the use of network telemetry procedures and mechanisms.¶
This document proposes a conceptual architecture for collecting, transporting, and analyzing a wide variety of data sources in support of network applications. The protocols, data formats, and configurations chosen to implement this framework will dictate the specific security considerations. These considerations may include:¶
Some security considerations highlighted above may be minimized or negated with policy management of network telemetry. In a network telemetry deployment, it would be advantageous to separate telemetry capabilities into different classes of policies, i.e., Role-Based Access Control and Event
Further study of the security issues will be required, and it is expected that the security mechanisms and protocols are developed and deployed along with a network telemetry system.¶
6. IANA Considerations
This document has no IANA actions.¶
7. Informative References
- [gnmi]
- Shakir, R., Shaikh, A., Borman, P., Hines, M., Lebsack, C., and C. Marrow, "gRPC Network Management Interface", IETF 98, <https://datatracker.ietf.org/meeting/98/materials/slides-98-rtgwg-gnmi-intro-draft-openconfig-rtgwg-gnmi-spec-00>.
- [gpb]
- Google Developers, "Protocol Buffers", <https://developers.google.com/protocol-buffers>.
- [grpc]
- gRPC, "gRPC: A high performance, open source universal RPC framework", <https://grpc.io>.
- [IPPM-IOAM-DIRECT-EXPORT]
- Song, H., Gafni, B., Zhou, T., Li, Z., Brockners, F., Bhandari, S., Ed., Sivakolundu, R., and T. Mizrahi, Ed., "In-situ OAM Direct Exporting", Work in Progress, Internet-Draft, draft-ietf-ippm-ioam-direct-export-07, <https://datatracker.ietf.org/doc/html/draft-ietf-ippm-ioam-direct-export-07>.
- [IPPM-POSTCARD-BASED-TELEMETRY]
- Song, H., Mirsky, G., Filsfils, C., Abdelsalam, A., Zhou, T., Li, Z., Mishra, G., Shin, J., and K. Lee, "In-Situ OAM Marking-based Direct Export", Work in Progress, Internet-Draft, draft-song-ippm-postcard-based-telemetry-12, <https://datatracker.ietf.org/doc/html/draft-song-ippm-postcard-based-telemetry-12>.
- [NETCONF-DISTRIB-NOTIF]
- Zhou, T., Zheng, G., Voit, E., Graf, T., and P. Francois, "Subscription to Distributed Notifications", Work in Progress, Internet-Draft, draft-ietf-netconf-distributed-notif-03, <https://datatracker.ietf.org/doc/html/draft-ietf-netconf-distributed-notif-03>.
- [NETCONF-UDP-NOTIF]
- Zheng, G., Zhou, T., Graf, T., Francois, P., Feng, A. H., and P. Lucente, "UDP-based Transport for Configured Subscriptions", Work in Progress, Internet-Draft, draft-ietf-netconf-udp-notif-05, <https://datatracker.ietf.org/doc/html/draft-ietf-netconf-udp-notif-05>.
- [NETMOD-ECA-POLICY]
- Wu, Q., Bryskin, I., Birkholz, H., Liu, X., and B. Claise, "A YANG Data model for ECA Policy Management", Work in Progress, Internet-Draft, draft-ietf-netmod-eca-policy-01, <https://datatracker.ietf.org/doc/html/draft-ietf-netmod-eca-policy-01>.
- [NMRG-ANTICIPATED-ADAPTATION]
- Martinez-Julia, P., Ed., "Exploiting External Event Detectors to Anticipate Resource Requirements for the Elastic Adaptation of SDN/NFV Systems", Work in Progress, Internet-Draft, draft-pedro-nmrg-anticipated-adaptation-02, <https://datatracker.ietf.org/doc/html/draft-pedro-nmrg-anticipated-adaptation-02>.
- [NMRG-IBN-CONCEPTS-DEFINITIONS]
- Clemm, A., Ciavaglia, L., Granville, L. Z., and J. Tantsura, "Intent-Based Networking - Concepts and Definitions", Work in Progress, Internet-Draft, draft-irtf-nmrg-ibn-concepts-definitions-09, <https://datatracker.ietf.org/doc/html/draft-irtf-nmrg-ibn-concepts-definitions-09>.
- [OPSAWG-DNP4IQ]
- Song, H., Ed. and J. Gong, "Requirements for Interactive Query with Dynamic Network Probes", Work in Progress, Internet-Draft, draft-song-opsawg-dnp4iq-01, <https://datatracker.ietf.org/doc/html/draft-song-opsawg-dnp4iq-01>.
- [OPSAWG-IFIT-FRAMEWORK]
- Song, H., Qin, F., Chen, H., Jin, J., and J. Shin, "A Framework for In-situ Flow Information Telemetry", Work in Progress, Internet-Draft, draft-song-opsawg-ifit-framework-17, <https://datatracker.ietf.org/doc/html/draft-song-opsawg-ifit-framework-17>.
- [RFC1157]
- Case, J., Fedor, M., Schoffstall, M., and J. Davin, "Simple Network Management Protocol (SNMP)", RFC 1157, DOI 10.17487/RFC1157, <https://www.rfc-editor.org/info/rfc1157>.
- [RFC2578]
-
McCloghrie, K., Ed., Perkins, D., Ed., and J. Schoenwaelder, Ed., "Structure of Management Information Version 2 (SMIv2)", STD 58, RFC 2578, DOI 10
.17487 , , <https:///RFC2578 www >..rfc -editor .org /info /rfc2578 - [RFC2981]
-
Kavasseri, R., Ed., "Event MIB", RFC 2981, DOI 10
.17487 , , <https:///RFC2981 www >..rfc -editor .org /info /rfc2981 - [RFC3176]
-
Phaal, P., Panchen, S., and N. McKee, "InMon Corporation's sFlow: A Method for Monitoring Traffic in Switched and Routed Networks", RFC 3176, DOI 10
.17487 , , <https:///RFC3176 www >..rfc -editor .org /info /rfc3176 - [RFC3411]
-
Harrington, D., Presuhn, R., and B. Wijnen, "An Architecture for Describing Simple Network Management Protocol (SNMP) Management Frameworks", STD 62, RFC 3411, DOI 10
.17487 , , <https:///RFC3411 www >..rfc -editor .org /info /rfc3411 - [RFC3416]
-
Presuhn, R., Ed., "Version 2 of the Protocol Operations for the Simple Network Management Protocol (SNMP)", STD 62, RFC 3416, DOI 10
.17487 , , <https:///RFC3416 www >..rfc -editor .org /info /rfc3416 - [RFC3877]
-
Chisholm, S. and D. Romascanu, "Alarm Management Information Base (MIB)", RFC 3877, DOI 10
.17487 , , <https:///RFC3877 www >..rfc -editor .org /info /rfc3877 - [RFC3954]
-
Claise, B., Ed., "Cisco Systems NetFlow Services Export Version 9", RFC 3954, DOI 10
.17487 , , <https:///RFC3954 www >..rfc -editor .org /info /rfc3954 - [RFC4656]
-
Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M. Zekauskas, "A One-way Active Measurement Protocol (OWAMP)", RFC 4656, DOI 10
.17487 , , <https:///RFC4656 www >..rfc -editor .org /info /rfc4656 - [RFC5085]
-
Nadeau, T., Ed. and C. Pignataro, Ed., "Pseudowire Virtual Circuit Connectivity Verification (VCCV): A Control Channel for Pseudowires", RFC 5085, DOI 10
.17487 , , <https:///RFC5085 www >..rfc -editor .org /info /rfc5085 - [RFC5357]
-
Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J. Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)", RFC 5357, DOI 10
.17487 , , <https:///RFC5357 www >..rfc -editor .org /info /rfc5357 - [RFC5424]
-
Gerhards, R., "The Syslog Protocol", RFC 5424, DOI 10
.17487 , , <https:///RFC5424 www >..rfc -editor .org /info /rfc5424 - [RFC6020]
-
Bjorklund, M., Ed., "YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF)", RFC 6020, DOI 10
.17487 , , <https:///RFC6020 www >..rfc -editor .org /info /rfc6020 - [RFC6241]
-
Enns, R., Ed., Bjorklund, M., Ed., Schoenwaelder, J., Ed., and A. Bierman, Ed., "Network Configuration Protocol (NETCONF)", RFC 6241, DOI 10
.17487 , , <https:///RFC6241 www >..rfc -editor .org /info /rfc6241 - [RFC6812]
-
Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare, S., and E. Yedavalli, "Cisco Service-Level Assurance Protocol", RFC 6812, DOI 10
.17487 , , <https:///RFC6812 www >..rfc -editor .org /info /rfc6812 - [RFC7011]
-
Claise, B., Ed., Trammell, B., Ed., and P. Aitken, "Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information", STD 77, RFC 7011, DOI 10
.17487 , , <https:///RFC7011 www >..rfc -editor .org /info /rfc7011 - [RFC7258]
-
Farrell, S. and H. Tschofenig, "Pervasive Monitoring Is an Attack", BCP 188, RFC 7258, DOI 10
.17487 , , <https:///RFC7258 www >..rfc -editor .org /info /rfc7258 - [RFC7276]
-
Mizrahi, T., Sprecher, N., Bellagamba, E., and Y. Weingarten, "An Overview of Operations, Administration, and Maintenance (OAM) Tools", RFC 7276, DOI 10
.17487 , , <https:///RFC7276 www >..rfc -editor .org /info /rfc7276 - [RFC7540]
-
Belshe, M., Peon, R., and M. Thomson, Ed., "Hypertext Transfer Protocol Version 2 (HTTP/2)", RFC 7540, DOI 10
.17487 , , <https:///RFC7540 www >..rfc -editor .org /info /rfc7540 - [RFC7575]
-
Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A., Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic Networking: Definitions and Design Goals", RFC 7575, DOI 10
.17487 , , <https:///RFC7575 www >..rfc -editor .org /info /rfc7575 - [RFC7799]
-
Morton, A., "Active and Passive Metrics and Methods (with Hybrid Types In-Between)", RFC 7799, DOI 10
.17487 , , <https:///RFC7799 www >..rfc -editor .org /info /rfc7799 - [RFC7854]
-
Scudder, J., Ed., Fernando, R., and S. Stuart, "BGP Monitoring Protocol (BMP)", RFC 7854, DOI 10
.17487 , , <https:///RFC7854 www >..rfc -editor .org /info /rfc7854 - [RFC7950]
-
Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language", RFC 7950, DOI 10
.17487 , , <https:///RFC7950 www >..rfc -editor .org /info /rfc7950 - [RFC8040]
-
Bierman, A., Bjorklund, M., and K. Watsen, "RESTCONF Protocol", RFC 8040, DOI 10
.17487 , , <https:///RFC8040 www >..rfc -editor .org /info /rfc8040 - [RFC8084]
-
Fairhurst, G., "Network Transport Circuit Breakers", BCP 208, RFC 8084, DOI 10
.17487 , , <https:///RFC8084 www >..rfc -editor .org /info /rfc8084 - [RFC8085]
-
Eggert, L., Fairhurst, G., and G. Shepherd, "UDP Usage Guidelines", BCP 145, RFC 8085, DOI 10
.17487 , , <https:///RFC8085 www >..rfc -editor .org /info /rfc8085 - [RFC8259]
-
Bray, T., Ed., "The JavaScript Object Notation (JSON) Data Interchange Format", STD 90, RFC 8259, DOI 10
.17487 , , <https:///RFC8259 www >..rfc -editor .org /info /rfc8259 - [RFC8321]
-
Fioccola, G., Ed., Capello, A., Cociglio, M., Castaldelli, L., Chen, M., Zheng, L., Mirsky, G., and T. Mizrahi, "Alternate
-Marking Method for Passive and Hybrid Performance Monitoring" , RFC 8321, DOI 10.17487 , , <https:///RFC8321 www >..rfc -editor .org /info /rfc8321 - [RFC8639]
-
Voit, E., Clemm, A., Gonzalez Prieto, A., Nilsen-Nygaard, E., and A. Tripathy, "Subscription to YANG Notifications", RFC 8639, DOI 10
.17487 , , <https:///RFC8639 www >..rfc -editor .org /info /rfc8639 - [RFC8641]
-
Clemm, A. and E. Voit, "Subscription to YANG Notifications for Datastore Updates", RFC 8641, DOI 10
.17487 , , <https:///RFC8641 www >..rfc -editor .org /info /rfc8641 - [RFC8671]
-
Evens, T., Bayraktar, S., Lucente, P., Mi, P., and S. Zhuang, "Support for Adj-RIB-Out in the BGP Monitoring Protocol (BMP)", RFC 8671, DOI 10
.17487 , , <https:///RFC8671 www >..rfc -editor .org /info /rfc8671 - [RFC8762]
-
Mirsky, G., Jun, G., Nydell, H., and R. Foote, "Simple Two-Way Active Measurement Protocol", RFC 8762, DOI 10
.17487 , , <https:///RFC8762 www >..rfc -editor .org /info /rfc8762 - [RFC8889]
-
Fioccola, G., Ed., Cociglio, M., Sapio, A., and R. Sisto, "Multipoint Alternate
-Marking Method for Passive and Hybrid Performance Monitoring" , RFC 8889, DOI 10.17487 , , <https:///RFC8889 www >..rfc -editor .org /info /rfc8889 - [RFC8924]
-
Aldrin, S., Pignataro, C., Ed., Kumar, N., Ed., Krishnan, R., and A. Ghanwani, "Service Function Chaining (SFC) Operations, Administration, and Maintenance (OAM) Framework", RFC 8924, DOI 10
.17487 , , <https:///RFC8924 www >..rfc -editor .org /info /rfc8924 - [RFC9069]
-
Evens, T., Bayraktar, S., Bhardwaj, M., and P. Lucente, "Support for Local RIB in the BGP Monitoring Protocol (BMP)", RFC 9069, DOI 10
.17487 , , <https:///RFC9069 www >..rfc -editor .org /info /rfc9069 - [RFC9197]
-
Brockners, F., Ed., Bhandari, S., Ed., and T. Mizrahi, Ed., "Data Fields for In Situ Operations, Administration, and Maintenance (IOAM)", RFC 9197, DOI 10
.17487 , , <https:///RFC9197 www >..rfc -editor .org /info /rfc9197 - [W3C
.REC -xml -20081126] -
Bray, T., Paoli, J., Sperberg
-Mc , Maler, E., and F. Yergeau, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", World Wide Web Consortium Recommendation RECQueen, M. -xml , , <https://-20081126 www >..w3 .org /TR /2008 /REC -xml -20081126 - [y1731]
-
ITU-T, "Operations, administration and maintenance (OAM) functions and mechanisms for Ethernet-based networks", ITU-T Recommendation G.8013/Y.1731, , <https://
www >..itu .int /rec /T -REC -Y .1731 /en
Appendix A. A Survey on Existing Network Telemetry Techniques
In this non-normative appendix, we provide an overview of some existing techniques and standard proposals for each network telemetry module.¶
A.1. Management Plane Telemetry
A.1.1. Push Extensions for NETCONF
NETCONF [RFC6241] is a popular network management protocol recommended by the IETF. Its core strength is configuration management, but it can also be used for data collection. YANG-Push [RFC8639] [RFC8641] extends NETCONF and enables subscriber applications to request a continuous, customized stream of updates from a YANG datastore. Providing such visibility into changes made to YANG configuration and operational objects enables new capabilities based on the remote mirroring of configuration and operational state. Moreover, a distributed data collection mechanism [NETCONF
A.1.2. gRPC Network Management Interface
gRPC Network Management Interface (gNMI) [gnmi] is a network management protocol based on the gRPC [grpc] Remote Procedure Call (RPC) framework. With a single gRPC service definition, both configuration and telemetry can be covered. gRPC is an open-source micro-service communication framework based on HTTP/2 [RFC7540]. It provides a number of capabilities that are well-suited for network telemetry, including:¶
A.2. Control Plane Telemetry
A.2.1. BGP Monitoring Protocol
BMP [RFC7854] is used to monitor BGP sessions and is intended to provide a convenient interface for obtaining route views.¶
BGP routing information is collected from the monitored device(s) by the BMP monitoring station over a BMP TCP session. BGP peers are monitored through the BMP Peer Up and Peer Down notifications. BGP routes (including Adj-RIB-In [RFC7854], Adj-RIB-Out [RFC8671], and Local RIB [RFC9069]) are encapsulated in the BMP Route Monitoring Message and the BMP Route Mirroring Message, providing both an initial table dump and real-time route updates. In addition, BGP statistics are reported through the BMP Stats Report Message, which can be either timer-triggered or event-driven. Future BMP extensions could further enrich BGP monitoring applications.¶
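As an illustration of how a monitoring station might tell these messages apart, the following minimal Python sketch decodes the 6-octet BMP common header (version, message length, message type) defined in [RFC7854]. The function name and the returned dictionary layout are assumptions made for this example only; parsing of the per-peer header and of the message bodies is omitted.

```python
import struct

# Message type codes as assigned in RFC 7854.
BMP_MESSAGE_TYPES = {
    0: "Route Monitoring",
    1: "Statistics Report",
    2: "Peer Down Notification",
    3: "Peer Up Notification",
    4: "Initiation",
    5: "Termination",
    6: "Route Mirroring",
}

# Network byte order: 1-octet version, 4-octet length, 1-octet type.
BMP_COMMON_HEADER = struct.Struct("!BIB")


def parse_bmp_common_header(data: bytes) -> dict:
    """Decode the BMP common header (hypothetical helper for this sketch)."""
    if len(data) < BMP_COMMON_HEADER.size:
        raise ValueError("need at least 6 octets for the BMP common header")
    version, length, msg_type = BMP_COMMON_HEADER.unpack_from(data)
    return {
        "version": version,
        "length": length,  # total message length, header included
        "type": BMP_MESSAGE_TYPES.get(msg_type, f"Unknown ({msg_type})"),
    }
```

A station reading from the TCP session would typically read 6 octets, decode them as above, and then read the remaining `length - 6` octets of the message before dispatching on the type.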
A.3. Data Plane Telemetry
A.3.1. Alternate-Marking (AM) Technology
The Alternate
This technique can be applied to point-to-point and multipoint
The Alternate
Since networks offer rich sets of network performance measurement data (e.g., packet counters), conventional approaches run into limitations. The bottleneck is the generation and export of the data and the amount of data that can be reasonably collected from the network. In addition, management tasks related to determining and configuring which data to generate lead to significant deployment challenges.¶
The Multipoint Alternate
An application orchestrates network performance measurement tasks across the network to allow for optimized monitoring. The application can choose how roughly or precisely to configure measurement points depending on the application's requirements.¶
Using Alternate Marking, it is possible to monitor a Multipoint Network without in-depth examination by using Network Clustering (subnetworks that are portions of the entire network that preserve the same property of the entire network, called clusters). So in the case where there is packet loss or the delay is too high, the specific filtering criteria could be applied to gather a more detailed analysis by using a different combination of clusters up to a per-flow measurement as described in the Alternate
In summary, an application can configure end-to-end network monitoring. If the network does not experience issues, this approximate monitoring is good enough and is very cheap in terms of network resources. However, in case of problems, the application becomes aware of the issues from this approximate monitoring and, in order to localize the portion of the network that has issues, configures the measurement points more extensively, allowing more detailed monitoring to be performed. After the detection and resolution of the problem, the initial approximate monitoring can be used again.¶
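The coarse-to-fine behavior summarized above can be sketched in Python as follows. The sketch assumes that each cluster reports, per marking period, the number of packets counted at its ingress and egress measurement points; the class name, function names, and the 1% loss threshold are hypothetical choices for illustration and are not taken from [RFC8321] or [RFC8889].

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BlockCounters:
    """Packet counts for one marking period of one cluster."""
    sent: int      # packets counted entering the cluster
    received: int  # packets counted leaving the cluster


def loss_ratio(blocks: List[BlockCounters]) -> float:
    """Fraction of packets lost across the given marking periods."""
    sent = sum(b.sent for b in blocks)
    lost = sum(b.sent - b.received for b in blocks)
    return lost / sent if sent else 0.0


def clusters_to_refine(clusters: Dict[str, List[BlockCounters]],
                       threshold: float = 0.01) -> List[str]:
    """Return the clusters whose loss exceeds the (assumed) threshold.

    Only these clusters would be reconfigured with more measurement
    points; the others keep the cheap, approximate monitoring.
    """
    return sorted(name for name, blocks in clusters.items()
                  if loss_ratio(blocks) > threshold)
```

Once the reported loss of a refined cluster drops back below the threshold, the application can revert that cluster to the initial approximate monitoring.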
A.3.2. Dynamic Network Probe
A hardware-based Dynamic Network Probe (DNP) [OPSAWG-DNP4IQ] provides a programmable means to customize the data that an application collects from the data plane. A direct benefit of DNP is the reduction of the exported data. A full DNP solution covers several components including data source, data subscription, and data generation. The data subscription defines the derived data that can be composed from raw data sources. Data generation takes advantage of moderate in-network computing to produce the desired data.¶
While DNP can bring considerable flexibility to data plane telemetry, it also faces some challenges. It requires a flexible data plane that can be dynamically reprogrammed at runtime. The programming Application Programming Interface (API) is yet to be defined.¶
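Because the programming API is not yet defined, the following Python sketch is purely illustrative. It models a subscription as two application-supplied functions, one that derives a value from a raw sample and one that decides whether the value is worth exporting, to show how in-network computation can reduce the exported data. All names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

RawSample = Dict[str, float]


def run_probe(samples: Iterable[RawSample],
              derive: Callable[[RawSample], float],
              export_if: Callable[[float], bool]) -> List[float]:
    """Apply a (hypothetical) probe to raw data plane samples.

    derive    -- composes the derived datum from a raw sample
    export_if -- condition under which the datum is exported
    """
    exported = []
    for sample in samples:
        value = derive(sample)
        if export_if(value):
            exported.append(value)
    return exported
```

For example, a probe that derives queue utilization from raw depth and capacity readings and exports only values above 80% would discard most samples on a lightly loaded device.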
A.3.3. IP Flow Information Export (IPFIX) Protocol
Traffic on a network can be seen as a set of flows passing through network elements. IPFIX [RFC7011] provides a means of transmitting traffic flow information for administrative or other purposes. A typical IPFIX-enabled system includes a pool of Metering Processes that collect data packets at one or more Observation Points, optionally filter them, and aggregate information about these packets. An Exporter then gathers each of the Observation Points together into an Observation Domain and sends this information via the IPFIX protocol to a Collector.¶
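The pipeline described above can be sketched in Python as follows. The `Packet` type, the `meter` function, and the choice of flow key are assumptions made for this example. The 16-octet message header layout (version 10, length, export time, sequence number, Observation Domain ID) follows [RFC7011]. Template Sets and Data Sets, which a real Exporter must also encode, are omitted.

```python
import struct
from collections import defaultdict
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Simplified observed packet (hypothetical type for this sketch)."""
    src: str
    dst: str
    proto: int
    length: int


FlowKey = Tuple[str, str, int]


def meter(packets: Iterable[Packet],
          keep: Callable[[Packet], bool] = lambda p: True
          ) -> Dict[FlowKey, Dict[str, int]]:
    """Metering Process sketch: filter packets, then aggregate per flow."""
    flows = defaultdict(lambda: {"packets": 0, "octets": 0})
    for p in packets:
        if not keep(p):
            continue
        key = (p.src, p.dst, p.proto)
        flows[key]["packets"] += 1
        flows[key]["octets"] += p.length
    return dict(flows)


def ipfix_message_header(payload_len: int, export_time: int,
                         sequence: int, domain_id: int) -> bytes:
    """Build the IPFIX message header; length covers header plus payload."""
    return struct.pack("!HHIII", 10, 16 + payload_len,
                       export_time, sequence, domain_id)
```

An Exporter would prepend the header to the encoded Sets for one Observation Domain and send the resulting message to the Collector.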
A.3.4. In Situ OAM
Classical passive and active monitoring and measurement techniques are either inaccurate or resource consuming. It is preferable to directly acquire data associated with a flow's packets when the packets pass through a network. IOAM [RFC9197], a data generation technique, embeds a new instruction header in user packets, and the instruction directs the network nodes to add the requested data to the packets. Thus, at the path's end, the packet's experience gained on the entire forwarding path can be collected. Such firsthand data is invaluable to many network OAM applications.¶
However, IOAM also faces some challenges. Issues of performance impact, security, scalability and overhead limits, encapsulation difficulties in some protocols, and cross-domain deployment need to be addressed.¶
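The per-hop collection that IOAM performs can be illustrated with a small Python simulation. This is a conceptual sketch only: the classes, the recorded fields, and the fixed per-hop delay are assumptions for illustration, and the actual binary option formats of [RFC9197] are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Packet:
    payload: bytes
    # Data added by each node on the path (hypothetical representation).
    ioam_trace: List[Dict[str, int]] = field(default_factory=list)


@dataclass
class Node:
    node_id: int
    queue_depth: int

    def forward(self, packet: Packet, timestamp_ns: int) -> Packet:
        """Append the requested data to the packet, then pass it on."""
        packet.ioam_trace.append({
            "node_id": self.node_id,
            "queue_depth": self.queue_depth,
            "timestamp_ns": timestamp_ns,
        })
        return packet


def traverse(packet: Packet, path: List[Node],
             start_ns: int = 0, hop_delay_ns: int = 1000) -> List[Dict[str, int]]:
    """Send the packet along the path and collect the trace at the end."""
    for hop, node in enumerate(path):
        node.forward(packet, start_ns + hop * hop_delay_ns)
    # The last node strips the trace before delivering the user packet.
    trace, packet.ioam_trace = packet.ioam_trace, []
    return trace
```

The sketch also hints at the overhead concern: the trace grows with every hop, so the space carried in each packet scales with path length.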
A.3.5. Postcard-Based Telemetry
The postcard-based telemetry, as embodied in IOAM Direct Export (DEX) [IPPM
A.3.6. Existing OAM for Specific Data Planes
Various data planes raise unique OAM requirements. The IETF has published OAM technique and framework documents (e.g., [RFC8924] and [RFC5085]) targeting different data planes such as Multiprotocol Label Switching (MPLS), L2 Virtual Private Network (VPN), Network Virtualization over Layer 3 (NVO3), Virtual Extensible LAN (VXLAN), Bit Index Explicit Replication (BIER), Service Function Chaining (SFC), Segment Routing (SR), and Deterministic Networking (DetNet). The aforementioned data plane telemetry techniques can be used to enhance the OAM capability on such data planes.¶
A.4. External Data and Event Telemetry
A.4.1. Sources of External Events
To ensure that the information provided by external event detectors and used by the network management solutions is meaningful for management purposes, the network telemetry framework must ensure that such detectors (sources) are easily connected to the management solutions (sinks). This requires the specification of a list of potential external data sources that could be of interest in network management and matching it to the connectors and/or interfaces required to connect them.¶
Categories of external event sources that may be of interest to network management include:¶
Additional detector types can be added to the system, but generally they will be the result of composing the properties offered by these main classes.¶
A.4.2. Connectors and Interfaces
To allow external event detectors to be properly integrated with other management solutions, both elements must expose interfaces and protocols suited to their particular objectives. Since external event detectors focus on providing their information to their main consumers, which generally are not limited to the network management solutions, the framework must include the definition of the connectors required to ensure that the interconnection between detectors (sources) and their consumers within the management systems (sinks) is effective.¶
In some situations, the interconnection between external event detectors and the management system is via the management plane. For those situations, there will be a special connector that provides the typical interfaces found in most other elements connected to the management plane. For instance, the interfaces could accomplish this with a specific data model (YANG) and a specific telemetry protocol, such as NETCONF, YANG-Push, or gRPC.¶
Acknowledgments
We would like to thank Rob Wilton, Greg Mirsky, Randy Presuhn, Joe Clarke, Victor Liu, James Guichard, Uri Blumenthal, Giuseppe Fioccola, Yunan Gu, Parviz Yegani, Young Lee, Qin Wu, Gyan Mishra, Ben Schwartz, Alexey Melnikov, Michael Scharf, Dhruv Dhody, Martin Duke, Roman Danyliw, Warren Kumari, Sheng Jiang, Lars Eggert, Éric Vyncke, Jean-Michel Combes, Erik Kline, Benjamin Kaduk, and many others who have provided helpful comments and suggestions to improve this document.¶
Contributors
The other contributors of this document are Tianran Zhou, Zhenbin Li, Zhenqiang Li, Daniel King, Adrian Farrel, and Alexander Clemm.¶