Evaluating Flow Features for Network Application Classification

Carlos Alcantara, University of Texas at El Paso


Communication networks provide the foundational services on which our modern economy depends. These services include data storage and transfer, video and voice telephony, gaming, multimedia streaming, remote invocation, and the World Wide Web. Communication networks are large-scale distributed systems composed of heterogeneous equipment. Because of this scale and heterogeneity, communication networks are cumbersome for human operators to manage (e.g., to configure, assess performance, and detect faults). With the emergence of easily accessible network data and machine learning algorithms, there is a great opportunity to move network management towards increasing automation. Network management automation reduces the likelihood of human error in network configuration, improves the productivity of network managers as repetitive tasks are automated, simplifies scaling, and gives greater insight into network operation. Network application classification, the process of identifying the network application associated with trains of packets called flows, is a critical task in the automation of network management. Associating network applications with network traffic improves network management in three ways: it allows application-specific policies to be set to optimize network operation, it enhances security by enabling firewall configurations that block certain applications, and it supports a more reliable quality of service by prioritizing time-sensitive applications. This work studies the classification performance of a basket of network flow features. We used three categories of flow features: inherent, derived, and engineered. In our first experimental analysis, we set out to measure the ability of the inherent and derived features to classify network flows. We developed an expert system to generate application labels that serve as training data, and we used these labels to train our models on two inherent features and one derived feature.
Flows were analyzed with three supervised machine learning techniques for classification: k-nearest neighbors (KNN), decision trees, and random forest. These experiments varied the number of applications and the type of flows considered (all flows or only large flows) in a traffic data capture from UKY's university network. For our subsequent experimental analysis, we engineered three flow features based on the host behavior presented by the authors of BLINC and examined their influence on traffic classification performance when combined with the features from the previous experiments. A new UKY data set was captured using deep packet inspection to obtain training labels, and the same three machine learning techniques were employed. In these subsequent experiments, we varied the feature set used for classification: every configuration included the three inherent and derived features, and one configuration added the three engineered features. Our initial experiments reveal that the inherent and derived features can adequately classify a subset of applications, while focusing on large flows slightly reduces performance. Our subsequent experimental analysis concludes that the engineered features provide a statistically significant improvement in classification performance for decision tree and random forest, while KNN is most effective with only the original three inherent and derived features.
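To make the classification setup concrete, the following is a minimal sketch of k-nearest-neighbors classification over three per-flow features, using only the Python standard library. The abstract does not name the features or list the data, so the feature choice here (destination port, protocol number, mean packet size), the labels, the sample values, and the function names `scale` and `knn_predict` are all hypothetical and serve only as an illustration; this is not the thesis's implementation.

```python
import math
from collections import Counter

# Hypothetical labeled flows: ((dst_port, protocol_number, mean_packet_bytes), label).
# Feature names and values are illustrative assumptions, not data from the thesis.
TRAIN = [
    ((443, 6, 900.0), "web"),
    ((443, 6, 1100.0), "web"),
    ((80, 6, 850.0), "web"),
    ((53, 17, 80.0), "dns"),
    ((53, 17, 95.0), "dns"),
    ((53, 17, 70.0), "dns"),
]


def scale(features, lo, hi):
    """Min-max scale each feature so no single feature dominates the distance."""
    return tuple(
        (f - l) / (h - l) if h > l else 0.0
        for f, l, h in zip(features, lo, hi)
    )


def knn_predict(train, query, k=3):
    """Return the majority label among the k training flows closest to `query`."""
    # Per-feature minimum and maximum, computed from the training flows only.
    columns = list(zip(*(x for x, _ in train)))
    lo = [min(c) for c in columns]
    hi = [max(c) for c in columns]

    q = scale(query, lo, hi)
    # Sort training flows by Euclidean distance in the scaled feature space.
    neighbors = sorted(
        train, key=lambda item: math.dist(scale(item[0], lo, hi), q)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(TRAIN, (53, 17, 85.0)))
    print(knn_predict(TRAIN, (443, 6, 1000.0)))
```

In this sketch, adding engineered features would amount to extending each feature tuple with further values. Because every feature contributes equally to the Euclidean distance after scaling, extra dimensions can shift which neighbors are nearest, which is one plausible reason a distance-based method might respond differently to added features than tree-based methods do.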

Subject Area

Computer Engineering|Artificial intelligence

Recommended Citation

Alcantara, Carlos, "Evaluating Flow Features for Network Application Classification" (2020). ETD Collection for University of Texas, El Paso. AAI27993783.