
Pinpointing Anomaly Events in Logs from Stability Testing - N-Grams vs. Deep-Learning

Abstract

As stability testing execution logs can be very long, software engineers need help in locating anomalous events to investigate when failures occur. We develop and evaluate two models for scoring individual log events for anomalousness: an N-Gram model and a Deep Learning model with LSTM cells. Both are trained on normal log sequences only. We evaluate the models on the long log sequences produced by stability testing at our case company and contrast that with the very short log sequences of the public HDFS dataset. We measure next-event prediction accuracy and computational efficiency. The N-Gram model is more accurate on stability testing logs (0.848 vs. 0.831), whereas the two models are almost identical in accuracy on HDFS logs (0.849 vs. 0.847). The N-Gram model is also far more computationally efficient than the Deep model (4 to 13 seconds vs. 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness appears to be a useful aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning for software system anomaly detection, we found no benefit in doing so here. Future work should consider whether this holds with different model configurations, other datasets, and other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
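As an illustration only (not the paper's implementation), the sketch below shows how an n-gram next-event scorer of this kind can work once logs are parsed into event-ID sequences: count contexts seen in normal runs, then score each event in a new run by its conditional probability given the preceding events, backing off to shorter contexts when a longer one was never observed. All names, the backoff scheme, and the toy data are hypothetical.

```python
# Illustrative sketch only: n-gram next-event scoring trained on
# normal event sequences, with backoff to shorter contexts.
from collections import Counter, defaultdict

def train_ngram(sequences, n=3):
    """Count (context -> next event) occurrences for context lengths 0..n-1."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, event in enumerate(seq):
            for order in range(n):  # context lengths 0 (unigram) .. n-1
                if i - order < 0:
                    break
                counts[tuple(seq[i - order:i])][event] += 1
    return counts

def score_events(counts, seq, n=3):
    """Score each event as P(event | context); low scores flag
    anomalous events worth investigating. Longest context tried first."""
    scores = []
    for i, event in enumerate(seq):
        for order in range(min(n - 1, i), -1, -1):
            context = tuple(seq[i - order:i])
            total = sum(counts[context].values())
            if total > 0:
                scores.append(counts[context][event] / total)
                break
        else:
            scores.append(0.0)  # context never seen at any order
    return scores

# Toy usage: train on "normal" runs, score a run containing a rare event.
normal = [["start", "load", "run", "stop"]] * 10
counts = train_ngram(normal)
print(score_events(counts, ["start", "load", "crash", "stop"]))
# -> [0.25, 1.0, 0.0, 0.25]: the unseen "crash" event scores 0.0
```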

Publication
In Proceedings of NEXTA 2022 (online).