Data-driven models of reputation in cyber-security


In this talk, I will present our work on developing data-driven, predictive models of reputation (such as benign or malicious) for end-point hosts. I'll focus on two particular questions:

1) Malware often relies on so-called domain-generation algorithms (DGAs) to produce "fake" domain names that are used to connect compromised hosts with a command-and-control server. Many types of DGAs have been developed, from simple hashing techniques to more sophisticated approaches based on wordlists. I will show that these malware-generated domain names can be detected with recurrent neural networks such as LSTMs or GRUs.

2) The second part of the talk will focus on neural models of traffic reputation learned from passive DNS data. Passive DNS data are collections of inter-server DNS queries captured by sensors distributed on the network. These data are a goldmine for predicting whether a given domain name or IP address is likely to be benign or malicious. I will describe a deep neural architecture that predicts the reputation of end-point hosts with high accuracy. The neural model is trained on a large passive DNS dataset (745 million entries) and relies on a broad range of features extracted from the DNS graph.
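
To make the first question more concrete, here is a minimal Python sketch of what a "simple hashing" DGA might look like, together with the kind of character-level encoding that a recurrent network such as an LSTM or GRU would typically take as input. It is an illustration only, not the method presented in the talk: the function names (toy_dga, encode_domain), the seed string, the alphabet and the maximum length are all assumptions made for the example.

```python
import hashlib
from datetime import date

def toy_dga(seed: str, day: date, count: int = 5,
            length: int = 12, tld: str = ".com") -> list[str]:
    """Hypothetical hash-based DGA: derives `count` domain names per day.

    `length` must be at most 32, the size of a SHA-256 digest in bytes.
    """
    domains = []
    for i in range(count):
        # Seed, date and counter are hashed together, so malware and its
        # operator can compute the same names without communicating.
        material = f"{seed}-{day.isoformat()}-{i}".encode("utf-8")
        digest = hashlib.sha256(material).digest()
        # Map each of the first `length` digest bytes to a lowercase letter.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        domains.append(label + tld)
    return domains

# Assumed character vocabulary; id 0 is reserved for padding.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-."
CHAR_TO_ID = {c: i + 1 for i, c in enumerate(ALPHABET)}
UNKNOWN_ID = len(ALPHABET) + 1  # any character outside ALPHABET

def encode_domain(domain: str, max_len: int = 32) -> list[int]:
    """Turn a domain name into a fixed-length sequence of integer ids."""
    ids = [CHAR_TO_ID.get(c, UNKNOWN_ID) for c in domain.lower()[:max_len]]
    # Right-pad with zeros so every sequence has the same length.
    return ids + [0] * (max_len - len(ids))

if __name__ == "__main__":
    for name in toy_dga("example-seed", date(2020, 1, 1)):
        print(name, encode_domain(name)[:8])
```

Sequences encoded this way would be fed to an embedding layer followed by the recurrent layers. A wordlist-based DGA would replace the byte-to-letter mapping with a lookup into a dictionary of words.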