By: Walker Rowe, April 24, 2017 (09:45 AM)

Using Threat Intelligence Feeds and Machine Learning to Flag Malicious Traffic

Threat intelligence feeds are data gathered from attacks around the world and either sold or given away for free. The idea is to take logs from your firewalls and match them against lists of blacklisted IP addresses, domains, the email addresses used to register domains, and so on, to determine which IP addresses to block to keep bad actors out. You can also use the GeoIP feature of Logstash and Elasticsearch to map these IP addresses to particular countries.

But it is not as simple as blocking IP addresses. You cannot simply block whole IP address ranges: that would cut off legitimate traffic, because hackers hide their attacks inside honest traffic using botnets whose owners do not even know their machines are attacking other people. They also register their domains and IPs with legitimate registrars. So you need to look at other threat indicators, such as which URI the client is trying to reach (the WordPress login page wp-admin, for example). Then you take the honest and hostile traffic and feed it into machine learning algorithms to train them. Training produces a model through which this data you have vacuumed up worldwide can be applied to your own traffic to draw conclusions.
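To make the indicator matching concrete, here is a minimal Python sketch that flags log entries whose source IP appears on a threat feed or whose URI matches a known attack probe such as wp-admin. The IPs, URIs, and log format below are invented for illustration, not taken from any real feed.

```python
# Hypothetical blacklist pulled from a threat intelligence feed
BLACKLISTED_IPS = {"203.0.113.7", "198.51.100.23"}

# URIs that attackers commonly probe, e.g. the WordPress admin login
SUSPICIOUS_URIS = {"/wp-admin", "/wp-login.php", "/phpmyadmin"}

def is_suspicious(entry):
    """Return True if a parsed firewall/web log entry looks hostile."""
    return entry["src_ip"] in BLACKLISTED_IPS or entry["uri"] in SUSPICIOUS_URIS

log_entries = [
    {"src_ip": "192.0.2.10",  "uri": "/index.html"},  # legitimate
    {"src_ip": "203.0.113.7", "uri": "/index.html"},  # blacklisted IP
    {"src_ip": "192.0.2.11",  "uri": "/wp-admin"},    # probing WordPress
]

flagged = [e for e in log_entries if is_suspicious(e)]
print(len(flagged))  # 2 of the 3 entries are flagged
```

In practice the blacklist set would be refreshed from the feed on a schedule rather than hard-coded, and an entry matching one indicator would raise a score instead of a hard block, for the reasons above.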

Threat Intelligence and Legit Traffic Data Sources

When you have gathered all this data, you convert it into numeric format and feed it into machine learning models. Those numbers become a set of labels and features. For example, a label could be “hacker” (1) or “not a hacker” (0), and the features are as many relevant data points as you can extract from your own firewalls and the threat intelligence feeds.
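A short sketch of that encoding step: each traffic record becomes a numeric feature vector plus a 0/1 label. The field names and the particular features chosen here are illustrative assumptions, not a fixed schema from any feed.

```python
def to_features(record):
    """Encode one traffic record as a numeric feature vector."""
    return [
        1.0 if record["ip_blacklisted"] else 0.0,          # IP on a threat feed?
        1.0 if record["uri"].startswith("/wp-") else 0.0,  # probing WordPress paths?
        float(record["requests_per_min"]),                 # request rate
    ]

# Label: 1 = hacker, 0 = not a hacker (known from past incidents)
training_data = [
    ({"ip_blacklisted": True,  "uri": "/wp-admin",   "requests_per_min": 300}, 1),
    ({"ip_blacklisted": False, "uri": "/index.html", "requests_per_min": 4},   0),
]

X = [to_features(rec) for rec, label in training_data]  # feature vectors
y = [label for rec, label in training_data]             # labels
print(X[0])  # [1.0, 1.0, 300.0]
print(y)     # [1, 0]
```

These X and y arrays are exactly the shape a classifier in Spark ML or any other library expects for training.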

Data models need data sources. You can get the top 20 subnets that have launched attacks in the past three days from the SANS Technology Institute. You can get non-malicious IPs from Alexa (now owned by Amazon) and global firewall logs from the DShield data feeds.
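Because feeds like the SANS top-20 list publish subnets rather than single addresses, you need to test each observed IP for subnet membership. Python's standard ipaddress module handles this; the subnets below are reserved documentation ranges, not real feed data.

```python
import ipaddress

# Hypothetical attacking subnets, as a feed like SANS/DShield might publish
attacking_subnets = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def in_attack_subnet(ip):
    """Return True if the given IP falls inside any listed subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in attacking_subnets)

print(in_attack_subnet("203.0.113.99"))  # True
print(in_attack_subnet("192.0.2.1"))     # False
```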

Necessary Software and Skills

That is the easy part. Now you need someone who understands big data and data science. Not all programmers who know big data know data science (i.e., analytics and machine learning), because that requires a knowledge of math and statistics. It is not enough to know what an average and a standard deviation mean; they need to know linear algebra.

Putting together the infrastructure is fairly easy. Machine learning algorithms are built into Apache Spark ML. So install Apache Spark. To gather logs you need Elasticsearch. Then you can use Elasticsearch for Apache Hadoop (ES-Hadoop) to pull data from ES and work with it in Hadoop or Spark.

Then use Pig, Hive, Scala, Python, or R to extract and transform the data, join the different data feeds, and finally apply algorithms to detect network anomalies. Push those results back into ES and create visualizations with Kibana.
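At production scale the anomaly-detection step would run on Spark ML, but the core idea can be illustrated standalone: flag source IPs whose request rate sits far above the norm for your traffic. The counts and the 1.5-standard-deviation threshold below are invented for this sketch.

```python
import statistics

# Requests per minute observed per source IP (made-up numbers)
requests_per_ip = {
    "192.0.2.1": 12,
    "192.0.2.2": 9,
    "192.0.2.3": 11,
    "192.0.2.4": 10,
    "203.0.113.7": 480,  # outlier: possible attacker
}

counts = list(requests_per_ip.values())
mean = statistics.mean(counts)
stdev = statistics.stdev(counts)

# Flag IPs more than 1.5 standard deviations above the mean request rate
anomalies = [ip for ip, n in requests_per_ip.items()
             if (n - mean) / stdev > 1.5]
print(anomalies)  # ['203.0.113.7']
```

Those flagged IPs are the kind of result you would push back into ES and chart in Kibana.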
