Machine Learning for Better Threat Intelligence

March 04, 2020

Data processing today takes place at a scale that only automation can cover comprehensively. Combining data points from many different types of sources, including open, dark web, and technical sources, forms the most robust picture possible.

Recorded Future uses machine learning techniques in four ways to improve data collection and aggregation — to structure data into categories, to analyze text across multiple languages, to provide risk scores, and to generate predictive models.

1. To structure data into entities and events

Ontology has to do with how we split concepts up and how we group them together. In data science, ontologies represent categories of entities based on their names, properties, and relationships to each other, making them easier to sort into hierarchies of sets. For example, Boston, London, and Gothenburg are all distinct entities that will also fall under the broader “city” entity.
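To make the idea concrete, here is a minimal sketch of an ontology lookup in Python. The `Entity` class, the `CATEGORY_PARENTS` table, and the "location" parent category are illustrative assumptions, not Recorded Future's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    category: str  # e.g. "city"

# Hypothetical hierarchy: each category points to its broader parent.
CATEGORY_PARENTS = {"city": "location", "country": "location", "location": None}

def ancestors(category):
    """Return the category followed by every broader category above it."""
    chain = []
    while category is not None:
        chain.append(category)
        category = CATEGORY_PARENTS.get(category)
    return chain

def find_by_category(entities, category):
    """Select entities that fall under the given category at any level."""
    return [e for e in entities if category in ancestors(e.category)]

ENTITIES = [
    Entity("Boston", "city"),
    Entity("London", "city"),
    Entity("Gothenburg", "city"),
    Entity("Sweden", "country"),
]

cities = find_by_category(ENTITIES, "city")          # the three cities
locations = find_by_category(ENTITIES, "location")   # all four entities
```

A search for "location" returns the cities as well as the country, because each city inherits the broader category through the hierarchy.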

If entities represent a way to sort physically distinct concepts, then events sort concepts over time. Recorded Future events are language independent — something like “John visited Paris,” “John took a trip to Paris,” “Джон прилетел в Париж,” and “John a visité Paris” are all recognized as the same event.
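As a rough illustration of language independence, the toy sketch below maps the four example sentences to one canonical event. The hand-written `PATTERNS` and `NAMES` tables are assumptions made for the example; a production system would rely on trained language models rather than string matching.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str
    actor: str
    place: str

# Hypothetical per-language verb phrases that all signal a "travel" event.
PATTERNS = {
    "en": [(" visited ", "travel"), (" took a trip to ", "travel")],
    "ru": [(" прилетел в ", "travel")],
    "fr": [(" a visité ", "travel")],
}

# Hypothetical table mapping localized names to canonical entity names.
NAMES = {"Джон": "John", "Париж": "Paris"}

def extract_event(sentence, lang):
    """Return a canonical Event for the sentence, or None if nothing matches."""
    text = sentence.rstrip(".")
    for phrase, kind in PATTERNS[lang]:
        if phrase in text:
            actor, place = text.split(phrase, 1)
            return Event(kind, NAMES.get(actor, actor), NAMES.get(place, place))
    return None

SENTENCES = [
    ("John visited Paris", "en"),
    ("John took a trip to Paris", "en"),
    ("Джон прилетел в Париж", "ru"),
    ("John a visité Paris", "fr"),
]

# Frozen dataclasses are hashable, so identical events collapse in a set.
events = {extract_event(text, lang) for text, lang in SENTENCES}
```

Because every sentence resolves to the same `Event` value, `events` contains a single entry.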

Ontologies and events enable powerful searches over categories, letting analysts focus on the bigger picture rather than having to manually sort through data themselves.

2. To structure text in multiple languages through natural language processing

With natural language processing, entities and events are able to go beyond bare keywords, turning unstructured text from sources across different languages into a structured database.

The machine learning driving this process can separate advertising from primary content, classify text into categories like prose, data logs, or code, and disambiguate between entities with the same name (like “Apple” the company, and “apple” the fruit) by using contextual clues in the surrounding text.
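The disambiguation step can be sketched with a simple context-overlap heuristic. This is a deliberately naive stand-in for the machine learning described above, and the `SENSES` cue words are invented for the example.

```python
import re

# Hypothetical cue words associated with each sense of an ambiguous name.
SENSES = {
    "apple": {
        "Apple (company)": {"iphone", "shares", "ceo", "software", "company"},
        "apple (fruit)": {"pie", "orchard", "tree", "eat", "juice"},
    }
}

def disambiguate(mention, sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    senses = SENSES.get(mention.lower())
    if not senses:
        return None
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no contextual clue at all, decline to guess.
    return best if scores[best] > 0 else None

company = disambiguate("Apple", "Apple shares rose after the CEO announced new software")
fruit = disambiguate("apple", "She baked an apple pie from the orchard")
```

The first call resolves to the company and the second to the fruit, purely from the surrounding words.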

This way, the system can parse text from millions of documents daily across seven different languages, a task that would otherwise require an impractically large and skilled team of human analysts. Saving time like this helps IT security teams work 32 percent more efficiently with Recorded Future.

3. To classify events and entities, helping human analysts prioritize alerts

Machine learning and statistical methodology are used to further sort entities and events by importance — for example, by assigning risk scores to malicious entities.

Risk scores are calculated through two systems: one driven by rules based on human intuition and experience, and the other driven by machine learning trained on an already vetted dataset.

Classifiers like risk scores provide both a judgment (“this event is critical”) and context explaining the score (“because multiple sources confirm that this IP address is malicious”).
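A minimal sketch of the rule-driven side of such a classifier might look like the following. The rules, weights, thresholds, and field names are all hypothetical; the point is only that the result carries both a judgment and the evidence behind it.

```python
from dataclasses import dataclass

@dataclass
class RiskResult:
    score: int       # 0 to 100
    level: str       # the judgment
    evidence: list   # the context explaining the score

# Hypothetical rules: (explanation, test on the IP record, weight).
RULES = [
    ("reported by multiple sources", lambda ip: len(ip["sources"]) >= 2, 40),
    ("seen in recent malware traffic", lambda ip: ip["malware_sightings"] > 0, 35),
    ("listed on a blocklist", lambda ip: ip["on_blocklist"], 25),
]

def score_ip(ip):
    """Apply every rule and keep the explanations of those that fired."""
    triggered = [(why, weight) for why, test, weight in RULES if test(ip)]
    score = min(100, sum(weight for _, weight in triggered))
    if score >= 75:
        level = "critical"
    elif score >= 25:
        level = "suspicious"
    else:
        level = "low"
    return RiskResult(score, level, [why for why, _ in triggered])

result = score_ip({
    "sources": ["feed-a", "feed-b"],
    "malware_sightings": 3,
    "on_blocklist": False,
})
```

Here `result.level` is "critical" and `result.evidence` lists the two rules that fired, so an analyst can see why the score was assigned.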

Automating how risks are classified saves analysts time sorting through false positives and deciding what to prioritize, helping IT security staff who use Recorded Future spend 34 percent less time compiling reports.

4. To forecast events and entity properties through predictive models

Machine learning can also generate models that forecast future events, often more accurately than human analysts can, by drawing on the deep pools of data previously mined and categorized.

This is a particularly strong “law of large numbers” application of machine learning: as we continue to draw on more sources of data, these predictive models will become more and more accurate.
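The "law of large numbers" intuition can be seen in a small simulation: an estimate of how often an event occurs tends to get closer to the true rate as more observations are collected. The true rate of 0.3 and the `estimate_rate` helper are assumptions for illustration, and this sketch shows only the statistical principle, not an actual predictive model.

```python
import random

def estimate_rate(n_observations, true_rate=0.3, seed=0):
    """Estimate an event's rate from simulated yes/no observations."""
    rng = random.Random(seed)
    hits = sum(rng.random() < true_rate for _ in range(n_observations))
    return hits / n_observations

# Print the estimation error for growing sample sizes.
for n in (100, 10_000, 100_000):
    error = abs(estimate_rate(n) - 0.3)
    print(f"{n:>7} observations: error {error:.4f}")
```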
