Machine Learning for Better Threat Intelligence
Data processing today takes place at a scale that requires automation to be
comprehensive. Combine data points from many different types of sources,
including open, dark web, and technical sources, to form the most robust
picture possible.
Recorded Future uses machine learning techniques in four ways to improve data
collection and aggregation — to structure data into categories, to analyze text
across multiple languages, to provide risk scores, and to generate predictive
models.
1. To structure data into entities and events
Ontology has to do with how we split concepts up and how we group them
together. In data science, ontologies represent categories of entities based
on their names, properties, and relationships to each other, making them easier
to sort into hierarchies of sets. For example, Boston, London, and Gothenburg
are all distinct entities that will also fall under the broader “city” entity.
If entities represent a way to sort physically distinct concepts, then events
sort concepts over time. Recorded Future events are language independent:
something like “John visited Paris,” “John took a trip to Paris,” “Джон
прилетел в Париж,” and “John a visité Paris” are all recognized as the same
event.
Ontologies and events enable powerful searches over categories, letting
analysts focus on the bigger picture rather than having to manually sort
through data themselves.
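
As a rough illustration of the idea (not Recorded Future's actual data model), the sketch below shows a toy ontology and a language-independent event record. The names `ENTITY_TYPES`, `ALIASES`, `Event`, and `make_travel_event` are hypothetical, and the sketch assumes some earlier step has already extracted the actor and place from each sentence.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: each entity name maps to its broader category.
ENTITY_TYPES = {
    "Boston": "City",
    "London": "City",
    "Gothenburg": "City",
    "Paris": "City",
    "John": "Person",
}

# Hypothetical aliases: surface forms in other languages that resolve to
# one canonical entity name.
ALIASES = {
    "Джон": "John",
    "Париж": "Paris",
}


@dataclass(frozen=True)
class Event:
    """A language-independent event: who did what, and where."""
    event_type: str
    actor: str
    place: str


def canonical(name: str) -> str:
    """Resolve a surface form to its canonical entity name."""
    return ALIASES.get(name, name)


def make_travel_event(actor: str, place: str) -> Event:
    """Build a travel event from already-extracted actor and place names."""
    return Event("PersonTravel", canonical(actor), canonical(place))


def entities_of_type(type_name: str) -> list[str]:
    """Search by category instead of by individual keyword."""
    return sorted(n for n, t in ENTITY_TYPES.items() if t == type_name)
```

With these definitions, `make_travel_event("John", "Paris")` and `make_travel_event("Джон", "Париж")` compare equal, and `entities_of_type("City")` returns every city in the toy ontology without the analyst listing them by hand.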
2. To structure text in multiple languages through natural language processing
With natural language processing, entities and events are able to go
beyond bare keywords, turning unstructured text from sources across different
languages into a structured database.
The machine learning driving this process can separate advertising from
primary content, classify text into categories like prose, data logs, or code,
and disambiguate between entities with the same name (like “Apple” the company,
and “apple” the fruit) by using contextual clues in the surrounding text.
This way, the system can parse text from millions of documents daily
across seven different languages — a task that would require an impractically
large and skilled team of human analysts to do. Saving time like this helps IT
security teams work 32 percent more efficiently with Recorded Future.
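
To make the disambiguation step concrete, here is a deliberately simple sketch that picks a sense for “apple” by counting contextual clue words. It is a stand-in for the trained models described above, not a description of them; the clue vocabularies in `SENSE_CLUES` and the function `disambiguate` are invented for illustration.

```python
import re

# Hypothetical clue words for two senses of the same name.
SENSE_CLUES = {
    "Apple (company)": {"iphone", "mac", "shares", "ceo", "software", "device"},
    "apple (fruit)": {"pie", "orchard", "juice", "tree", "eat", "fruit"},
}


def disambiguate(text: str) -> str:
    """Pick the sense whose clue words overlap most with the text.

    Returns "unknown" when no clue word appears at all. On a tie, the
    sense listed first in SENSE_CLUES wins.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & clues) for sense, clues in SENSE_CLUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

For example, `disambiguate("Apple shares rose after the new iPhone launch")` resolves to the company, while `disambiguate("She baked an apple pie from the orchard")` resolves to the fruit. A production system would learn such contextual signals from data rather than from a hand-written word list.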
3. To classify events and entities, helping human analysts prioritize alerts
Machine learning and statistical methodology are used to further sort
entities and events by importance — for example, by assigning risk scores to
malicious entities.
Risk scores are calculated through two systems: one driven by rules based
on human intuition and experience, and the other driven by machine learning
trained on an already vetted dataset.
Classifiers like risk scores provide both a judgment (“this event is
critical”) and context explaining the score (“because multiple sources confirm
that this IP address is malicious”).
Automating how risks are classified saves analysts time sorting through
false positives and deciding what to prioritize, helping IT security staff who
use Recorded Future spend 34 percent less time compiling reports.
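
The rule-driven half of such a scoring system might look something like the sketch below, which returns both a judgment and the evidence behind it. The rules, point values, and thresholds are made up for illustration and do not reflect Recorded Future's actual scoring; the machine-learning half is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class RiskResult:
    score: int                      # 0-100, higher means riskier
    judgment: str                   # short label an analyst can sort on
    evidence: list[str] = field(default_factory=list)  # why the score is what it is


# Hypothetical rules: (description, points, predicate over an observation dict).
RULES = [
    ("reported as malicious by multiple sources", 50,
     lambda obs: obs.get("malicious_reports", 0) >= 2),
    ("seen on a threat list", 30,
     lambda obs: obs.get("on_threat_list", False)),
    ("recently linked to malware", 20,
     lambda obs: obs.get("recent_malware_link", False)),
]


def score_entity(obs: dict) -> RiskResult:
    """Apply every rule and keep the descriptions of those that fired."""
    triggered = [(desc, pts) for desc, pts, pred in RULES if pred(obs)]
    score = min(100, sum(pts for _, pts in triggered))
    if score >= 70:
        judgment = "critical"
    elif score >= 30:
        judgment = "suspicious"
    else:
        judgment = "low"
    return RiskResult(score, judgment, [desc for desc, _ in triggered])
```

Calling `score_entity({"malicious_reports": 3, "on_threat_list": True})` yields a score of 80 with the judgment "critical" and two evidence strings, so the analyst sees both the verdict and the reasons for it.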
4. To forecast events and entity properties through predictive models
Machine learning can also generate models that forecast future events, often
much more accurately than human analysts can, by drawing on the deep pools of
data previously mined and categorized.
This is a particularly strong “law of large numbers” application of
machine learning: as we continue to draw on more sources of data, these
predictive models will become more and more accurate.