Owlyshield Ransomware Detection using XGBoost

Owlyshield is EDR (endpoint detection and response) software designed to detect ransomware by monitoring process activity on the file system. It operates in two distinct modes:

  • Training mode: In this mode, Owlyshield writes a CSV file containing various features extracted from the monitored processes. This data is used to train the XGBoost machine learning model to identify ransomware effectively.
  • Real-time prediction mode: In this mode, Owlyshield uses the trained XGBoost model to predict, in real time, whether the observed process activity is indicative of ransomware.

CSV Data Collection for Training

Owlyshield collects and aggregates data from various process activities and stores them in a CSV file. The CSV file has a header row followed by multiple lines containing the following columns:

Column                       Description
app_name                     Name of the application being monitored
gid                          Unique identifier for each group of I/O operations
ops_read                     Number of read operations performed
ops_setinfo                  Number of set information operations performed
ops_written                  Number of write operations performed
ops_open                     Number of open operations performed
bytes_read                   Total number of bytes read
bytes_written                Total number of bytes written
entropy_read                 Entropy of the read data
entropy_written              Entropy of the written data
files_opened                 Number of files opened
files_deleted                Number of files deleted
files_read                   Number of files read
files_renamed                Number of files renamed
files_written                Number of files written
extensions_read              Number of unique file extensions read
extensions_written           Number of unique file extensions written
extensions_written_doc       Number of document file extensions written
extensions_written_archives  Number of archive file extensions written
extensions_written_db        Number of database file extensions written
extensions_written_code      Number of source code file extensions written
extensions_written_exe       Number of executable file extensions written
dirs_with_files_created      Number of directories with files created
dirs_with_files_updated      Number of directories with files updated
pids                         Number of process IDs involved in the operations
exe_exists                   Indicator if an executable file exists (1 for yes, 0 for no)
clusters                     Number of clusters identified in the group of I/O operations
clusters_max_size            Maximum size of a cluster in the group of I/O operations
is_ransom                    Indicator if the application is classified as ransomware (1 for ransomware, 0 for non-ransomware)
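
The entropy_read and entropy_written columns measure how random the transferred data looks, which is useful because encrypted output tends to be close to uniformly random. The text does not say which entropy definition Owlyshield uses, so the sketch below assumes Shannon entropy in bits per byte; the function name `shannon_entropy` is hypothetical.

```rust
/// Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0).
/// Assumption: Shannon entropy is used here for illustration only; the
/// exact definition behind entropy_read / entropy_written is not stated.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each of the 256 possible byte values occurs.
    let mut counts = [0u64; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f64;
    // Sum -p * log2(p) over the byte values that actually occur.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}
```

A buffer made of a single repeated byte scores 0, while a buffer in which all 256 byte values are equally frequent scores 8, the upper bound that well-encrypted data approaches.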

The following is an example of such a CSV file, used to train an XGBoost classifier:

app_name,gid,ops_read,ops_setinfo,ops_written,ops_open,bytes_read,bytes_written,entropy_read,entropy_written,files_opened,files_deleted,files_read,files_renamed,files_written,extensions_read,extensions_written,extensions_written_doc,extensions_written_archives,extensions_written_db,extensions_written_code,extensions_written_exe,dirs_with_files_created,dirs_with_files_updated,pids,exe_exists,clusters,clusters_max_size,is_ransom
Ransom_exe_Avaddon_09_06_2020_1054KB.exe2-146000200000000000300002001100True
Ransom_exe_Avaddon_09_06_2020_1054KB.exe2-1465003532256050001001400003001100True
Ransom_exe_Avaddon_09_06_2020_1054KB.exe2-1469005175264050002001400003001100True

A new row is appended to the CSV file every 50 I/O operations (calls to VFS functions on Linux) observed in the monitored processes. This fixed interval is meant to keep the collected data representative of process behavior without being too sparse or too dense.
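
The row-every-50-operations rule can be sketched as follows. This is a minimal illustration, not Owlyshield's actual code: the types `IoOp`, `Counters` and `TrainingCollector` are hypothetical, only a handful of the columns are tracked, the counters are assumed to be cumulative for the group (not reset after each row), and the label is written as `True`/`False` to match the sample rows.

```rust
/// Number of I/O operations between two CSV rows (taken from the text).
const OPS_PER_ROW: u64 = 50;

/// Hypothetical event type; real events would come from the driver.
#[derive(Clone, Copy)]
enum IoOp {
    Read { bytes: u64 },
    Write { bytes: u64 },
    SetInfo,
    Open,
}

/// A small subset of the feature columns, for brevity.
#[derive(Default)]
struct Counters {
    ops_read: u64,
    ops_setinfo: u64,
    ops_written: u64,
    ops_open: u64,
    bytes_read: u64,
    bytes_written: u64,
}

struct TrainingCollector {
    app_name: String,
    gid: u64,
    is_ransom: bool,
    counters: Counters,
    /// CSV rows emitted so far (a real collector would write to a file).
    rows: Vec<String>,
}

impl TrainingCollector {
    fn new(app_name: &str, gid: u64, is_ransom: bool) -> Self {
        Self {
            app_name: app_name.to_string(),
            gid,
            is_ransom,
            counters: Counters::default(),
            rows: Vec::new(),
        }
    }

    /// Update the counters and emit one row every OPS_PER_ROW operations.
    fn record(&mut self, op: IoOp) {
        let c = &mut self.counters;
        match op {
            IoOp::Read { bytes } => {
                c.ops_read += 1;
                c.bytes_read += bytes;
            }
            IoOp::Write { bytes } => {
                c.ops_written += 1;
                c.bytes_written += bytes;
            }
            IoOp::SetInfo => c.ops_setinfo += 1,
            IoOp::Open => c.ops_open += 1,
        }
        let total = c.ops_read + c.ops_setinfo + c.ops_written + c.ops_open;
        if total % OPS_PER_ROW == 0 {
            // Assumption: counters are cumulative, so each row is a snapshot.
            let label = if self.is_ransom { "True" } else { "False" };
            self.rows.push(format!(
                "{},{},{},{},{},{},{},{},{}",
                self.app_name, self.gid, c.ops_read, c.ops_setinfo,
                c.ops_written, c.ops_open, c.bytes_read, c.bytes_written, label
            ));
        }
    }
}
```

Feeding 100 operations into a collector produces exactly two rows, one after the 50th operation and one after the 100th, however long those operations took.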

Real-time Ransomware Prediction

In real-time prediction mode, Owlyshield continuously monitors and aggregates the same set of features as in the training mode. However, instead of writing the data to a CSV file, the data is stored in memory in a central structure. This structure is refreshed every 50 I/O operations, ensuring that the model is always working with the latest data.
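
As a rough sketch of this mode, the code below keeps running counters per gid in memory, publishes a snapshot every 50 operations, and scores that snapshot with a model. All names (`LiveFeatures`, `LiveMonitor`, `observe`) are hypothetical, the feature set is reduced to three fields, and the model is passed in as a closure because the text does not describe the real interface.

```rust
use std::collections::HashMap;

/// Refresh interval in I/O operations (taken from the text).
const REFRESH_EVERY: u64 = 50;

/// Reduced, hypothetical feature set kept in memory per gid.
#[derive(Default, Clone, Debug)]
struct LiveFeatures {
    ops_total: u64,
    bytes_written: u64,
    files_renamed: u64,
}

/// Central in-memory structure: running counters plus the last snapshot
/// that was handed to the model for each gid.
struct LiveMonitor<F: Fn(&LiveFeatures) -> f32> {
    running: HashMap<u64, LiveFeatures>,
    snapshot: HashMap<u64, LiveFeatures>,
    predict: F,
}

impl<F: Fn(&LiveFeatures) -> f32> LiveMonitor<F> {
    fn new(predict: F) -> Self {
        Self {
            running: HashMap::new(),
            snapshot: HashMap::new(),
            predict,
        }
    }

    /// Record one I/O operation. Returns Some(score) when this operation
    /// triggered a refresh, None otherwise.
    fn observe(&mut self, gid: u64, bytes_written: u64, renamed: bool) -> Option<f32> {
        let f = self.running.entry(gid).or_default();
        f.ops_total += 1;
        f.bytes_written += bytes_written;
        if renamed {
            f.files_renamed += 1;
        }
        if f.ops_total % REFRESH_EVERY != 0 {
            return None;
        }
        // Every 50th operation: publish a fresh snapshot and score it.
        let snap = f.clone();
        let score = (self.predict)(&snap);
        self.snapshot.insert(gid, snap);
        Some(score)
    }
}
```

Keeping the snapshot separate from the running counters means the model always sees a consistent set of values, even while new operations keep arriving.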

XGBoost Overview

XGBoost (Extreme Gradient Boosting) is a powerful, open-source machine learning library designed for efficient and scalable implementation of gradient boosted decision trees. It has gained immense popularity due to its excellent performance, flexibility, and ability to handle large datasets.

For Owlyshield, the XGBoost model is translated into plain Rust code, further enhancing the performance and efficiency of the model. This optimized implementation enables real-time ransomware detection while maintaining low resource usage.
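
To give an idea of what a model translated into plain Rust could look like, here is a minimal sketch with two hand-written trees. The thresholds, leaf values, feature selection and function names are all invented for illustration and do not come from Owlyshield's trained model. The only XGBoost conventions relied on are that a split sends a sample to the left branch when `feature < threshold`, and that a binary logistic model sums the leaf values of all trees and applies a sigmoid.

```rust
/// Hypothetical subset of the features used by the example trees.
struct Sample {
    entropy_written: f32,
    files_renamed: f32,
    extensions_written: f32,
}

/// First tree: nested if/else, one branch per split node.
/// All thresholds and leaf values are made up.
fn tree_0(s: &Sample) -> f32 {
    if s.entropy_written < 7.2 {
        if s.files_renamed < 3.0 {
            -0.8
        } else {
            0.1
        }
    } else if s.extensions_written < 2.0 {
        0.9
    } else {
        0.3
    }
}

/// Second tree: a single split.
fn tree_1(s: &Sample) -> f32 {
    if s.files_renamed < 10.0 {
        -0.2
    } else {
        0.6
    }
}

/// Sum the leaf values of all trees and map the margin to a probability.
fn predict_proba(s: &Sample) -> f32 {
    let margin = tree_0(s) + tree_1(s);
    1.0 / (1.0 + (-margin).exp())
}
```

Because each tree compiles down to a few comparisons, scoring a sample needs no model file, parser or allocation at run time, which is consistent with the low resource usage described above.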

Figure: An example of a decision tree for ransomware classification

High-quality Training Data

The XGBoost model used by Owlyshield has been trained on an extensive dataset comprising recent ransomware samples and standard programs (non-malware). This diverse and up-to-date training data ensures that the model can effectively distinguish between ransomware activities and legitimate processes, minimizing the risk of false positives and false negatives.

Time-independent Metrics: I/O vs. Time

Owlyshield employs I/O-based metrics instead of time-based metrics for several reasons. Firstly, I/O operations are more consistent and reliable across various hardware configurations, while time-based metrics may be influenced by hardware performance variations. This ensures that the ransomware detection remains accurate and robust across different client systems.

Secondly, using I/O operations as a metric allows for a more granular representation of process activities, providing a better understanding of the process behavior. This enables the XGBoost model to make more accurate predictions based on the observed features, further improving the effectiveness of the ransomware detection.

In conclusion, Owlyshield leverages the high-performance XGBoost machine learning library, optimized by translating decision trees into plain Rust code, to detect ransomware effectively. The model is trained on a diverse dataset of recent ransomware and standard programs, and it employs I/O-based metrics instead of time-based metrics to ensure accurate and reliable ransomware detection across various client hardware configurations.