SoReL-20M: A Huge Dataset of 20 Million Malware Samples Released Online
2020-12-14 05:34

Cybersecurity firms Sophos and ReversingLabs on Monday jointly released the first-ever production-scale malware research dataset available to the general public. The release aims to help researchers build effective defenses and to drive industry-wide improvements in security detection and response.

"SoReL-20M", as it's called, is a dataset containing metadata, labels, and features for 20 million Windows Portable Executable files, including 10 million disarmed malware samples, with the goal of devising machine-learning approaches for better malware detection capabilities.

Although EMBER was released in 2018 as an open-source benchmark dataset for malware classifiers, its smaller sample size and its single-label design "limit[ed] the range of experimentation that can be performed with it."

SoReL-20M aims to get around these problems with 20 million PE samples: 10 million disarmed malware samples, plus extracted features and metadata for an additional 10 million benign samples.

The approach leverages a deep learning-based tagging model trained to generate human-interpretable semantic descriptions specifying important attributes of the samples involved.


News URL

http://feedproxy.google.com/~r/TheHackersNews/~3/gRBhQoGh1RI/sorel-20m-huge-dataset-of-20-million.html