Experience the first darkweb-trained AI, DarkBERT

What is DarkBERT?

DarkBERT is a language model trained by S2W on its vast collection of Dark Web data. While other similarly constructed encoder language models struggle with the extreme lexical and structural diversity of Dark Web language, DarkBERT has been trained specifically to comprehend the illicit content of the Dark Web. It starts from the RoBERTa model and continues pretraining it with masked language modeling (MLM) on text collected from the Dark Web.
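To make the MLM objective concrete, here is a minimal sketch of how one training example can be corrupted before the model is asked to recover the hidden tokens. The 15% masking rate and the 80/10/10 split between mask token, random token, and unchanged token are the usual BERT-style defaults and are assumed here. This post does not state the exact settings used for DarkBERT, and the names `mask_tokens` and `vocab` are purely illustrative.

```python
import random

MASK = "<mask>"  # mask token string used by RoBERTa-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for one MLM training example.

    labels[i] holds the original token at positions selected for prediction
    and None elsewhere, so a loss would only be computed on selected positions.
    mask_prob and the 80/10/10 split are assumed defaults, not DarkBERT's
    documented configuration.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with mask
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            corrupted.append(tok)
            labels.append(None)  # position is ignored by the loss
    return corrupted, labels


if __name__ == "__main__":
    words = "fresh dumps for sale on the usual market".split()
    print(mask_tokens(words, vocab=["card", "shop", "leak"], mask_prob=0.5))
```

In a real pipeline the tokens would be subword IDs from the RoBERTa tokenizer rather than whitespace-split words. The sketch only shows where the training signal comes from.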

Corpus collection is a fundamental challenge in training DarkBERT. S2W, renowned for its capabilities in collecting and analyzing Dark Web data and Doppelgängers on the DarkWeb, amassed a large Dark Web text corpus fit for training. We refined the corpus by removing pages that were redundant, duplicated, or low in information density. Even after filtering, we had a sizable corpus of 5.83 GB.
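The filtering step can be pictured with a small sketch like the one below. It is an assumption-laden illustration, not S2W's actual pipeline: exact duplicates are detected by hashing whitespace- and case-normalized text, and "low information density" is approximated by a minimum count of distinct tokens. The function name `clean_corpus` and the threshold `min_unique_tokens` are hypothetical.

```python
import hashlib


def clean_corpus(pages, min_unique_tokens=5):
    """Drop duplicate pages and pages with very few distinct tokens.

    Both heuristics are illustrative stand-ins for the filtering described
    in the text; the threshold is an arbitrary assumption.
    """
    seen = set()
    kept = []
    for text in pages:
        # Normalize case and whitespace so trivially different copies collide.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of a page we already processed
        seen.add(digest)
        if len(set(normalized.split())) < min_unique_tokens:
            continue  # too few distinct tokens to be informative
        kept.append(text)
    return kept


if __name__ == "__main__":
    sample = [
        "Selling fresh database dump with emails and hashes",
        "selling  fresh database dump with emails and hashes",
        "login login login",
    ]
    print(clean_corpus(sample))
```

Near-duplicate detection (for example with shingling or MinHash) would catch more redundancy than exact hashing, at the cost of extra complexity.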

Compare DarkBERT with Google BARD and OpenAI ChatGPT.

How do we use DarkBERT?

1) Dark Web Page Classification: The Dark Web is home to numerous pages full of explicit content dedicated to different types of cybercrime. Automatically classifying pages based on their content is invaluable for timely Dark Web intelligence. DarkBERT achieves state-of-the-art performance on the dark web page classification task, which aims to classify webpage content into topics such as Pornography, Hacking, and Violence. Our page classification schema is described in Shedding New Light on the Language of the Dark Web.

2) Ransomware Leak Site Detection: Ransomware groups often run “leak sites” to publish confidential data of uncooperative victim companies. Finding these websites quickly is crucial for gathering intelligence on high-profile ransomware groups. DarkBERT achieved state-of-the-art performance in automatically detecting leak sites.

3) Noteworthy Thread Detection: Underground forums serve as platforms for sharing and selling information related to various illegal activities. Monitoring them is challenging because countless users can post on any topic. Filtering posts to find noteworthy threads (such as those selling or sharing confidential information or malicious hacking tools) is essential for effective monitoring. DarkBERT achieved state-of-the-art performance in automatically detecting noteworthy forum threads.

4) Threat Keyword Inference: Some familiar words can have a completely different meaning on the dark web. DarkBERT is trained to understand the slang and explicit language used by cybercriminals, allowing us to understand how words are used in dark web contexts.

Another example from the DarkBERT demo, alongside BARD and ChatGPT.

Here is a solution powered by DarkBERT.

S2W’s AI team, who made DarkBERT.
