The DarkBERT AI model is trained on the darkest corners of the internet

Designing an artificial intelligence (AI) model built on dark web content? That is the idea of a South Korean team, with a project called DarkBERT. But it is not at all about creating an evil version of ChatGPT.

It’s no secret: to make ChatGPT work, OpenAI first had to build its “engine”, known as a language model. When the famous chatbot was launched at the end of November 2022, it relied on the language model called GPT-3.5. Since mid-March 2023, it has also been able to call on GPT-4, via a paid subscription.

The successive language models built by OpenAI are trained on data collected from the web, for example from the Wikipedia encyclopedia or the Reddit community site. To give an idea of scale, GPT-2 was trained on 40 GB of text, GPT-3 on 570 GB. As for GPT-4, the figures are kept secret, but the corpus is probably even larger.

ChatGPT was trained with the surface web. But on the Internet there are also much darker areas. // Source: Numerama

There are many language models, some of which fall into the category of Large Language Models (LLMs). Besides GPT, we can mention BERT and LaMDA from Google, Chinchilla from DeepMind, Claude from Anthropic, and LLaMA from Meta. Bloomberg even released its own, specialized in the financial sector: BloombergGPT.

What these different LLMs have in common is that they rely on data taken from the web. It is precisely this approach that researchers at the Korea Advanced Institute of Science and Technology (KAIST), together with employees of S2W Inc., a company specializing in cybersecurity data analysis for cyber threat intelligence, decided to set aside.

A language model trained on the dark web

Instead of building their language model from data pulled from the web, the team wanted to design one trained solely on information from the dark web. This is a section of the network that is not normally accessible with a standard web browser and that traditional search engines, such as Google or Bing, do not index.

This gave birth to the DarkBERT project, a name directly inspired by one of Google’s projects. BERT, an acronym for “Bidirectional Encoder Representations from Transformers”, is a model launched by the American company to better understand the meaning and context of a word by examining what comes before and after it. This is, according to the company, key to understanding the intent of a search.

As the project was designed for the dark web, it was logically called DarkBERT. The dark web is sometimes described as the “dark side” of the net. Special software is required to access it, because the dark web only exists on darknets, i.e. networks overlaid on the Internet. Tor and Freenet are examples of darknets.

A part of the web is accessible on special networks: the darknets. // Source: Numerama

Scientists shared their work in a post on the arXiv website, under the platform’s preprint policy: the study has not been peer-reviewed or published in a recognized journal. It has been available since May 15, 2023 under the title “DarkBERT: A Language Model for the Dark Side of the Internet”.

In the paper, the team behind DarkBERT explains that they wanted to produce a language model specifically tailored to the dark web, because studies of it “usually require textual analysis”. The model was therefore pre-trained on a corpus of 5.83 GB of raw text and 5.2 GB of pre-processed text.

An exercise with its limits and difficulties

To allow DarkBERT to adapt to the language used on the dark web, the model had to be pre-trained on a large-scale dark web corpus collected by crawling the Tor network. This corpus was cleaned, in particular “to address potential ethical concerns in texts related to sensitive information”.

The team also acknowledges that personal data was not the only difficulty encountered. It had to settle for working only on English-language content, even if it already envisages a polyglot DarkBERT, which will require collecting new data for each target language. English remains, however, the majority language on the dark web.

Another risk to take into account was the collection of criminal content. The researchers specifically mention child sexual abuse material, a risk they avoided by restricting collection to text-only content, which mechanically discarded media files (videos, photos, etc.).
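The paper does not reproduce the crawler’s code here, but the text-only filtering idea can be sketched in a few lines: keep a crawled response only when its declared MIME type is textual, so images and videos never enter the corpus. The function names and URLs below are illustrative assumptions, not taken from DarkBERT.

```python
# Sketch of a text-only collection filter: discard any crawled response
# whose Content-Type header is not textual, so media files (videos,
# photos...) never reach the training corpus. Names are illustrative.

TEXT_TYPES = {"text/html", "text/plain"}

def is_text_response(content_type: str) -> bool:
    """True if the Content-Type header declares a textual payload."""
    # Headers often carry parameters, e.g. "text/html; charset=utf-8".
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in TEXT_TYPES

def keep_text_only(responses):
    """Filter (url, content_type, body) triples down to textual ones."""
    return [(url, body) for url, content_type, body in responses
            if is_text_response(content_type)]

crawl = [
    ("http://example.onion/page", "text/html; charset=utf-8", "<html>...</html>"),
    ("http://example.onion/clip.mp4", "video/mp4", b"\x00\x01"),
    ("http://example.onion/photo.jpg", "image/jpeg", b"\xff\xd8"),
]
kept = keep_text_only(crawl)
print(kept)  # only the HTML page survives
```

Filtering on the declared media type at collection time, rather than inspecting files afterwards, means problematic images and videos are never downloaded or stored in the first place.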

The Tor network allows access to content that is not normally reachable on the Internet. // Source: Tor Project

The result? “Our evaluations show that DarkBERT outperforms current language models and can serve as a valuable resource for future research on the dark web,” the study’s authors assert. The team says they tested their tool against other popular language models, including BERT itself, which was trained on the surface web, the visible part of the web.

There is an unbridled version of ChatGPT, called DAN (Do Anything Now). It is presented as its evil twin, reportedly accessible by “jailbreaking” the chatbot.

One question remains: why develop a language model of this type? Is it to create a corrupt alternative to ChatGPT, one that would deliver inappropriate or even downright illegal responses? The dark web is often associated with the image of a shady underworld, where the talk is of sex, drugs, firearms, hacking, viruses and all kinds of crimes and misdemeanors.

It is quite the opposite. The researchers see it as a tool designed to serve the “good guys” and to give a better view of what is happening there: “We present potential use cases to illustrate the benefits of using DarkBERT in cybersecurity-related tasks such as dark web chat detection and ransomware or leak detection.”

This is a first draft, pending further development. “We plan to improve the performance of dark web domain-specific pre-trained language models by leveraging newer architectures,” they say in conclusion. Eventually, the tool could be tuned to scan the dark web much faster and much more often, to spot certain perils early.

The Batman of AI, in short: it dives into the darkness so you don’t have to.


Source: Numerama