Several readers asked us about a preprint recently posted on the bioRxiv platform by evolutionary biologist Jesse Bloom, who explains that Chinese data on SARS-CoV-2 were deleted from an international database.

Data has been deleted from the SRA database

When a team of researchers sequences the genome of a virus, it can share the result with the scientific community in public databases, such as the Sequence Read Archive (SRA) operated by a division of the National Institutes of Health (NIH) in the United States. It is precisely in this SRA database that Bloom worked.

The researcher was interested in an article from May 2020, published in the journal PeerJ, which lists the genetic sequences of SARS-CoV-2 submitted to the SRA up to the end of March 2020. However, Bloom noticed that a set of sequences mentioned in PeerJ no longer appeared in the SRA. The missing work was carried out by a joint team of researchers from Wuhan and Shanghai.

In seeking to trace the erased data, Bloom got hold of an SRA backup file stored on the Google Cloud platform. In the article posted on bioRxiv (which has not yet been reviewed by independent researchers before publication), Bloom explains that he unearthed thirteen genetic sequences of SARS-CoV-2. What sets the sequenced viruses apart: they are, in all likelihood, slightly “older” than those identified in infected patients in December 2019 at the now famous Wuhan seafood market.

Older? This statement deserves a little explanation. The more time passes, the more a virus spreads, and the more mutations it accumulates. Currently, the virus known to be the closest genetically to SARS-CoV-2 is called RaTG13, a bat virus sequenced by researchers in Wuhan in 2013. The SARS-CoV-2 viruses collected in December 2019 from patients of the Wuhan seafood market differ from RaTG13 by a number of mutations. The viruses that infected people in 2020 and 2021 have accumulated many additional mutations compared to these sequences from the end of 2019. However, the sequences unearthed by Bloom present… a few fewer mutations than those associated with the market patients. It therefore appears that the missing sequences correspond to early cases of SARS-CoV-2 infection, in the fall of 2019.
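The reasoning above can be illustrated with a minimal sketch, assuming the sequences are already aligned and of equal length. The function name and the three short sequences are invented for illustration; they are not real viral data, and this is not the method used in Bloom's article. The idea is simply to count how many positions differ from a reference relative and to rank the samples by that count.

```python
def count_differences(seq: str, reference: str) -> int:
    """Count positions where two aligned, equal-length sequences differ."""
    if len(seq) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq, reference) if a != b)


# Toy 12-letter sequences, made up for this example (NOT real genomes).
outgroup = "ACGTACGTACGT"  # stands in for the closest known bat virus
toy_samples = {
    "recovered_sample": "ACGTACGTACGA",
    "market_sample": "ACGAACGTACGA",
    "later_sample": "ACGAACCTACGA",
}

# Number of differences between each sample and the outgroup.
distances = {
    name: count_differences(seq, outgroup)
    for name, seq in toy_samples.items()
}

# Samples ordered from fewest to most differences.
ordered = sorted(distances, key=distances.get)
print(distances)
print(ordered)
```

In this toy setup, the sample with the fewest differences from the reference is the one that has diverged least from it. A real analysis would require full genome alignment and phylogenetic methods, and a low count alone is an indication rather than proof that a sequence is older.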

The fact that patients at the Wuhan seafood market were not the very first victims of Covid-19 is nothing new: as early as January 2020, studies confirmed that cases unrelated to the market had occurred during December. What surprises Jesse Bloom is above all that these sequences, which could help the scientific community trace the genetic evolution of the virus, disappeared from the SRA after March 2020.

In fact, the managers of the SRA allow researchers to request the deletion of submitted data, a request that may in particular be motivated by an error in the initial submission. Questioned by several media outlets, including Libération, the NIH specified that the sequences identified by Bloom had indeed been deleted from the database “at the request of the researcher who holds the rights to the data.”

The deletion was reportedly motivated by an “update” of the submitted data, and by the wish to submit the sequences “to another database”. The NIH told CheckNews that the disputed sequences “were submitted to the SRA in March 2020 and were the subject of a withdrawal request in June 2020”.

Asked by journalists from the scientific journal Science, Jesse Bloom explained that he did not find any of these sequences in the databases he consulted. On Twitter, Bloom later noted that the sequence set had also been deposited in the China National GeneBank (CNGB), and that it was deleted from it between mid-June and early July.

Contacted by CheckNews, the Chinese researchers had not given us more details on this point at the time of publishing this article.

A trace in the scientific literature

Although we could not identify a database hosting the “corrected” version of these genetic sequences, we did find traces of them… in the scientific literature. Indeed, on March 6, 2020 – three weeks before the PeerJ article listed SRA data on SARS-CoV-2 – the Chinese researchers posted a preprint analyzing their sequences on the medRxiv platform. A final version of this research was published at the end of June of the same year in the journal Small. The data erased from the SRA were therefore not concealed and consigned to oblivion, since they exist in the scientific literature (even if the article in Small does not present the full sequences of the viruses, and focuses on their characteristic mutations).

The existence of these publications is mentioned by Jesse Bloom himself in his article. The biologist remains surprised, however, by the removal of the sequences from the SRA database, which is extremely useful for searching and comparing genetic data. According to him, as presented in the Small article, the Chinese sequences are very difficult to use and put into perspective. “No one was aware of these sequences, because the way people find sequences is to go to the sequence databases, download the sequences and look at them,” Bloom explained to Science.

According to the biologist, no difference between the data erased from the SRA and those presented in Small demonstrates the existence of an “update” of the data initially submitted. In his article, Bloom judges that the case “suggests that at least in one instance, the trust structures of science have been abused to obscure relevant sequences [for understanding] the early spread of SARS-CoV-2 in Wuhan”.

The Chinese data are not the only data to have been deleted

The data identified by Bloom are not the only data that have been erased from the SRA since the start of the pandemic, as noted by the Washington Post. Seven other withdrawal requests, “mainly from submitters [located in the] United States”, were processed between June 2020 and June 2021, the NIH confirmed to the daily.

Asked by Science, several researchers expressed reservations about Bloom’s article. Stephen Goldstein, an evolutionary virologist at the University of Utah, is astonished that his colleague in Seattle sees proof “of a concealment” by the Chinese team, when the relevant information “has been online for over a year” (on medRxiv and in Small). The publication in Small “unfortunately went under the radar”, he judges, but it is readily accessible to the scientific community. Andrew Rambaut, another evolutionary biologist, affiliated with the University of Edinburgh, calls “grotesque” the idea that the Chinese team “tried to hide something”. “If they had wanted [to do so], they would surely not have submitted the article”, he continues, adding that the scientific article was posted as a preprint and submitted to Small before the sequences were the subject of a request for deletion from the SRA database.

Asked by several journalists about the contribution of his article to the debate on the origin of the virus, Bloom judged that his observations “did not support the hypothesis of a laboratory leak or that of zoonosis”. On the other hand, he considers that they provide “further evidence that this virus was probably circulating in Wuhan before December”. Bloom is also one of the co-signatories of a letter recently published by Science calling on the World Health Organization to carry out a more thorough and impartial investigation into the origin of the virus.

