SARS-CoV-2 Coronavirus Data Compression Benchmark

Innar Liiv, Ph.D, IEEE Senior Member
Last update: 11 March 2021

CALL FOR PARTICIPANTS!
Challenge: Compress the 1,317,937,667-byte Coronavirus dataset to less than 613,466 bytes!

"It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a fundamental new theory of how to make predictions given a collection of experiences and this is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it."
-- Marvin Minsky (2014)

"Being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers. In order to compress data, one has to find regularities in them, which is intrinsically difficult (many researchers live from analyzing data and finding compact models). So compressors beating the current "dumb" compressors need to be smart(er)."
-- Marcus Hutter (2006)

News

The Task

Losslessly compress the 1,317,937,667-byte (≈1.23 GiB) file coronavirus.unwrapped.fasta to less than 613,466 bytes. [Read more: arXiv PDF]

The Aim

The aim of this competition is to encourage multidisciplinary research to find the shortest lossless description for the sequences and to demonstrate that data compression can serve as an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of the data is the best model and therefore further reducing the size of this description requires fundamental understanding of the underlying context and data.

The Data

The data is presented in FASTA format and consists of 44,981 concatenated SARS-CoV-2 sequences with a total uncompressed size of 1,317,937,667 bytes; it was downloaded on 13 December 2020 from the NCBI.NLM.NIH.GOV severe acute respiratory syndrome coronavirus 2 data hub:

File: coronavirus.unwrapped.fasta [1,317,937,667 bytes, MD5SUM e0d7c063a65c7625063c17c7c9760708] [Download ZIP version]

Archived (transitioned-out versions of the dataset)
File: coronavirus.fasta [1,339,868,341 bytes, MD5SUM 96ae68def050d85b1b7f2f1565cb848f] [Download ZIP version]
File: coronavirus.2bit [332,133,731 bytes, MD5SUM db6ccc9e2ce649aa696e26fff3a60954] [Download ZIP version]
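
Downloads can be verified against the published checksums. A minimal Python sketch (the file names and MD5 sums are exactly those listed above; the helper names are illustrative):

```python
import hashlib

# Published checksums from the file listing above
EXPECTED_MD5 = {
    "coronavirus.unwrapped.fasta": "e0d7c063a65c7625063c17c7c9760708",
    "coronavirus.fasta": "96ae68def050d85b1b7f2f1565cb848f",
    "coronavirus.2bit": "db6ccc9e2ce649aa696e26fff3a60954",
}

def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, name):
    """True if the file at `path` matches the published checksum for `name`."""
    return md5sum(path) == EXPECTED_MD5[name]
```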

The transformation script from the old wrapped coronavirus.fasta file to the new coronavirus.unwrapped.fasta:

cat coronavirus.fasta | sed 's/>.*/<&</' | tr -d "\n" | tr "<" "\n" | sed '/^$/d' > coronavirus.unwrapped.fasta
printf "\n" >> coronavirus.unwrapped.fasta
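
The unwrapping step (joining each record's wrapped sequence lines into a single line, headers untouched) can equivalently be sketched in Python; this is an illustrative reimplementation, not the original script:

```python
def unwrap_fasta(text):
    """Join the wrapped sequence lines of each FASTA record into one line.

    Header lines (starting with '>') are kept as-is; the remaining lines
    of each record are concatenated into a single sequence line.
    """
    out = []
    seq = []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq:
                out.append("".join(seq))
                seq = []
            out.append(line)
        elif line:
            seq.append(line)
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n"  # single trailing newline, as in the script
```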

Setting the Scene

The challenge itself is about compressing a large number of similar concatenated sequences. As a slightly simpler example, more amenable to manual inspection, the compression results for a single sequence (the reference sequence NC_045512, published by Wu et al. in Nature 579, pp. 265–269 (2020)) are presented in the following table:

File: NC_045512.fasta [30,416 bytes] [Download]
File: NC_045512.2bit [7,524 bytes] [Download]
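
The 2Bit file is roughly a quarter of the FASTA size because each of the four bases T, C, A, G fits in two bits. The real UCSC .2bit container adds headers and N/mask block indexes; the sketch below shows only the core packing idea:

```python
# Core idea behind the 2Bit format: four bases per byte.
# This is NOT the full UCSC .2bit container (which adds headers and
# N/mask block records); it only illustrates the ~4x size reduction.
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}  # base encoding used by .2bit

def pack_2bit(seq):
    """Pack an ACGT string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))  # left-align a final partial group
        out.append(byte)
    return bytes(out)
```

For the 29,903-base reference genome this packing yields 7,476 bytes of sequence data, close to the 7,524-byte NC_045512.2bit file; the difference is container overhead.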

Bytes   File        Format  Compressor     Parameters
 7233   NC_045512   FASTA   cmix
 7277   NC_045512   FASTA   paq8l          -8
 7277   NC_045512   FASTA   GeCo3          -l 1 -lr 0.06 -hs 8
 7308   NC_045512   2Bit    cmix
 7337   NC_045512   2Bit    brotli         -q 10
 7346   NC_045512   2Bit    paq8l          -8
 7355   NC_045512   2Bit    zstd           -19
 7369   NC_045512   2Bit    bcm            -9
 7376   NC_045512   2Bit    gzip           -9
 7508   NC_045512   2Bit    xz             -9
 7517   NC_045512   2Bit    zip            -9
 7524   NC_045512   2Bit    Uncompressed
 7545   NC_045512   2Bit    rar            -m5
 7802   NC_045512   FASTA   bcm            -9
 7868   NC_045512   2Bit    bzip2          -9
 8399   NC_045512   FASTA   brotli         -q 11
 8519   NC_045512   FASTA   zstd           -19
 8801   NC_045512   FASTA   bzip2          -9
 9000   NC_045512   FASTA   xz             -9
 9598   NC_045512   FASTA   gzip           -9
 9623   NC_045512   FASTA   rar            -m5
 9738   NC_045512   FASTA   zip            -9
30416   NC_045512   FASTA   Uncompressed
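
Taking the 29,903-base length of the reference genome (from the Wu et al. publication, not the table), the best single-sequence result works out to just under two bits per base, only narrowly beating trivial 2-bits-per-base packing:

```python
REF_LEN = 29_903   # bases in NC_045512 (Wu et al., Nature 579, 2020)
best = 7233        # best result in the table above (cmix on FASTA), bytes

bits_per_base = best * 8 / REF_LEN
print(f"{bits_per_base:.3f} bits/base")  # ~1.935, vs 2.0 for plain 2-bit packing
```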

Leaderboard (compressing coronavirus.unwrapped.fasta)

Bytes | Decompressor size | Archive size | File | Format | Compressor | Parameters | Submission | Decompression time
613,466 | 13,499 | 599,967 | Coronavirus | FASTA | LILY-based | | Márcio Pais (21 February 2021) | 1036 seconds [2]
1,001,144 | 218,149 | 782,995 | Coronavirus | FASTA | cmv | -m2,0,0x0ba36a7f | Mauro Vezzosi (16 February 2021) |
1,029,832 | 163,453 | 866,379 | Coronavirus | FASTA | cmix | | Innar Liiv (baseline) | (in progress) [1]
1,353,808 | 30,491 | 1,323,317 | Coronavirus | FASTA | paq8l | -8 | Innar Liiv (baseline) | 86,110 seconds [1]
1,715,228 | N/A | N/A | Coronavirus | FASTA | xz | -9 | |
1,762,945 | N/A | N/A | Coronavirus | FASTA | brotli | -q 10 | |
1,915,086 | N/A | N/A | Coronavirus | FASTA | zstd | -19 | |
1,922,759 | N/A | N/A | Coronavirus | FASTA | rar | -m5 | |
1,979,504 | N/A | N/A | Coronavirus | FASTA | bcm | -9 | |
15,713,689 | N/A | N/A | Coronavirus | FASTA | gzip | -9 | |
15,713,816 | N/A | N/A | Coronavirus | FASTA | zip | -9 | |
39,083,402 | N/A | N/A | Coronavirus | FASTA | bzip2 | -9 | |
1,317,937,667 | N/A | N/A | Coronavirus | FASTA | Uncompressed | | |

Notes

[2] Validated by Innar Liiv using Windows 10 Education @ Intel Core i5-7500 CPU @ 3.40 GHz, 16GB RAM
[1] Tested by Innar Liiv using Ubuntu 18.04.5 LTS @ Intel Xeon CPU E5-2690 v3 @ 2.60GHz, 512GB RAM

Leaderboard (transitioned-out wrapped FASTA and 2Bit versions of the dataset)

Bytes | Decompressor size | Archive size | File | Format | Compressor | Parameters | Submission | Decompression time
682,262 | 13,754 | 668,508 | Coronavirus | FASTA | LILY-based | | Márcio Pais (12 February 2021) | 591 seconds [2]
973,571 | 34,713 | 938,858 | Coronavirus | FASTA | GeCo3 | -tm 3:1:1:1:0.8/0:0:0 -tm 6:1:1:1:0.85/0:0:0 -tm 9:1:1:1:0.85/0:0:0 -tm 12:10:0:1:0.85/0:0:0 -tm 15:200:1:10:0.85/2:1:0.85 -tm 17:200:1:10:0.85/2:1:0.85 -tm 20:500:1:40:0.85/5:20:0.85 -lr 0.03 -hs 64 | Milton Silva (10 January 2021) | 6079 seconds [3] [4]
1,063,646 | 218,149 | 845,497 | Coronavirus | FASTA | cmv | -m2,0,0x0ba36a7f | Mauro Vezzosi (26 January 2021) | [5]
1,152,411 | 163,453 | 988,958 | Coronavirus | 2Bit | cmix | | Innar Liiv (baseline) | 554,173 seconds [1]
1,169,683 | 32,170 | 1,137,513 | Coronavirus | 2Bit | LILY | | Márcio Pais (01 January 2021) | 102 seconds [2]
1,238,330 | 30,491 | 1,207,839 | Coronavirus | 2Bit | paq8l | -8 | Innar Liiv (baseline) | 24,975 seconds [1]
1,425,590 | N/A | N/A | Coronavirus | FASTA | paq8l | -8 | |
1,590,505 | N/A | N/A | Coronavirus | FASTA | NAF | | Kirill Kryukov (19 January 2021) |
1,985,384 | N/A | N/A | Coronavirus | 2Bit | xz | -9 | |
2,022,796 | N/A | N/A | Coronavirus | FASTA | xz | -9 | |
2,043,140 | N/A | N/A | Coronavirus | 2Bit | bcm | -9 | |
2,044,664 | N/A | N/A | Coronavirus | 2Bit | rar | -m5 | |
2,084,998 | 34,713 | 2,050,285 | Coronavirus | FASTA | GeCo3 | -l 1 -lr 0.06 -hs 8 | Innar Liiv (baseline) | 834 seconds [3] [4]
2,367,487 | N/A | N/A | Coronavirus | 2Bit | zstd | -19 | |
2,728,490 | N/A | N/A | Coronavirus | FASTA | bcm | -9 | |
2,871,864 | N/A | N/A | Coronavirus | 2Bit | brotli | -q 10 | |
2,871,864 | N/A | N/A | Coronavirus | FASTA | brotli | -q 10 | |
4,217,341 | N/A | N/A | Coronavirus | FASTA | zstd | -19 | |
5,924,805 | N/A | N/A | Coronavirus | FASTA | rar | -m5 | |
67,575,178 | N/A | N/A | Coronavirus | 2Bit | gzip | -9 | |
67,575,325 | N/A | N/A | Coronavirus | 2Bit | zip | -9 | |
75,530,790 | N/A | N/A | Coronavirus | 2Bit | bzip2 | -9 | |
75,530,790 | N/A | N/A | Coronavirus | FASTA | bzip2 | -9 | |
77,356,405 | N/A | N/A | Coronavirus | FASTA | gzip | -9 | |
77,356,550 | N/A | N/A | Coronavirus | FASTA | zip | -9 | |
332,133,731 | N/A | N/A | Coronavirus | 2Bit | Uncompressed | | |
1,339,868,341 | N/A | N/A | Coronavirus | FASTA | Uncompressed | | |

Notes

[5] TODO: decompression time & MD5SUM validation
[4] Not valid: decompression is not lossless, and 37,795,741 bytes go missing compared to the original
[3] Validated by Innar Liiv using Ubuntu 18.04.4 LTS @ Intel Core i9-9900 CPU @ 3.1 GHz, 64GB RAM
[2] Validated by Innar Liiv using Windows 10 Education @ Intel Core i5-7500 CPU @ 3.40 GHz, 16GB RAM
[1] Tested by Innar Liiv using Ubuntu 18.04.5 LTS @ Intel Xeon CPU E5-2690 v3 @ 2.60GHz, 512GB RAM

Rules and Participation

The rules are inspired by and therefore very similar to Matt Mahoney's Rules for Large Text Compression Benchmark.

Participants will be ranked by the compressed size of the coronavirus dataset plus the size of the decompressor as a zip archive. Participants can choose whether to use the FASTA or 2Bit format of the dataset. The decompressor may be measured either as a compressed executable or as compressed source code, whichever is smaller.
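
The ranking metric is thus archive size plus zipped decompressor size; for example, the current leader's 599,967-byte archive and 13,499-byte decompressor give the 613,466-byte total shown on the leaderboard (a sketch using the leaderboard's numbers):

```python
def ranking_total(archive_bytes, decompressor_zip_bytes):
    """Score used for ranking: compressed data plus zipped decompressor."""
    return archive_bytes + decompressor_zip_bytes

# Current top entry (Márcio Pais, 21 February 2021)
print(ranking_total(599_967, 13_499))  # 613466
```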

The challenge is open for anyone to participate and contribute; submissions beating the current top solution should be sent to coronavirus( a t )innar.com for verification.

Join the Coronavirus Data Compression Benchmark Discussion Group!

Advisory Board

More information