Innar Liiv, Ph.D, IEEE Senior Member
Last update: 11 March 2021
CALL FOR PARTICIPANTS!
Challenge: Compress the 1,317,937,667-byte Coronavirus dataset to less than 613,466 bytes!
"It seems to me that the most important discovery since Gödel was the discovery by Chaitin, Solomonoff and Kolmogorov of the concept called Algorithmic Probability which is a fundamental new theory of how to make predictions given a collection of experiences and this is a beautiful theory, everybody should learn it, but it’s got one problem, that is, that you cannot actually calculate what this theory predicts because it is too hard, it requires an infinite amount of work. However, it should be possible to make practical approximations to the Chaitin, Kolmogorov, Solomonoff theory that would make better predictions than anything we have today. Everybody should learn all about that and spend the rest of their lives working on it." -- Marvin Minsky (2014)
"Being able to compress well is closely related to acting intelligently, thus reducing the slippery concept of intelligence to hard file size numbers. In order to compress data, one has to find regularities in them, which is intrinsically difficult (many researchers live from analyzing data and finding compact models). So compressors beating the current "dumb" compressors need to be smart(er)." -- Marcus Hutter (2006)
Losslessly compress the 1,317,937,667-byte (about 1.23 GiB) file coronavirus.unwrapped.fasta to less than 613,466 bytes. [Read more: arXiv PDF]
The aim of this competition is to encourage multidisciplinary research to find the shortest lossless description for the sequences and to demonstrate that data compression can serve as an objective and repeatable measure to align scientific breakthroughs across disciplines. The shortest description of the data is the best model; further reducing the size of this description therefore requires a fundamental understanding of the underlying context and data.
The data is presented in FASTA format, consisting of 44,981 concatenated SARS-CoV-2 sequences with a total uncompressed size of 1,317,937,667 bytes, and was downloaded on 13 December 2020 from the NCBI.NLM.NIH.GOV severe acute respiratory syndrome coronavirus 2 data hub:
File: coronavirus.unwrapped.fasta [1,317,937,667 bytes, MD5SUM e0d7c063a65c7625063c17c7c9760708] [Download ZIP version]
Archived (superseded versions of the dataset):
File: coronavirus.fasta [1,339,868,341 bytes, MD5SUM 96ae68def050d85b1b7f2f1565cb848f] [Download ZIP version]
File: coronavirus.2bit [332,133,731 bytes, MD5SUM db6ccc9e2ce649aa696e26fff3a60954] [Download ZIP version]
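Before compressing anything, it is worth checking that a download matches the published checksum. The following is a minimal Python sketch, assuming the files sit in the current directory under the names listed above; the function names `md5_of_file` and `verify` are illustrative and not part of the benchmark.

```python
import hashlib

# Checksums as published in the file listing above.
EXPECTED_MD5 = {
    "coronavirus.unwrapped.fasta": "e0d7c063a65c7625063c17c7c9760708",
    "coronavirus.fasta": "96ae68def050d85b1b7f2f1565cb848f",
    "coronavirus.2bit": "db6ccc9e2ce649aa696e26fff3a60954",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare a local file against the published checksum for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]


if __name__ == "__main__":
    name = "coronavirus.unwrapped.fasta"
    print(name, "OK" if verify(name, name) else "MISMATCH")
```

Reading in chunks keeps memory use flat, which matters for a file of more than a gigabyte.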
The transformation script from the old coronavirus.fasta file to the new coronavirus.unwrapped.fasta file:
cat coronavirus.fasta | sed 's/>.*/<&</' | tr -d '\n' | tr '<' '\n' | tail -n +2 > coronavirus.unwrapped.fasta
printf "\n" >> coronavirus.unwrapped.fasta
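The unwrapping step (each header on its own line, followed by its whole sequence on a single line) can also be sketched in Python. This is an illustrative equivalent, not the script used to build the dataset: the helper name `unwrap_fasta` is an assumption, its output has not been checked against the published MD5, and unlike the shell pipeline it emits no blank line for a header that has no sequence.

```python
def unwrap_fasta(lines):
    """Yield unwrapped FASTA lines: each header, then its sequence on one line."""
    sequence_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new record begins: flush the sequence collected so far.
            if sequence_parts:
                yield "".join(sequence_parts)
                sequence_parts = []
            yield line
        elif line:
            sequence_parts.append(line)
    # Flush the final record.
    if sequence_parts:
        yield "".join(sequence_parts)


if __name__ == "__main__":
    with open("coronavirus.fasta") as src, \
            open("coronavirus.unwrapped.fasta", "w", newline="\n") as dst:
        for out_line in unwrap_fasta(src):
            dst.write(out_line + "\n")
```

Processing the input line by line avoids loading the full file into memory, and it does not rely on the data being free of `<` characters, which the `tr`-based pipeline assumes.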
The challenge itself is about compressing many similar concatenated sequences. For a simpler example that is easier to inspect manually, the compression results for a single sequence (the reference sequence NC_045512, published by Wu et al. in Nature 579, pp. 265–269 (2020)) are presented in the following table:
File: NC_045512.fasta [30,416 bytes] [Download]
File: NC_045512.2bit [7,524 bytes] [Download]
Bytes | File | Format | Compressor | Parameters |
7233 | NC_045512 | FASTA | cmix | |
7277 | NC_045512 | FASTA | paq8l | -8 |
7277 | NC_045512 | FASTA | GeCo3 | -l 1 -lr 0.06 -hs 8 |
7308 | NC_045512 | 2Bit | cmix | |
7337 | NC_045512 | 2Bit | brotli | -q 10 |
7346 | NC_045512 | 2Bit | paq8l | -8 |
7355 | NC_045512 | 2Bit | zstd | -19 |
7369 | NC_045512 | 2Bit | bcm | -9 |
7376 | NC_045512 | 2Bit | gzip | -9 |
7508 | NC_045512 | 2Bit | xz | -9 |
7517 | NC_045512 | 2Bit | zip | -9 |
7524 | NC_045512 | 2Bit | Uncompressed | |
7545 | NC_045512 | 2Bit | rar | m5 |
7802 | NC_045512 | FASTA | bcm | -9 |
7868 | NC_045512 | 2Bit | bzip2 | -9 |
8399 | NC_045512 | FASTA | brotli | -q 11 |
8519 | NC_045512 | FASTA | zstd | -19 |
8801 | NC_045512 | FASTA | bzip2 | -9 |
9000 | NC_045512 | FASTA | xz | -9 |
9598 | NC_045512 | FASTA | gzip | -9 |
9623 | NC_045512 | FASTA | rar | m5 |
9738 | NC_045512 | FASTA | zip | -9 |
30416 | NC_045512 | FASTA | Uncompressed | |
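Rough baselines of the kind listed in the table can be approximated with Python's standard library. The sketch below is an approximation under stated assumptions: `zlib`, `bz2` and `lzma` use the same underlying algorithms as gzip, bzip2 and xz, but their container headers differ from the command-line tools, so the byte counts will not match the table exactly. The function name `baseline_sizes` is illustrative.

```python
import bz2
import lzma
import zlib


def baseline_sizes(data):
    """Return compressed sizes in bytes for a few standard-library codecs."""
    return {
        "uncompressed": len(data),
        "zlib (level 9)": len(zlib.compress(data, 9)),
        "bz2 (level 9)": len(bz2.compress(data, 9)),
        "lzma (preset 9)": len(lzma.compress(data, preset=9)),
    }


if __name__ == "__main__":
    with open("NC_045512.fasta", "rb") as handle:
        sizes = baseline_sizes(handle.read())
    # Print the smallest result first, mirroring the ordering of the table.
    for name, size in sorted(sizes.items(), key=lambda item: item[1]):
        print(f"{size:>8}  {name}")
```

Running this on both NC_045512.fasta and NC_045512.2bit gives a quick way to see how much of the gain comes from the format and how much from the compressor.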
Bytes | Decompressor size | Archive size | File | Format | Compressor | Parameters | Submission | Decompression time |
613,466 | 13,499 | 599,967 | Coronavirus | FASTA | LILY-based | | Márcio Pais (21 February 2021) | 1036 seconds [2] |
1,001,144 | 218,149 | 782,995 | Coronavirus | FASTA | cmv | -m2,0,0x0ba36a7f | Mauro Vezzosi (16 February 2021) | |
1,029,832 | 163,453 | 866,379 | Coronavirus | FASTA | cmix | | Innar Liiv (baseline) | (in progress) [1] |
1,353,808 | 30,491 | 1,323,317 | Coronavirus | FASTA | paq8l | -8 | Innar Liiv (baseline) | 86,110 seconds [1] |
1,715,228 | N/A | N/A | Coronavirus | FASTA | xz | -9 | ||
1,979,504 | N/A | N/A | Coronavirus | FASTA | bcm | -9 | ||
1,762,945 | N/A | N/A | Coronavirus | FASTA | brotli | -q 10 | ||
1,915,086 | N/A | N/A | Coronavirus | FASTA | zstd | -19 | ||
1,922,759 | N/A | N/A | Coronavirus | FASTA | rar | m5 | ||
39,083,402 | N/A | N/A | Coronavirus | FASTA | bzip2 | -9 | ||
15,713,689 | N/A | N/A | Coronavirus | FASTA | gzip | -9 | ||
15,713,816 | N/A | N/A | Coronavirus | FASTA | zip | -9 | ||
1,317,937,667 | N/A | N/A | Coronavirus | FASTA | Uncompressed |
Bytes | Decompressor size | Archive size | File | Format | Compressor | Parameters | Submission | Decompression time |
682,262 | 13,754 | 668,508 | Coronavirus | FASTA | LILY-based | | Márcio Pais (12 February 2021) | 591 seconds [2] |
973,571 | 34,713 | 938,858 | Coronavirus | FASTA | GeCo3 | -tm 3:1:1:1:0.8/0:0:0 -tm 6:1:1:1:0.85/0:0:0 -tm 9:1:1:1:0.85/0:0:0 -tm 12:10:0:1:0.85/0:0:0 -tm 15:200:1:10:0.85/2:1:0.85 -tm 17:200:1:10:0.85/2:1:0.85 -tm 20:500:1:40:0.85/5:20:0.85 -lr 0.03 -hs 64 | Milton Silva (10 January 2021) | 6079 seconds [3] [4] |
1,063,646 | 218,149 | 845,497 | Coronavirus | FASTA | cmv | -m2,0,0x0ba36a7f | Mauro Vezzosi (26 January 2021) | [5] |
1,169,683 | 32,170 | 1,137,513 | Coronavirus | 2Bit | LILY | | Márcio Pais (01 January 2021) | 102 seconds [2] |
1,238,330 | 30,491 | 1,207,839 | Coronavirus | 2Bit | paq8l | -8 | Innar Liiv (baseline) | 24,975 seconds [1] |
1,152,411 | 163,453 | 988,958 | Coronavirus | 2Bit | cmix | | Innar Liiv (baseline) | 554,173 seconds [1] |
1,425,590 | N/A | N/A | Coronavirus | FASTA | paq8l | -8 | ||
1,590,505 | N/A | N/A | Coronavirus | FASTA | NAF | | Kirill Kryukov (19 January 2021) | |
1,985,384 | N/A | N/A | Coronavirus | 2Bit | xz | -9 | ||
2,022,796 | N/A | N/A | Coronavirus | FASTA | xz | -9 | ||
2,043,140 | N/A | N/A | Coronavirus | 2Bit | bcm | -9 | ||
2,044,664 | N/A | N/A | Coronavirus | 2Bit | rar | m5 | ||
2,084,998 | 34,713 | 2,050,285 | Coronavirus | FASTA | GeCo3 | -l 1 -lr 0.06 -hs 8 | Innar Liiv (baseline) | 834 seconds [3] [4] |
2,367,487 | N/A | N/A | Coronavirus | 2Bit | zstd | -19 | ||
2,728,490 | N/A | N/A | Coronavirus | FASTA | bcm | -9 | ||
2,871,864 | N/A | N/A | Coronavirus | 2Bit | brotli | -q 10 | ||
2,871,864 | N/A | N/A | Coronavirus | FASTA | brotli | -q 10 | ||
4,217,341 | N/A | N/A | Coronavirus | FASTA | zstd | -19 | ||
5,924,805 | N/A | N/A | Coronavirus | FASTA | rar | m5 | ||
67,575,178 | N/A | N/A | Coronavirus | 2Bit | gzip | -9 | ||
67,575,325 | N/A | N/A | Coronavirus | 2Bit | zip | -9 | ||
75,530,790 | N/A | N/A | Coronavirus | 2Bit | bzip2 | -9 | ||
75,530,790 | N/A | N/A | Coronavirus | FASTA | bzip2 | -9 | ||
77,356,405 | N/A | N/A | Coronavirus | FASTA | gzip | -9 | ||
77,356,550 | N/A | N/A | Coronavirus | FASTA | zip | -9 | ||
332,133,731 | N/A | N/A | Coronavirus | 2Bit | Uncompressed | |||
1,339,868,341 | N/A | N/A | Coronavirus | FASTA | Uncompressed |
The rules are inspired by and therefore very similar to Matt Mahoney's Rules for Large Text Compression Benchmark.
Participants will be ranked by the compressed size of the coronavirus dataset plus the size of the decompressor as a zip archive. Participants can choose whether to use the FASTA or 2Bit format of the dataset. The decompressor size can be measured using either a compressed executable or compressed source code, whichever is smaller.
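The ranking metric is a simple sum, which the following Python sketch illustrates. It is a rough illustration only: the zip settings used for official verification are not stated here, so the deflate level 9 used in `zipped_size` is an assumption, and both function names are hypothetical.

```python
import io
import zipfile


def zipped_size(name, payload):
    """Size in bytes of an in-memory zip archive holding one file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as archive:
        archive.writestr(name, payload)
    return len(buffer.getvalue())


def total_score(archive_size, decompressor_zip_size):
    """Ranking metric: compressed dataset size plus zipped decompressor size."""
    return archive_size + decompressor_zip_size


if __name__ == "__main__":
    # Figures from the current top row of the results table.
    print(total_score(599_967, 13_499))
```

With the figures from the top row of the results table, an archive of 599,967 bytes plus a decompressor of 13,499 bytes gives the 613,466 bytes to beat.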
The challenge is open for anyone to participate and contribute. Submissions beating the current top solution should be sent to coronavirus( a t )innar.com for verification.
Join the Coronavirus Data Compression Benchmark Discussion Group!