29

I am looking for large text files for testing compression and decompression at all sizes from 1 KB to 100 MB. Can someone please point me to a link where I can download them?

5 Answers

24

*** Linux users only ***

Arbitrarily large text files can be generated on Linux with the following command:

tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100 | head -n 100000 > bigfile.txt

This command generates a text file containing 100,000 lines of random text that looks like this:

NsQlhbisDW5JVlLSaZVtCLSUUrkBijbkc5f9gFFscDkoGnN0J6GgIFqdCLyhbdWLHxRVY8IwDCrWF555JeY0yD0GtgH21NotZAEe
iWJR1A4 bxqq9VKKAzMJ0tW7TCOqNtMzVtPB6NrtCIg8NSmhrO7QjNcOzi4N b VGc0HB5HMNXdyEoWroU464ChM5R Lqdsm3iPo
1mz0cPKqobhjDYkvRs5LZO8n92GxEKGeCtt oX53Qu6T7O2E9nJLKoUeJI6Ul7keLsNGI2BC55qs7fhqW8eFDsGsLPaImF7kFJiz
...
...

On my Ubuntu 18 system its size is about 10 MB. Bumping up the number of lines, and thereby the size, is easy: just increase the head -n 100000 part. So, say, this command:

tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100 | head -n 1000000 > bigfile.txt

will generate a file with 1,000,000 lines of random text, at around 100 MB. On my commodity hardware the latter command takes about 3 seconds to finish.
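If you need to hit a specific target size, you can compute the head -n count directly instead of guessing: each line from fold -w100 is 101 bytes (100 characters plus a newline). A minimal Python sketch of that arithmetic, using 110 MiB as an example target:

# Each line produced by `fold -w100` is 101 bytes: 100 characters + 1 newline.
def lines_for_size(target_bytes, width=100):
    """Return the head -n count needed to reach roughly target_bytes."""
    return target_bytes // (width + 1)

# e.g. a 110 MiB target needs about 1,142,013 lines:
print(lines_for_size(110 * 1024 * 1024))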

codemonkey
  • Thanks, it is a nice command. But how can I know the head count for each size? For a 110 MB file, what will the head count be? Thanks. – Ravi Teja Nov 09 '20 at 12:06
  • It would depend on the length of the line you're putting in. For example, with `Some text entry here. This will be on each line...` on each line, you would need 2,300,000 lines to get a file 112M in size. That took me about 3 tries to figure out. So just run the command with a random number for the head count, and the resultant size will guide you as to how to adjust it to hit the target size. – codemonkey Nov 09 '20 at 17:00
  • Yeah, did the same; just thought there might be something easier. Thanks. – Ravi Teja Nov 13 '20 at 10:41
  • Um, no, not at all what you want for testing the effectiveness of your compressor. The resulting text is _highly_ repetitive, and does not represent what a compressor will see in the real world. Do not use this answer. See the compression corpora in the other answers. – Mark Adler Dec 13 '20 at 17:38
  • @MarkAdler I think I have addressed the repetition concern with the edit. – codemonkey Mar 06 '21 at 04:32
  • Also not useful for compression testing. All compressors will compress about the same amount, due simply to the subset of bytes present in the file. This is easily calculated to be a best case of log2(63)/8 ~= 0.747. Indeed I get 0.756 from gzip and 0.766 from xz. Normally, xz can do much better than gzip on real-world data. There is now too little redundancy, in that there are no repeated strings, nor any other way to predict text, since it's random. It is pointless to try to randomly generate data to test compressors, except for the special case of testing exactly that. – Mark Adler Mar 06 '21 at 07:59
  • @MarkAdler Thanks for the clinic on compression :) – codemonkey Mar 06 '21 at 18:57
  • Results in `tr: Illegal byte sequence` on MacOS. – Gary Sep 03 '21 at 19:16
  • If you are running this on a Mac, you must install coreutils (`brew install coreutils`), then use `gtr` instead of `tr` with the same command. – warvolin Mar 10 '22 at 04:56
23

And don't forget the collection of corpora:

The Canterbury Corpus
The Artificial Corpus
The Large Corpus
The Miscellaneous Corpus
The Calgary Corpus

See: https://corpus.canterbury.ac.nz/descriptions/

There are download links for the files in each set.
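Once you've downloaded and unpacked one of the archives, measuring compression on a corpus file takes only a few lines. A minimal sketch using Python's stdlib gzip and lzma; the path is a placeholder for whichever corpus file you unpack (alice29.txt is one of the Canterbury set):

import gzip, lzma
from pathlib import Path

# Placeholder path: point this at any file unpacked from a corpus archive.
data = Path('cantrbry/alice29.txt').read_bytes()

for name, packed in [('gzip', gzip.compress(data, 9)),
                     ('xz/lzma', lzma.compress(data))]:
    print(f'{name}: {len(packed)} bytes, ratio {len(packed) / len(data):.3f}')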

Mark Adler
Phillip Williams
  • FWIW: 1) the individual files are not available for download, only the zipped corpus files; 2) the download is not secure, so most browsers will complain; 3) the sizes of the files are displayed in bytes. – Abhijit Sarkar Jan 22 '23 at 06:18
20

You can download enwik8 and enwik9 from here. They are, respectively, 100,000,000 and 1,000,000,000 bytes of text used for compression benchmarks. You can always pull subsets of those for smaller tests.
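To cover the full 1 KB to 100 MB range from the question, you can slice prefixes out of enwik9 rather than hunting for separate files. A minimal sketch, assuming enwik9 has been downloaded into the current directory:

# Cut prefixes of enwik9 into test files from 1 kB up to 100 MB.
sizes = [10**n for n in range(3, 9)]  # 1 kB, 10 kB, ..., 100 MB

with open('enwik9', 'rb') as src:
    for size in sizes:
        src.seek(0)
        with open(f'test_{size}.txt', 'wb') as dst:
            dst.write(src.read(size))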

Mark Adler
2

Project Gutenberg looks exceptionally promising for this purpose. The site hosts thousands of books in many formats, and clicking on any title reveals the various formats on offer, which always include Plain Text UTF-8 (.txt).

Another possible place to get large amounts of random text data for compression testing would be data-dump sites such as Wikipedia or even Stack Exchange.

I've also found this blog post, which lists 10 open-source places to get complex text data for analytics testing.

There are a lot of online resources for creating large text files of arbitrary or specific sizes, such as this Lorem Ipsum generator, which I have used for script development. But I've learned that these sources are no good for compression testing: the vocabulary tends to be limited and repetitive, so the files compress considerably more than natural text would.
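If you go the Project Gutenberg route, hitting an exact target size is straightforward: concatenate downloaded .txt books and truncate. A minimal sketch; the book filenames are hypothetical placeholders for files you've already downloaded:

from pathlib import Path

TARGET = 10 * 1024 * 1024  # 10 MiB target; adjust as needed
books = ['book1.txt', 'book2.txt', 'book3.txt']  # hypothetical downloaded files

# Concatenate books until the target size is reached, truncating the last one.
with open('testfile.txt', 'wb') as out:
    written = 0
    for book in books:
        if written >= TARGET:
            break
        chunk = Path(book).read_bytes()[:TARGET - written]
        out.write(chunk)
        written += len(chunk)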

Mark Adler
Justin Edwards
  • Also not representative of actual text. Do not use for testing compressors with the objective of getting compression ratios for text. – Mark Adler Oct 28 '22 at 19:19
  • @MarkAdler - It's obvious to me from your answer history that you know what you're talking about, so, respectfully, I'm quite interested in any explanation you can offer. Why would this not be representative of actual text? To my untrained eye, the consonant-to-vowel ratios, sentence lengths, and paragraph structures all seem to approximate what I see in everyday essay documents. – Justin Edwards Oct 29 '22 at 02:38
  • There are only 175 unique words in the generated text, no matter its length. Even 1st-grade readers have more vocabulary words than that. As a result, the Lorem ipsum text will compress much better than actual English text. – Mark Adler Oct 29 '22 at 04:07
  • @MarkAdler - Thank you; that makes perfect sense. [I recently developed a Jython script for splitting text files](https://forum.inductiveautomation.com/t/making-text-files-less-than-3-mb/65769/11), and I was able to quickly generate large text files using that tool. It worked great for that purpose, so, remembering this question, I intuitively felt like it would be of value here. I'm sure it would help anybody like me who stumbles onto this post looking for large text file resources for other purposes, but to that end, what is your advice on how to improve this answer? – Justin Edwards Oct 29 '22 at 04:38
  • Delete it, or replace it with a different answer. Lorem ipsum is no help here. – Mark Adler Oct 29 '22 at 06:10
  • Understood, and I appreciate your time and education. I'll find a way to fix the answer over the weekend. – Justin Edwards Oct 29 '22 at 06:22
  • Much better.... – Mark Adler Nov 01 '22 at 21:30
0

You can use Python for this (download it at python.org if you haven't already). This first script generates a file made up entirely of 'M's:

size = ''

# Keep asking until the user enters a number of megabytes.
while not size.isnumeric():
    size = input('How big would you like your file to be (MB)? ')

size = int(size)

name = input('Where would you like to locate the file? ')

# Write one megabyte of 'M' characters per requested MB.
with open(name, 'w') as f:
    for mb in range(size):
        f.write('M' * 1000000)

print('Done!')

Or you can use this to generate a file of completely random printable characters:

import random

size = ''

# All ASCII letters, digits, and punctuation to choose from.
characters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"

while not size.isnumeric():
    size = input('How big would you like your file to be (MB)? ')

size = int(size)

name = input('Where would you like to locate the file? ')

# Write one megabyte of randomly chosen characters per requested MB.
with open(name, 'w') as f:
    for mb in range(size):
        f.write(''.join(random.choices(characters, k=1000000)))

print('Done!')

And finally, to generate a bunch of words with spaces between them:

import random

size = ''

words = '''
hello
goodbye
yay!
'''  # Your word list here, separated by newlines

print('Interpreting list...\n')

# Build the word list: skip blank lines and add a trailing space to each word.
wl = [word.strip() + ' ' for word in words.split('\n') if word.strip()]

while not size.isnumeric():
    size = input('How big would you like your file to be (\'1\' would be 1000 words)? ')

size = int(size)

name = input('What would you like to be the path to the file? ')

# Write 1000 randomly chosen words per unit of size.
with open(name, 'w') as f:
    for chunk in range(size):
        f.write(''.join(random.choices(wl, k=1000)))

print('Done!')

Remember that some of these take longer than others; the word-list one can take quite a while.