
I want to test my word-count software, which is based on the MapReduce framework, with a very large file (over 1 GB), but I don't know how to generate one.

Are there any tools to create a large file of random but sensible English sentences? Thanks

Antonio1996

2 Answers


A simple Python script can create a pseudo-random document of words. Here is one I wrote for just such a task a year ago:

import random

# Words to draw from; each carries a trailing space as a separator
pseudo_random_words = ["Apple ", "Banana ", "Tree ", "Pickle ", "Toothpick ", "Coffee ", "Done "]

# Append mode: running the script again adds to an existing test.txt
with open("test.txt", "a") as file1:
    # Increase the range to make a bigger file
    for x in range(150000000):
        # Change the end of the randint range below if you add more words
        index = random.randint(0, 6)
        file1.write(pseudo_random_words[index])
        # Insert a line break once every 20 words
        if x % 20 == 0:
            file1.write('\n')

Just add more words to the list to make it more random, and increase the upper bound of the random.randint call to match. I just tested it, and it creates a document named test.txt of about one gigabyte. The file contains words from the list in random order, with a line break every 20 words.
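
If you would rather ask for a size than tune a loop count, the same idea can be sketched as follows. This is only a rough illustration, not the script above: the function name write_random_words, its parameters, and the optional seed are made up for the example. It writes whole lines of 20 random words until at least the requested number of bytes has been written.

```python
import random

# Hypothetical word list; extend it freely, no index needs updating
WORDS = ["Apple", "Banana", "Tree", "Pickle", "Toothpick", "Coffee", "Done"]


def write_random_words(path, target_bytes, words_per_line=20, seed=None):
    """Write lines of random words to `path` until at least
    `target_bytes` bytes have been written. Returns the byte count."""
    rng = random.Random(seed)  # a fixed seed makes the output repeatable
    written = 0
    # newline="\n" keeps line endings at one byte on every platform
    with open(path, "w", encoding="ascii", newline="\n") as out:
        while written < target_bytes:
            line = " ".join(rng.choice(WORDS) for _ in range(words_per_line)) + "\n"
            out.write(line)
            written += len(line)  # ASCII text: one byte per character
    return written
```

Calling write_random_words("test.txt", 1024 ** 3) should give a file slightly over 1 GiB, since the loop stops only after the line that crosses the target.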

Peyton

I wrote this simple Python script that scrapes the Project Gutenberg site and writes the text (encoding: us-ascii; for other encodings see http://www.gutenberg.org/files/) to a local text file. It can be used in combination with https://github.com/c-w/gutenberg for more precise filtering (by language, by author, etc.).

from __future__ import print_function

import sys

import requests

# Expect exactly one argument: how many ebook IDs to try
if len(sys.argv) != 2:
    print("[---------- ERROR ----------] Usage: scraper <number_of_files>", file=sys.stderr)
    sys.exit(1)

number_of_files = int(sys.argv[1])

# The with block closes big_file.txt even if a download fails midway
with open("big_file.txt", "w+") as text_file:
    for i in range(number_of_files):
        url = 'http://www.gutenberg.org/files/' + str(i) + '/' + str(i) + '.txt'
        resp = requests.get(url)
        if resp.status_code != 200:
            print("[X] resp.status_code =", resp.status_code, "for", url)
            continue
        print("[V] resp.status_code = 200 for", url)
        try:
            content = resp.text

            # Crude cleaning: keep only the text between the START and END markers
            splitted_content = content.split("*** START OF THIS PROJECT GUTENBERG EBOOK")
            splitted_content = splitted_content[1].split("*** END OF THIS PROJECT GUTENBERG EBOOK")
            print(splitted_content[0], file=text_file)
        except (IndexError, UnicodeEncodeError):
            # No START marker, or text the output file cannot encode: skip it
            continue
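
The cleaning step can also be pulled out into a small function so it can be tried without any network access. This is a rough sketch under a few assumptions: the name extract_body is made up for the example, the marker strings are the ones used in the script above, and unlike the script it drops the rest of the START line (which carries the title) and returns None when either marker is missing.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"


def extract_body(text):
    """Return the text between the START and END marker lines,
    or None if either marker cannot be found."""
    start = text.find(START_MARKER)
    if start == -1:
        return None
    # Skip the remainder of the START line, which also holds the book title
    line_end = text.find("\n", start)
    if line_end == -1:
        return None
    end = text.find(END_MARKER, line_end)
    if end == -1:
        return None
    return text[line_end + 1:end].strip()
```

In the loop above, the result of extract_body(resp.text) could be written to the output file whenever it is not None.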
Antonio1996