
I have 3 files, each in excess of 10GB, that I need to split into 6 smaller files. I would normally use something like R to load the files and partition them into smaller chunks, but the size of the files prevents them from being read into R - even with 20GB of RAM.

I'm stuck on how to proceed next and would greatly appreciate any tips.

meb33
  • Can you open the files in Excel? Do you know the exact size of each file? Try converting them into .csv files and then using the `split` utility on each in Unix, as `split -b 5368709120 _`. This assumes the files are exactly 10GB and you need 2 parts for each of the three. Get the file size with `ls -l`. The result would be two new files, `_a` and `_b`, in the same directory. – jigsawmnc Oct 28 '18 at 04:57
  • Hint: use a file-system-related OS function (e.g. stat) to find the size of the file, divide by three, and read the first file in small chunks, writing out to a new file, switching on a line boundary around the 1/3 size. Then switch to a new 1/3 file, etc. – Keith Oct 28 '18 at 05:19
  • If you’re working with large amounts of data, it might help to use a framework for distributed computations on Big Data, such as Apache Spark (which has APIs in Python, R, Scala and Java). You can use it to achieve your current goal of splitting the csv into smaller files with something like `spark.read.csv(input_path).repartition(num_output_files).write.csv(output_path)`. But you can also use it for so much more, like distributing your computation instead of explicitly dealing with splitting the data into smaller chunks. – Zohar Meir Oct 28 '18 at 05:34
  • @jigsawmnc Unfortunately these files exceed the row limit in Excel. They are each in excess of 250M rows. I don't have a distributed file system that I can use, unfortunately. – meb33 Oct 28 '18 at 19:52
  • @meb33 I can't verify whether `split` will work for files this large. Even without saving the files as .csv you can try `split` on them, using either bytes or lines as the criterion for the split. For example, you can run split with a manageable number of lines, like 1000: `split -l 1000 _`. This will yield T/1000 files in your directory, where T is the number of lines in your file. You can then merge these files back together with `cat`: `cat file_1 file_2 ... > final_file` – jigsawmnc Oct 28 '18 at 22:10
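
For reference, here is a minimal, untested Python sketch of the size-based idea from the comments above. It assumes each input should be cut into two parts at a line boundary near the halfway point, and that the file has a single header row; `source.csv` and `part_1.csv` / `part_2.csv` are placeholder names. The same loop would be repeated for each of the three input files.

import os

input_path = "source.csv"                      # placeholder name for one of the 10GB files
half_size = os.stat(input_path).st_size // 2   # byte offset to aim the split at

with open(input_path, "r") as src:
    header = src.readline()                    # assumes a single header row
    part_num = 1
    out = open("part_1.csv", "w")
    out.write(header)
    written = 0

    for line in src:
        out.write(line)
        written += len(line)                   # character count as a rough byte estimate
        # once the first part is roughly half the input, switch to the second file
        if part_num == 1 and written >= half_size:
            out.close()
            part_num = 2
            out = open("part_2.csv", "w")
            out.write(header)

    out.close()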

1 Answer


In Python, with the use of generators/iterators, you don't need to load all the data into memory.

Just read it line-by-line.

The csv library gives you reader and writer classes that will do the job.

To split your file you can write something like this:

import csv

# your input file (10GB)
# newline='' is the recommended way to open files for the csv module
in_csvfile = open('source.csv', "r", newline='')

# reader that will read the file for you line-by-line
reader = csv.DictReader(in_csvfile)

# number of rows written to the current output file
num = 0

# number of the current output file
output_file_num = 1

# your output file
out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')

# the writer is constructed inside the read loop,
# because we need the csv header to already be available
# to construct the writer object
writer = None

for row in reader:
    num += 1

    # Here you have your data line in the row variable

    # If the writer doesn't exist yet, create one
    if writer is None:
        writer = csv.DictWriter(
            out_csvfile,
            fieldnames=row.keys(),
            delimiter=",", quotechar='"', escapechar='"',
            lineterminator='\n', quoting=csv.QUOTE_NONNUMERIC
        )
        # repeat the header row in every output file
        writer.writeheader()

    # Write the row through the writer (into out_csvfile, remember?)
    writer.writerow(row)

    # Once 10000 rows have been written, close the current
    # output file and create a new one
    if num >= 10000:
        output_file_num += 1
        out_csvfile.close()
        writer = None

        # create the new file
        out_csvfile = open('out_{}.csv'.format(output_file_num), "w", newline='')

        # reset the counter
        num = 0

# Closing the files
in_csvfile.close()
out_csvfile.close()

I haven't tested it, I wrote it off the top of my head, so there could be bugs :)
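
If you want each input to end up as exactly two output files (6 outputs from the 3 inputs mentioned in the question) rather than fixed 10000-row chunks, one possible variation, also untested, is to make a first pass that counts the rows and derive the threshold from that. The `count_data_rows` helper and the `parts_per_file` value below are placeholders for illustration:

import csv
import math

def count_data_rows(path):
    # first pass: read the whole file once just to count data rows (excluding the header)
    with open(path, "r", newline='') as f:
        reader = csv.reader(f)
        next(reader)                     # skip the header row
        return sum(1 for _ in reader)

parts_per_file = 2                       # 3 inputs x 2 parts each = 6 outputs, as in the question
total_rows = count_data_rows('source.csv')
rows_per_part = math.ceil(total_rows / parts_per_file)

# rows_per_part can then replace the hard-coded 10000 in the loop above:
#     if num >= rows_per_part: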

MihanEntalpo
  • I looked at some similar solutions here: https://stackoverflow.com/questions/3094866/trimming-a-huge-3-5-gb-csv-file-to-read-into-r#3108397. And here: https://stackoverflow.com/questions/2321333/how-to-read-in-a-text-file-to-r-using-textconnection-and-librarysqldf/2328295#2328295. But what I don't get is how, after reading in X rows from one file, I can be sure I am picking up the remaining rows when going through the next iteration. – meb33 Oct 28 '18 at 19:53
  • Did you read my code? When num reaches 10000 (for example), we close the previous file and open the next one, and writing continues, but into the next file. – MihanEntalpo Oct 29 '18 at 01:27
  • My apologies - I missed that. Let me try running this tonight. – meb33 Oct 29 '18 at 02:08
  • You can create a file of 1000 lines and split it into files of 50 lines each. That lets you test the script without waiting hours for it to finish. – MihanEntalpo Oct 29 '18 at 07:23
  • That works beautifully! My only regret is that my reputation is below the voting threshold or I would upvote the submission. Thank you again. – meb33 Oct 29 '18 at 17:34
  • Anyway, you can accept an answer at any reputation level. – MihanEntalpo Oct 30 '18 at 03:14