
I recently downloaded a CSV that turned out larger than I anticipated (the size wasn't available until the download finished). The file is >100 GB and my drive only has around 25 GB free at this point.

Since CSV is not a very space-efficient format, is there a way to 'stream' the data from CSV into a more compressed format? That is, read the CSV in chunks, write each chunk to Feather or Parquet, delete the written lines from the CSV, and continue until the CSV has been 'emptied'?

A Python solution would be ideal.
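Roughly what I have in mind is the sketch below. It's untested and makes some assumptions (pandas plus pyarrow, dtypes that the first chunk infers correctly, and enough room on the drive for the Parquet output), and it skips the 'delete the written lines' part, which is the piece I'm not sure is possible:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "big.csv"          # placeholder paths
parquet_path = "big.parquet"

writer = None
# Read the CSV in ~1M-row chunks instead of loading it all at once.
for chunk in pd.read_csv(csv_path, chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The Parquet file takes its schema from the first chunk, so
        # columns with mixed types may need explicit dtypes in read_csv.
        writer = pq.ParquetWriter(parquet_path, table.schema, compression="snappy")
    writer.write_table(table)
if writer is not None:
    writer.close()
```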

Khashir
  • you could read in the CSV in chunks using pandas https://www.folkstalk.com/2022/10/pandas-read-chunk-of-csv-with-code-examples.html#:~:text=the%20previous%20result.-,How%20do%20you%20read%20data%20in%20Panda%20chunks%3F,the%20whole%20DataFrame%20at%20once. and then use pandas to write to a feather file https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_feather.html – Andrew Ryan Dec 08 '22 at 04:57
  • There is no good way to "delete the written lines from the CSV". Your best bet is to free up disk space (or get a second drive). If you're running on a platform that supports volume compression, consider turning it on. – jdigital Dec 08 '22 at 05:01
  • @jdigital: Bummer--without the deletion piece the whole thing falls apart :P – Khashir Dec 08 '22 at 05:04
  • What's your estimation for the compressed file size? – Kelly Bundy Dec 08 '22 at 05:20
  • From https://posit.co/blog/speed-up-data-analytics-with-parquet-files/, it looks like a parquet file of 100 GB would be around 16 GB; while feather would be around 50 GB. In theory this means that the parquet file would fit without deleting anything, but that doesn't take into account the space needed to process the data. – Khashir Dec 08 '22 at 05:30
  • I meant "a CSV file of 100 GB would be a 16 GB parquet file" – Khashir Dec 08 '22 at 05:36
  • @jdigital How about going in reverse, moving chunks into separate files (say one gig each) and truncating the source file as you go. That should allow splitting the behemoth into more manageable pieces. Compress the smaller files as well to get more breathing space. Once that's done, it should be a lot less painful to convert into some less wasteful format. | I'd probably do a regular pass through the file at the beginning to identify split points (so you don't end up splitting in the middle of a row). (A rough sketch of this is below the comments.) – Dan Mašek Dec 08 '22 at 17:00
  • @DanMašek: That sounds like it would do the trick, with truncation being the key piece. – Khashir Dec 08 '22 at 19:44
  • @KellyBundy sorry forgot to tag you in my reply--a 100 GB CSV would be around 16 GB as .parquet, or 50 GB as .feather – Khashir Dec 08 '22 at 19:44
  • Also, a lot of compression libraries (well, maybe not in Python, but the low-level ones) support streaming compression and decompression -- you feed it smallish chunks, and get smallish compressed chunks out in a loop. LZMACompressor/LZMADecompressor in Python seem to support this method, for example. A CSV ought to easily compress to less than 25% of the original size. So you can probably compress using a command line tool, make a lot of more free space, and then decompress and parse the CSV lines in small chunks. (See the second sketch after the comments.) – Dan Mašek Dec 08 '22 at 19:55
  • Thanks @DanMašek -- would you be able to write up an answer of how to do this, either in the command line or Python? I'll take a look and fiddle with it, sounds promising. – Khashir Dec 08 '22 at 23:59
  • (I just checked and have LZMA installed in Python) – Khashir Dec 09 '22 at 00:00
  • Maybe later, I'm rather busy right now. If possible, can you [edit] your question and paste an example line or two from the CSV? Just so I can synthesize something similar over here to test with, if I do get to writing something up. – Dan Mašek Dec 09 '22 at 02:47
  • No worries, I appreciate it. I can't actually provide an example because the CSV has >300 columns ;\ – Khashir Dec 10 '22 at 04:07
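For reference, a rough sketch of the split-and-truncate approach Dan Mašek describes above, using only the standard library. The path, piece size, and `.partNNNN` naming are placeholders; pieces are peeled off the end of the file, so they come out in reverse order, and the header row stays in the shrinking original:

```python
import os

src = "big.csv"               # placeholder
chunk_bytes = 1 * 1024**3     # ~1 GB per piece
buf_bytes = 64 * 1024**2      # copy in 64 MB blocks
part = 0

with open(src, "r+b") as f:
    size = f.seek(0, os.SEEK_END)
    while size > chunk_bytes:
        # Aim for ~1 GB from the end, then move forward to the next
        # newline so the cut never lands in the middle of a row.
        f.seek(size - chunk_bytes)
        f.readline()                      # skip the (possibly partial) row
        cut = f.tell()
        if cut >= size:                   # one giant row with no newline; stop
            break
        # Copy everything after the cut into a separate piece file.
        with open(f"{src}.part{part:04d}", "wb") as out:
            remaining = size - cut
            while remaining:
                block = f.read(min(buf_bytes, remaining))
                out.write(block)
                remaining -= len(block)
        # Shrink the original so the copied bytes are freed immediately.
        f.truncate(cut)
        size = cut
        part += 1
```

Each piece can then be compressed or converted independently; since truncate() frees the copied bytes right away, the extra disk usage at any moment stays around one piece.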
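And a rough sketch of the streaming LZMA idea from the same thread: compress the CSV in fixed-size chunks with lzma.LZMACompressor, delete the original once the compressed copy is verified, then let pandas read the .xz file in chunks through lzma.open(). Paths, preset, and chunk sizes are placeholders, and the compressed copy still has to fit in the free space that's left:

```python
import lzma
import pandas as pd

src = "big.csv"            # placeholder
dst = "big.csv.xz"

# 1) Compress in a loop, never holding more than one chunk in memory.
comp = lzma.LZMACompressor(preset=6)
with open(src, "rb") as fin, open(dst, "wb") as fout:
    while True:
        block = fin.read(64 * 1024 * 1024)   # 64 MB at a time
        if not block:
            break
        fout.write(comp.compress(block))
    fout.write(comp.flush())

# ... after verifying big.csv.xz, delete big.csv to reclaim the space ...

# 2) Later: lzma.open() decompresses transparently, so pandas can read
#    the compressed file in chunks without ever materialising the CSV.
with lzma.open(dst, "rt") as f:
    for chunk in pd.read_csv(f, chunksize=1_000_000):
        pass  # e.g. hand each chunk to a ParquetWriter as in the sketch above
```

The first step could equally be done with a command-line tool, as suggested in the comment; the chunked decompress-and-parse step is the same either way.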

0 Answers