
I have a dataset that is 86 million rows x 20 columns with a header, and I need to convert it to a CSV in order to dump it into BigQuery (adding multiple tags from that). The obvious solution is reading the .txt file with pd.read_csv, but I don't have enough memory on my device to hold 86 million rows and it will crash Jupyter.

I'm aware of other threads, such as "How to convert a tab delimited text file to a csv file in Python", but my issue seems rather niche.

Is there a way I could go about this? I thought about Vaex, but I'm completely unfamiliar with the toolkit, and it doesn't seem to have a CSV writer built in.

Current thoughts would be:

import csv

csv_path = r'csv_test.csv'
txt_path = r'txt_test.txt'

with open(txt_path, "r", newline="") as in_text:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    with open(csv_path, "w", newline="") as out_csv:
        out_writer = csv.writer(out_csv, delimiter=",")
        for row in in_reader:
            out_writer.writerow(row)

Currently, I am receiving an error stating: Error: field larger than field limit (131072)

It seems that's a cap on the size of a single field, so something in my data is quite a bit over it. I've gotten a CSV to generate from smaller inputs (only using 3 of the 35 total .txt files), but when I attempt to use all of them, it fails with the code above. Update: I have raised the csv field size limit using sys.maxsize and am still receiving the same error.
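
For reference, the limit-raising code I'm referring to is roughly this (a sketch of the common approach, not necessarily the exact code I ran):

import csv
import sys

# Raise the csv module's per-field cap. Passing sys.maxsize directly can raise
# OverflowError on some platforms, so back off until a value is accepted.
limit = sys.maxsize
while True:
    try:
        csv.field_size_limit(limit)
        break
    except OverflowError:
        limit = int(limit / 10)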

I have no way to verify if this works due to the sheer size of the dataset, but it seems like it /should/ work. Trying to read it with Vaex would work if I wasn't getting parsing errors due to there being commas within the data.

So I have 3 questions:

  1. Is there a way I can write a larger-sized CSV?

  2. Is there a way to dump the large pipe-delimited .txt file into BigQuery in chunks, as different CSVs?

  3. Can I dump 35 CSVs into BigQuery in one upload?

Edit:

here is a short dataframe sample:

|CMTE_ID| AMNDT_IND| RPT_TP| TRANSACTION_PGI| IMAGE_NUM| TRANSACTION_TP| ENTITY_TP| NAME| CITY| STATE| ZIP_CODE| EMPLOYER| OCCUPATION| TRANSACTION_DT| TRANSACTION_AMT| OTHER_ID| TRAN_ID| FILE_NUM| MEMO_CD| MEMO_TEXT| SUB_ID
0|C00632562|N|M4|P|202204139496092475|15E|IND|NAME, NAME|PALO ALTO|CA|943012820.0|NOT EMPLOYED|RETIRED|3272022|5|C00401224|VTEKDYJ78M3|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955005
1|C00632562|N|M4|P|202204139496092487|15E|IND|NAME, NAME|DALLAS|TX|752054324.0|SELF EMPLOYED|PHOTOGRAPHER|3272022|500|C00401224|VTEKDYJ7BD4|1581595||* EARMARKED CONTRIBUTION: SEE BELOW|4041920221470955041
birdman
  • How big is the largest field/column, in chars? Can you edit your post and include a sample of the header and first row? I mocked up a field that's 131073 characters, and I get that error, so _if_ you have a field that is that large then we'll have to figure something out. But, a .1 MB field doesn't seem correct, so I assume there's some misconfiguration between the actual text and the csv reader. – Zach Young Oct 12 '22 at 01:57
  • Great! So, your data is "pipe delimited", not tab delimited, so try this config for your reader: `reader = csv.reader(f, delimiter="|", skipinitialspace=True)`. I added skipinitialspace because at least some of your headers have leading spaces. – Zach Young Oct 12 '22 at 04:01
  • Apologies, updated the OP again. Still receiving same error. – birdman Oct 12 '22 at 05:13
  • I've run your code from above, with that 3-line sample as txt_path, and it properly converted to a CSV. I did have to change the var name txt_file to txt_path, in the call to open(). – Zach Young Oct 12 '22 at 05:36
  • I've gotten it to run as well, just only seems to happen when it's at extremes. – birdman Oct 12 '22 at 16:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/248763/discussion-between-zach-young-and-birdman). – Zach Young Oct 12 '22 at 17:13

1 Answer


I think there is a red herring going on here:

Is there a way I can write a larger-sized CSV?

Yes. The reader and writer work as iterators: they step through the file row by row, and at no stage do they try to read the whole thing into memory, so they can handle a file of any length. Something else is going wrong in your example.
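
To illustrate, here is a minimal sketch of that streaming style, reusing the file names from the question and assuming pipe-delimited input; memory use stays flat no matter how many rows there are:

import csv
import sys

# Allow fields larger than the 128 KiB default; 2**31 - 1 fits the C long
# that field_size_limit expects on every platform.
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))

txt_path = "txt_test.txt"   # pipe-delimited source (name taken from the question)
csv_path = "csv_test.csv"   # comma-delimited output

with open(txt_path, "r", newline="") as in_text, \
     open(csv_path, "w", newline="") as out_csv:
    in_reader = csv.reader(in_text, delimiter="|", skipinitialspace=True)
    out_writer = csv.writer(out_csv, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    for row in in_reader:   # one row at a time; the whole file is never in memory
        out_writer.writerow(row)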

Is there a way to dump the large tab-delimited .txt file into BigQuery in chunks as different CSVs?

You shouldn't need to.

Can I dump 35 CSVs into BigQuery in one upload?

That's more of a BigQuery API question, so I won't attempt to answer it here.

In your code the delimiter is set to a pipe, but in question 2 you describe the file as tab delimited. If you give the code the wrong delimiter, it can end up reading far more content into a single field than it's expecting and fail when it hits the field-size limit. This sounds like it might be what's going on in your case.
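
A quick toy example of that failure mode (made-up values, not the OP's data): with the wrong delimiter the entire line is treated as one giant field, which is exactly the kind of thing that blows past the 131072-character cap on real rows.

import csv
import io

line = "C00632562|N|M4|P|202204139496092475|15E|IND\n"

# Wrong delimiter: the whole line lands in a single field.
print(next(csv.reader(io.StringIO(line), delimiter="\t")))
# ['C00632562|N|M4|P|202204139496092475|15E|IND']

# Right delimiter: the line splits into the expected columns.
print(next(csv.reader(io.StringIO(line), delimiter="|")))
# ['C00632562', 'N', 'M4', 'P', '202204139496092475', '15E', 'IND']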

Also, watch out when writing the file back out with a different delimiter: in the data sample you posted there are commas embedded in the text, and without proper quoting that can produce a corrupted file when it is read back in on the other side. Take some time to think about your target CSV dialect, in terms of text quoting, chosen delimiter and so on.
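
For example (a small sketch with made-up values), letting the writer quote any field that contains the output delimiter keeps the embedded commas intact:

import csv
import io

out = io.StringIO()
writer = csv.writer(out, delimiter=",", quoting=csv.QUOTE_MINIMAL)

# "NAME, NAME" contains a comma, so the writer wraps that field in quotes.
writer.writerow(["C00632562", "NAME, NAME", "PALO ALTO"])
print(out.getvalue())
# C00632562,"NAME, NAME",PALO ALTO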

Try replacing the | with \t and see if that helps.

If all you're doing is changing the delimiter from one character to another, is that a useful step at all? You could forget the CSV nature of the file entirely, read it line by line, and write the lines out unmodified (iterating over the file object, or plain readline()/write() calls, is enough), which would probably speed things up too. Because that is still iterative, you won't have to worry about loading the whole file into RAM; you just stream from the source to the target. Beware of how long it might take, though, and if you have a patchy network it can all go horribly wrong. But at least it's a different error!
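
A minimal sketch of that line-by-line copy (file names reused from the question; note that BigQuery's CSV load configuration accepts a custom field delimiter such as |, so re-delimiting may genuinely be unnecessary):

# Stream the source to the target one line at a time, without any CSV parsing.
src_path = "txt_test.txt"
dst_path = "csv_test.csv"

with open(src_path, "r") as src, open(dst_path, "w") as dst:
    for line in src:   # file objects iterate lazily; only one line is in memory at a time
        dst.write(line)

If no per-line handling is needed at all, shutil.copyfileobj(src, dst) does the same job as a buffered copy.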

Thomas Kimber