
I am trying to break a huge data file into smaller parts. I am using the following script:

 df = pd.read_csv(file_name, header=None, encoding='latin1', sep='\t', nrows=100000, skiprows=100000)

but I see that the `skiprows` argument skips around 200000 rows instead of 100000. Can anyone tell me why this is happening?
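
For reference, my expectation is that `skiprows=100000` skips only the first 100000 lines of the file and `nrows=100000` then reads the next 100000, so producing each slice should look roughly like this (a minimal sketch of the intended loop; `file_name` and `file_destination` are placeholders for the real paths):

import pandas as pd

chunk = 100000
for i in range(5):  # first five slices, as an illustration
    part = pd.read_csv(file_name, header=None, encoding='latin1', sep='\t',
                       nrows=chunk, skiprows=i * chunk)
    part.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)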

Uasthana
  • why not just specify a `chunksize=100000` which will return you a slice of the df, so you can then split the df for each chunk? – EdChum Dec 01 '16 at 15:43
  • The actual file is 190 GB; I won't be able to read it all into memory at once – Uasthana Dec 01 '16 at 15:46
  • You don't need to; with `chunksize` this will just read the next chunksize rows, and you can then do whatever you want with that chunk – EdChum Dec 01 '16 at 15:48
  • So I am implementing chunksize on a sample file of 2.3 GB with 15041273 rows, and I am trying to read it using chunksize with the following code: `tp = pd.read_csv(r'E:\Utkarsh\Test\feed_3068_20160920_20160927.dat',header=None,encoding='latin1', sep='\t', iterator=True, chunksize=100000) df = pd.concat(tp, ignore_index=True) print(df)` but the df has the entire data loaded into it instead of the 100000 rows of the chunksize. Am I not implementing it right? – Uasthana Dec 01 '16 at 16:00
  • 190 GB file? Sounds like a job for a big data platform like Spark/Hadoop, not Pandas – gold_cy Dec 01 '16 at 16:07
  • Yeah, it will ultimately be processed in Redshift, but Redshift works better with a hundred files of 1 GB rather than one 100 GB file. Hence the efforts to break it – Uasthana Dec 01 '16 at 16:12
  • @EdChum Hey man I have it working with chunksize, thanks for nudging me in the right direction. Have a good one!! – Uasthana Dec 01 '16 at 16:16

1 Answer


Thanks to @EdChum I was able to solve the problem using `chunksize` with the following code:

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
i = 0
tp = pd.read_csv(filename, header=None, encoding='latin1', sep='\t',
                 iterator=True, chunksize=1000000)
for c in tp:
    # each chunk c is already a DataFrame; write it out as its own part
    c.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)
    i = i + 1
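
As a side note, a slightly tidier variant of the same loop uses `enumerate` instead of the manual counter; `iterator=True` is also redundant once `chunksize` is given, since `read_csv` already returns an iterator in that case:

import pandas as pd

for i, chunk in enumerate(pd.read_csv(filename, header=None, encoding='latin1',
                                      sep='\t', chunksize=1000000)):
    chunk.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)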
Uasthana