
I am trying to break a huge data file into smaller parts. I am using the following script:

 df = pd.read_csv(file_name, header=None, encoding='latin1', sep='\t', nrows=100000, skiprows=100000)

but I see that the `skiprows` argument skips around 200000 rows instead of 100000. Can anyone tell me why this is happening?
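
For reference, my expectation is that `skiprows=100000` skips only the first 100000 lines of the file and `nrows=100000` then reads the next 100000, so producing each slice should look roughly like this (a minimal sketch of the intended loop; `file_name` and `file_destination` are placeholders for the real paths):

import pandas as pd

chunk = 100000
for i in range(5):  # first five slices, as an illustration
    part = pd.read_csv(file_name, header=None, encoding='latin1', sep='\t',
                       nrows=chunk, skiprows=i * chunk)
    part.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)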

Uasthana
  • why not just specify a `chunksize=100000` which will return you a slice of the df, so you can then split the df for each chunk? – EdChum Dec 01 '16 at 15:43
  • The actual file is 190 GB; I won't be able to read it all into memory at once – Uasthana Dec 01 '16 at 15:46
  • You don't need to; with `chunksize` this will just read the next chunksize rows, and you can then do whatever you want with that chunk – EdChum Dec 01 '16 at 15:48
  • So I am implementing chunksize on a sample file of 2.3 GB with 15041273 rows, and I am trying to read it using chunksize with the following code: `tp = pd.read_csv(r'E:\Utkarsh\Test\feed_3068_20160920_20160927.dat',header=None,encoding='latin1', sep='\t', iterator=True, chunksize=100000) df = pd.concat(tp, ignore_index=True) print(df)` but the df has the entire data loaded into it instead of the 100000 rows of the chunksize. Am I not implementing it right? – Uasthana Dec 01 '16 at 16:00
  • 190 GB file? Sounds like a job for a big data platform like Spark/Hadoop, not Pandas – gold_cy Dec 01 '16 at 16:07
  • Yeah, it will ultimately be processed in Redshift, but Redshift works better with a hundred files of 1 GB rather than one 100 GB file. Hence the efforts to break it – Uasthana Dec 01 '16 at 16:12
  • @EdChum Hey man I have it working with chunksize, thanks for nudging me in the right direction. Have a good one!! – Uasthana Dec 01 '16 at 16:16

1 Answer


Thanks to @EdChum I was able to solve the problem using `chunksize` with the following code:

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file into memory at once.
i = 0
tp = pd.read_csv(filename, header=None, encoding='latin1', sep='\t',
                 iterator=True, chunksize=1000000)
for c in tp:
    # each chunk c is already a DataFrame; write it out as its own part
    c.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)
    i = i + 1
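
As a side note, a slightly tidier variant of the same loop uses `enumerate` instead of the manual counter; `iterator=True` is also redundant once `chunksize` is given, since `read_csv` already returns an iterator in that case:

import pandas as pd

for i, chunk in enumerate(pd.read_csv(filename, header=None, encoding='latin1',
                                      sep='\t', chunksize=1000000)):
    chunk.to_csv(file_destination + str(i) + 'test.csv', index=False, header=False)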
Uasthana