
I am dealing with large (20-100 GB) tab-delimited text files which I can import correctly into pandas with the index_col=False option. Dask's read_csv does not support the index_col parameter. I have a workaround, but I am curious whether there is a smarter way to deal with this. Without index_col=False, pandas and dask read the columns shifted one position to the right, i.e. the Col1 header ends up aligned with the data in the second column, and so on. I believe this is because the data rows have a trailing tab while the header row does not.

Here is a sample of the file - ^I is a tab and $ is a line ending:

Col1^ICol2^ICol3^ICol4^ICol5^ICol6^ICol7^ICol8$
0^I0^ICODE^I-0.2628^I3.041e-001^I.^I0^I2.1213^I$
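
For reference, the mismatch can be made concrete by splitting the header and a data row on tabs - a small check that just encodes the two sample lines above as Python strings:

# The header splits into 8 fields; the data row splits into 9
# because of the trailing tab (its last field is empty).
header = "Col1\tCol2\tCol3\tCol4\tCol5\tCol6\tCol7\tCol8"
row = "0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t"
print(len(header.split("\t")))  # 8
print(len(row.split("\t")))     # 9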

Update: The following shows both the behavior I want and the behavior I do not want.

TESTDATA = """Index1\tIndex2\tCodeId\tCol4\tCol5\tCol6\tCol7\tCol8
0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t
"""
TESTFILE = "test.txt"
with open(TESTFILE, "w") as text_file:
    text_file.write(TESTDATA)
import pandas as pd

df = pd.read_csv(TESTFILE,sep='\t')
print("INCORRECT - Col8 is NaN")
print(df.head())
df = pd.read_csv(TESTFILE,sep='\t', index_col=False)
print("CORRECT - index and code correct and Col4-8 correct")
print(df.head())

Results:

INCORRECT - Col8 is NaN
   Index1 Index2  CodeId    Col4 Col5  Col6    Col7  Col8
0       0   CODE -0.2628  0.3041    .     0  2.1213   NaN
CORRECT - index and code correct and Col4-8 correct
   Index1  Index2 CodeId    Col4    Col5 Col6  Col7    Col8
0       0       0   CODE -0.2628  0.3041    .     0  2.1213

Dropping the first column just removes a column that I actually need:

import dask.dataframe as dd

# Shifted, just like pandas without index_col=False
ddf = dd.read_csv(TESTFILE, sep='\t')
print(ddf.compute())

# Dropping the first column does not fix the shift either
res = ddf.drop(columns=['Index1'])
print(res.compute())

Neither result is what I am looking for.

Also, thanks for your answer; my workaround is similar to the one you propose. It's just that, as I share dask with colleagues, I would like the behavior differences between pandas and dask dataframes to be minimal - I am just checking whether I am missing something.

Finally, here is my own workaround:

# Manage the trailing-delimiter issue, which pandas handles
# via index_col=False but which is more complicated with dask
# since index_col cannot be used.
def ddf_cols_mod(file, sep):
    # Read just the header with pandas (dask's read_csv does not
    # accept nrows) and append a dummy name so the trailing
    # delimiter gets its own column instead of shifting the rest.
    df = pd.read_csv(file, sep=sep, nrows=0)
    cols = list(df.columns)
    cols.append('None')
    return cols

ddf = dd.read_csv(file,
                  sep='\t',
                  header=0,
                  names=ddf_cols_mod(file, '\t'))
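
With the dummy name appended, the real columns line up and the spurious trailing column (named 'None' above) comes in as all NaN. As a usage sketch - assuming file points at one of the data files, as in the snippet above - it can simply be dropped afterwards:

ddf = ddf.drop(columns=['None'])
print(ddf.head())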
