I am dealing with large (20-100 GB) tab-delimited text files which I can import correctly into pandas with the index_col=False
option. The Dask dataframe does not support the index_col
parameter. I have a workaround, but I am curious whether there is a smarter way to deal with this. Without the index_col=False
option, both pandas and dask read the columns shifted one place to the right - i.e. Col1 aligned to the data of Col2, etc. I believe this is because the data rows have a trailing tab but the header row does not.
Here is a sample of the file - ^I is a tab and $ a line ending.
Col1^ICol2^ICol3^ICol4^ICol5^ICol6^ICol7^ICol8$
0^I0^ICODE^I-0.2628^I3.041e-001^I.^I0^I2.1213^I$
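The mismatch is easy to verify by counting fields: because of the trailing tab, the data row splits into one more field than the header has names, which is what makes pandas promote the first data column to the index.

```python
header = "Col1\tCol2\tCol3\tCol4\tCol5\tCol6\tCol7\tCol8"
row = "0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t"  # note the trailing tab

print(len(header.split("\t")))  # 8 column names
print(len(row.split("\t")))     # 9 fields - the trailing tab yields an empty 9th field
```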
Update: This shows the behavior I want and do not want.
TESTDATA = """Index1\tIndex2\tCodeId\tCol4\tCol5\tCol6\tCol7\tCol8
0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t
"""
TESTFILE = "test.txt"
with open(TESTFILE, "w") as text_file:
    text_file.write(TESTDATA)
import pandas as pd
df = pd.read_csv(TESTFILE,sep='\t')
print("INCORRECT - Col8 is NaN")
print(df.head())
df = pd.read_csv(TESTFILE,sep='\t', index_col=False)
print("CORRECT - index and code correct and Col4-8 correct")
print(df.head())
Results:
INCORRECT - Col8 is NaN
Index1 Index2 CodeId Col4 Col5 Col6 Col7 Col8
0 0 CODE -0.2628 0.3041 . 0 2.1213 NaN
CORRECT - index and code correct and Col4-8 correct
Index1 Index2 CodeId Col4 Col5 Col6 Col7 Col8
0 0 0 CODE -0.2628 0.3041 . 0 2.1213
Dropping the first column just removes a column that I actually need.
import dask.dataframe as dd
ddf = dd.read_csv(TESTFILE,sep='\t')
print(ddf.compute())
ddf = dd.read_csv(TESTFILE,sep='\t')
res = ddf.drop(columns=['Index1'])
print(res.compute())
Results for both are not what I am looking for.
Also, thanks for your answer - my workaround is similar to the one you propose. It's just that, as I share dask with colleagues, I hope for minimal behavioral differences between pandas and dask dataframes; I am just checking whether I missed something.
Finally, here is my own workaround:
# manage the trailing-delimiter issue with pandas,
# which is more complicated with dask since
# index_col cannot be used
def ddf_cols_mod(file, sep):
    # read only the header with pandas (dask's read_csv does not accept nrows)
    cols = pd.read_csv(file, sep=sep, nrows=0).columns.tolist()
    cols.append('None')  # extra name for the empty field created by the trailing tab
    return cols
ddf = dd.read_csv(file,
sep='\t',
header=0,
names=ddf_cols_mod(file, '\t'))
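The header-padding trick can be sanity-checked with plain pandas before handing it to dask (a sketch; `_trailing` is just a placeholder name I chose for the phantom column, which ends up all-NaN and can be dropped):

```python
import io
import pandas as pd

data = ("Index1\tIndex2\tCodeId\tCol4\tCol5\tCol6\tCol7\tCol8\n"
        "0\t0\tCODE\t-0.2628\t3.041e-001\t.\t0\t2.1213\t\n")

# read the header row alone, then append a dummy name for the trailing tab's empty field
cols = pd.read_csv(io.StringIO(data), sep='\t', nrows=0).columns.tolist()
cols.append('_trailing')  # arbitrary placeholder name

# with 9 names matching 9 data fields, nothing gets promoted to the index
df = pd.read_csv(io.StringIO(data), sep='\t', header=0, names=cols)
df = df.drop(columns=['_trailing'])  # discard the all-NaN phantom column
print(df)
```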