I have 8GB CSV files and 8GB of RAM. Each file has two strings per row, in this form:
a,c
c,a
f,g
a,c
c,a
b,f
c,a
For smaller files, I remove duplicates, counting how many copies of each row there were in the first two columns, and then recode the strings to integers as follows:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv("file.txt", header=None, names=['ID_0', 'ID_1'])
# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# Fit a single LabelEncoder on the values from both columns so that
# they share one string-to-integer mapping.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.ravel())
# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
This gives:
   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1
which is exactly what I need for this toy example.
For the larger files I can't take these steps because I run out of RAM.
I can imagine combining unix sort with a bespoke Python solution that makes multiple passes over the data set.
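Roughly, what I have in mind is first sorting the file with unix sort so that duplicate rows sit next to each other, then streaming through it once to count them; a second pass over the much smaller set of unique pairs would build the integer codes. A minimal sketch of the counting pass (sorted.txt is a hypothetical pre-sorted copy of the file):

import csv
from itertools import groupby

# sorted.txt: a copy of the file already sorted with unix sort, so
# duplicate rows are adjacent and one streaming pass can count them
# without loading the whole file into RAM.
with open("sorted.txt", newline="") as f:
    rows = csv.reader(f)
    for pair, group in groupby(map(tuple, rows)):
        print(pair[0], pair[1], sum(1 for _ in group))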
But someone suggested dask might be suitable, and having read the docs I am still not clear whether it fits this problem. Can dask be used for this sort of out-of-core processing, or is there some other out-of-core pandas solution?
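For reference, this is my best guess from the dask.dataframe docs at what an out-of-core version would look like; I cannot tell whether the groupby really runs block-wise, and the file and column names are just carried over from my toy example:

import dask.dataframe as dd
from sklearn.preprocessing import LabelEncoder

# Read the CSV lazily in blocks rather than all at once.
df = dd.read_csv("file.txt", header=None, names=['ID_0', 'ID_1'])

# The hope is that the groupby/count is computed per block and combined,
# so only the (much smaller) table of unique pairs must fit in memory.
counts = df.groupby(['ID_0', 'ID_1']).size().compute()
counts = counts.rename('count').reset_index()

# The deduplicated result is an ordinary pandas DataFrame, so the
# original encoding step can be reused unchanged.
le = LabelEncoder()
le.fit(counts[['ID_0', 'ID_1']].values.ravel())
counts[['ID_0', 'ID_1']] = counts[['ID_0', 'ID_1']].apply(le.transform)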