I have a large CSV file with three strings per row, in this form:
a,c,d
c,a,e
f,g,f
a,c,b
c,a,d
b,f,s
c,a,c
I read in the first two columns, recode the strings to integers, and then remove duplicates, counting how many copies of each row there were, as follows:
import pandas as pd

# the file has no header row, so name the two columns explicitly
df = pd.read_csv("test.csv", usecols=[0, 1], names=["ID_0", "ID_1"], header=None)
# recode each distinct string to an integer
letters = list(set(df.values.flat))
df.replace(to_replace=letters, value=list(range(len(letters))), inplace=True)
# collapse duplicate rows, counting the copies of each
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
print(df1)
This gives:
   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1
which is exactly what I need.
However, as my data is large, I would like to make two improvements.
- How can I do the groupby first and then recode, instead of the other way round? The problem is that
df1[['ID_0', 'ID_1']].replace(to_replace=letters, value=list(range(len(letters))), inplace=True)
gives the error "A value is trying to be set on a copy of a slice from a DataFrame". (A sketch of what I have in mind follows this list.)
- How can I avoid creating df1? That is, do the whole thing in place (see the second sketch below).
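For the first point, here is a minimal sketch of the sort of thing I have in mind, assuming the warning comes from calling replace with inplace=True on a column slice; assigning the result back instead seems to sidestep it, though I don't know if this is the idiomatic way:
# group and count first, while the values are still strings
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
# recode only the (much smaller) deduplicated frame;
# assigning back avoids replacing on a slice in place
letters = list(set(df1[['ID_0', 'ID_1']].values.flat))
df1[['ID_0', 'ID_1']] = df1[['ID_0', 'ID_1']].replace(
    to_replace=letters, value=list(range(len(letters))))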
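For the second point, as far as I understand it (and I may be wrong here), groupby cannot literally operate in place, so the closest I can sketch is rebinding the result to the same name so that the original frame can be garbage collected:
# rebind instead of creating df1; pandas builds a new frame either way,
# but the original df is freed once the name is reassigned
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()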