0

I have a large csv with three strings per row in this form:

a,c,d
c,a,e
f,g,f
a,c,b
c,a,d
b,f,s
c,a,c

I read in the first two columns recode the strings to integers and then remove duplicates counting how many copies of each row there were as follows:

import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
letters = set(df.values.flat)
df.replace(to_replace=letters, value=range(len(letters)), inplace=True)
df1 = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
print df1

This gives:

   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1

which is exactly what I need.

However as my data is large I would like to make two improvements.

  • How can I do the groupby and then recode instead of the other way round? The problem is that I can't do df1[['ID_0','ID_0']].replace(to_replace=letters, value=range(len(letters)), inplace = True). This gives the error

    "A value is trying to be set on a copy of a slice from a DataFrame"

  • How can I avoid creating df1? That is do the whole thing inplace.
Sociopath
  • 13,068
  • 19
  • 47
  • 75
Simd
  • 19,447
  • 42
  • 136
  • 271

2 Answers2

3

I like to use sklearn.preprocessing.LabelEncoder to do the letter to digit conversion:

from sklearn.preprocessing import LabelEncoder

# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)

The resulting output:

   ID_0  ID_1  count
0     0     2      2
1     1     3      1
2     2     0      3
3     3     4      1

If you want to convert back to letters at a later point in time, you can use le.inverse_transform:

df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.inverse_transform)

Which maps back as expected:

  ID_0 ID_1  count
0    a    c      2
1    b    f      1
2    c    a      3
3    f    g      1

If you just want to know which digit corresponds to which letter, you can look at the le.classes_ attribute. This will give you an array of letters, which is indexed by the digit it encodes to:

le.classes_ 

['a' 'b' 'c' 'f' 'g']

For a more visual representation, you can cast as a Series:

pd.Series(le.classes_)

0    a
1    b
2    c
3    f
4    g

Timings

Using a larger version of the sample data and the following setup:

df2 = pd.concat([df]*10**5, ignore_index=True)

def root(df):
    df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()
    le = LabelEncoder()
    le.fit(df[['ID_0', 'ID_1']].values.flat)
    df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
    return df

def pir2(df):
    unq = np.unique(df)
    mapping = pd.Series(np.arange(unq.size), unq)
    return df.stack().map(mapping).unstack() \
      .groupby(df.columns.tolist()).size().reset_index(name='count')

I get the following timings:

%timeit root(df2)
10 loops, best of 3: 101 ms per loop

%timeit pir2(df2)
1 loops, best of 3: 1.69 s per loop
root
  • 32,715
  • 6
  • 74
  • 87
2

New Answer

unq = np.unique(df)
mapping = pd.Series(np.arange(unq.size), unq)

df.stack().map(mapping).unstack() \
  .groupby(df.columns.tolist()).size().reset_index(name='count')

enter image description here

Old Answer

df.stack().rank(method='dense').astype(int).unstack() \
  .groupby(df.columns.tolist()).size().reset_index(name='count')

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I need to store the mapping from the original strings to integers. This is given explicitly in the method in my question. Can you get it from yours as well? – Simd Sep 09 '16 at 20:02
  • Small thing: I need the ids to start at 0, not 1. – Simd Sep 09 '16 at 20:06