
I have a large CSV file with lines that look like:

stringa,stringb
stringb,stringc
stringd,stringa

I need to convert it so that the ids are consecutively numbered from 0. In this case the following would work:

0,1
1,2
3,0

My current code looks like:

import csv

names = {}   # string -> integer id
counter = 0  # next unused id
with open('foo.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row[0] in names:
            id1 = names[row[0]]
        else:
            names[row[0]] = counter
            id1 = counter
            counter += 1
        if row[1] in names:
            id2 = names[row[1]]
        else:
            names[row[1]] = counter
            id2 = counter
            counter += 1
        print '%d,%d' % (id1, id2)

Sadly, Python dicts use a lot of memory and my input is large.

What can I do when the input is too large for the dict to fit in memory?

I would also be interested in any better/faster way to solve this problem in general.
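One idea I had is to spill the mapping to disk with the standard-library dbm module instead of an in-memory dict; here is a rough Python 3 sketch (untested on the real data, and 'names_map.db' is just a placeholder path):

import csv
import dbm  # on-disk key-value store from the standard library

counter = 0
db = dbm.open('names_map.db', 'n')  # placeholder path for the on-disk map
with open('foo.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        out = []
        for name in row:
            key = name.encode()
            if key in db:             # this string already has an id
                out.append(int(db[key]))
            else:                     # assign the next free id
                db[key] = str(counter)
                out.append(counter)
                counter += 1
        print('%d,%d' % (out[0], out[1]))
db.close()

This trades speed for memory, since lookups go through the disk cache rather than a RAM-resident dict.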

  • The general strategy of using a dictionary/hash-map is correct, though you are doing it a little awkwardly. When you say your input is too big to fit in memory, what are we talking here? Without some lookup table (the dict) or other authoritative reference, you won't be able to guarantee uniqueness or consecutivity. – Nathaniel Ford Sep 06 '16 at 19:20
  • @NathanielFord I would love to know a less awkward way to start with. – Simd Sep 06 '16 at 19:22
  • When you say, consecutively, from the example you gave are you wanting to have an output 0,1,2,3 etc? – Sreejith Menon Sep 06 '16 at 19:23
  • @SreejithMenon It just means that the numbers for the ids have to start at 0 and can't have any gaps in them. So, yes. – Simd Sep 06 '16 at 19:24

3 Answers

import pandas as pd

df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['d', 'a']])

v = df.stack().unique()    # unique strings across both columns
v.sort()                   # sort in place so ids follow sorted order
f = pd.factorize(v)        # (codes, uniques)
m = pd.Series(f[0], f[1])  # map: string -> integer id

df.stack().map(m).unstack()

The last line returns:

   0  1
0  0  1
1  1  2
2  3  0
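Stacking both columns into a single Series builds one consistent mapping across the whole frame: the sort gives the ids a deterministic order, factorize pairs each unique string with an integer, and unstack restores the original two-column shape.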

piRSquared

UPDATE: here is a memory-saving solution, which converts all your strings to numerical categories:

In [13]: df
Out[13]:
        c1       c2
0  stringa  stringb
1  stringb  stringc
2  stringd  stringa
3  stringa  stringb
4  stringb  stringc
5  stringd  stringa
6  stringa  stringb
7  stringb  stringc
8  stringd  stringa

In [14]: x = (df.stack()
   ....:        .astype('category')
   ....:        .cat.rename_categories(np.arange(len(df.stack().unique())))
   ....:        .unstack())

In [15]: x
Out[15]:
  c1 c2
0  0  1
1  1  2
2  3  0
3  0  1
4  1  2
5  3  0
6  0  1
7  1  2
8  3  0

In [16]: x.dtypes
Out[16]:
c1    category
c2    category
dtype: object
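A variant of the same idea (my sketch, not part of the original answer) reads the integer codes straight off the categorical, avoiding the separate unique()/rename_categories step:

# sketch: .cat.codes are already 0..n-1 in sorted category order
s = df.stack().astype('category')
x = s.cat.codes.unstack()

This yields the same frame as above, but with plain integer columns instead of categories.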

OLD answer:

I think you can categorize your columns:

In [63]: big.head(15)
Out[63]:
         c1       c2
0   stringa  stringb
1   stringb  stringc
2   stringd  stringa
3   stringa  stringb
4   stringb  stringc
5   stringd  stringa
6   stringa  stringb
7   stringb  stringc
8   stringd  stringa
9   stringa  stringb
10  stringb  stringc
11  stringd  stringa
12  stringa  stringb
13  stringb  stringc
14  stringd  stringa

In [64]: big.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1    object
c2    object
dtypes: object(2)
memory usage: 457.8+ MB

So the big DF has 30M rows and its size is approx. 460 MiB...

Let's categorize it:

In [65]: cat = big.apply(lambda x: x.astype('category'))

In [66]: cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1    category
c2    category
dtypes: category(2)
memory usage: 57.2 MB

It now takes only 57 MiB and looks exactly the same:

In [69]: cat.head(15)
Out[69]:
         c1       c2
0   stringa  stringb
1   stringb  stringc
2   stringd  stringa
3   stringa  stringb
4   stringb  stringc
5   stringd  stringa
6   stringa  stringb
7   stringb  stringc
8   stringd  stringa
9   stringa  stringb
10  stringb  stringc
11  stringd  stringa
12  stringa  stringb
13  stringb  stringc
14  stringd  stringa

Let's compare its size with a similar numeric DF:

In [67]: df = pd.DataFrame(np.random.randint(0,5,(30000000,2)), columns=list('ab'))

In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
a    int32
b    int32
dtypes: int32(2)
memory usage: 228.9 MB
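(The categorical frame beats even the int32 one because, with only four distinct values, pandas stores each category code as a single int8 byte.)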
MaxU - stand with Ukraine

You can use factorize if you want an array of ids:

import numpy as np
import pandas as pd

df = pd.read_csv(data, header=None, prefix='Col_')  # data is the CSV path or buffer
print(pd.factorize(np.hstack(df.values)))

(array([0, 1, 1, 2, 3, 0]), array(['stringa', 'stringb', 'stringc', 'stringd'], dtype=object))

EDIT (as per the comment):

You can take the two arrays returned by factorize and use them to replace the strings across the entire dataframe, as shown:

num, letter = pd.factorize(np.hstack(df.values))

# replace each sorted unique string with the corresponding sorted code
df.replace(to_replace=sorted(list(set(letter))), value=sorted(list(set(num))))

   Col_0  Col_1
0      0      1
1      1      2
2      3      0
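If the replace call is slow on a big frame, a plain dict built from the factorize output should give the same kind of consecutive numbering, and the result can be written straight back out in the input's CSV format (a sketch; 'out.csv' is a placeholder):

num, letter = pd.factorize(np.hstack(df.values))
mapping = dict(zip(letter, range(len(letter))))  # string -> id, in order of first appearance

df.applymap(mapping.get).to_csv('out.csv', header=False, index=False)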
Nickil Maveli
  • I really need the output to be in the same format as the input just with the ids renamed. – Simd Sep 06 '16 at 19:32