
I have a large CSV file with lines that look like:

stringa,stringb
stringb,stringc
stringd,stringa

I need to convert it so that the ids are consecutively numbered from 0. In this case the following would work:

0,1
1,2
3,0

My current code looks like:

import csv

names = {}   # string -> integer id
counter = 0  # next unused id
with open('foo.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if row[0] in names:
            id1 = names[row[0]]
        else:
            names[row[0]] = counter
            id1 = counter
            counter += 1
        if row[1] in names:
            id2 = names[row[1]]
        else:
            names[row[1]] = counter
            id2 = counter
            counter += 1
        print '%d,%d' % (id1, id2)

Sadly, Python dicts use a lot of memory and my input is large.

What can I do when the input is too large for the dict to fit in memory?

I would also be interested in any better/faster way to solve this problem in general.
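One idea I had is to spill the mapping to disk with the standard-library dbm module instead of an in-memory dict; here is a rough Python 3 sketch (untested on the real data, and 'names_map.db' is just a placeholder path):

import csv
import dbm  # on-disk key-value store from the standard library

counter = 0
db = dbm.open('names_map.db', 'n')  # placeholder path for the on-disk map
with open('foo.csv', newline='') as csvfile:
    for row in csv.reader(csvfile):
        out = []
        for name in row:
            key = name.encode()
            if key in db:             # this string already has an id
                out.append(int(db[key]))
            else:                     # assign the next free id
                db[key] = str(counter)
                out.append(counter)
                counter += 1
        print('%d,%d' % (out[0], out[1]))
db.close()

This trades speed for memory, since lookups go through the disk cache rather than a RAM-resident dict.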

  • The general strategy of using a dictionary/hash-map is correct, though you are doing it a little awkwardly. When you say your input is too big to fit in memory, what are we talking here? Without some lookup table (the dict) or other authoritative reference, you won't be able to guarantee uniqueness or consecutivity. – Nathaniel Ford Sep 06 '16 at 19:20
  • @NathanielFord I would love to know a less awkward way to start with. – Simd Sep 06 '16 at 19:22
  • When you say, consecutively, from the example you gave are you wanting to have an output 0,1,2,3 etc? – Sreejith Menon Sep 06 '16 at 19:23
  • @SreejithMenon It just means that the numbers for the ids have to start at 0 and can't have any gaps in them. So, yes. – Simd Sep 06 '16 at 19:24

3 Answers

import pandas as pd

df = pd.DataFrame([['a', 'b'], ['b', 'c'], ['d', 'a']])

v = df.stack().unique()    # unique strings across both columns
v.sort()                   # sort in place so ids follow sorted order
f = pd.factorize(v)        # (codes, uniques)
m = pd.Series(f[0], f[1])  # map: string -> integer id

df.stack().map(m).unstack()

The last line returns:

   0  1
0  0  1
1  1  2
2  3  0
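Stacking both columns into a single Series builds one consistent mapping across the whole frame: the sort gives the ids a deterministic order, factorize pairs each unique string with an integer, and unstack restores the original two-column shape.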

piRSquared

UPDATE: here is a memory-saving solution, which converts all your strings to numerical categories:

In [13]: df
Out[13]:
        c1       c2
0  stringa  stringb
1  stringb  stringc
2  stringd  stringa
3  stringa  stringb
4  stringb  stringc
5  stringd  stringa
6  stringa  stringb
7  stringb  stringc
8  stringd  stringa

In [14]: x = (df.stack()
   ....:        .astype('category')
   ....:        .cat.rename_categories(np.arange(len(df.stack().unique())))
   ....:        .unstack())

In [15]: x
Out[15]:
  c1 c2
0  0  1
1  1  2
2  3  0
3  0  1
4  1  2
5  3  0
6  0  1
7  1  2
8  3  0

In [16]: x.dtypes
Out[16]:
c1    category
c2    category
dtype: object
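A variant of the same idea (my sketch, not part of the original answer) reads the integer codes straight off the categorical, avoiding the separate unique()/rename_categories step:

# sketch: .cat.codes are already 0..n-1 in sorted category order
s = df.stack().astype('category')
x = s.cat.codes.unstack()

This yields the same frame as above, but with plain integer columns instead of categories.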

OLD answer:

I think you can categorize your columns:

In [63]: big.head(15)
Out[63]:
         c1       c2
0   stringa  stringb
1   stringb  stringc
2   stringd  stringa
3   stringa  stringb
4   stringb  stringc
5   stringd  stringa
6   stringa  stringb
7   stringb  stringc
8   stringd  stringa
9   stringa  stringb
10  stringb  stringc
11  stringd  stringa
12  stringa  stringb
13  stringb  stringc
14  stringd  stringa

In [64]: big.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1    object
c2    object
dtypes: object(2)
memory usage: 457.8+ MB

So the big DF has 30M rows and its size is approx. 460 MiB...

Let's categorize it:

In [65]: cat = big.apply(lambda x: x.astype('category'))

In [66]: cat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
c1    category
c2    category
dtypes: category(2)
memory usage: 57.2 MB

It now takes only 57 MiB and looks exactly the same:

In [69]: cat.head(15)
Out[69]:
         c1       c2
0   stringa  stringb
1   stringb  stringc
2   stringd  stringa
3   stringa  stringb
4   stringb  stringc
5   stringd  stringa
6   stringa  stringb
7   stringb  stringc
8   stringd  stringa
9   stringa  stringb
10  stringb  stringc
11  stringd  stringa
12  stringa  stringb
13  stringb  stringc
14  stringd  stringa

Let's compare its size with a similar numeric DF:

In [67]: df = pd.DataFrame(np.random.randint(0,5,(30000000,2)), columns=list('ab'))

In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000000 entries, 0 to 29999999
Data columns (total 2 columns):
a    int32
b    int32
dtypes: int32(2)
memory usage: 228.9 MB
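(The categorical frame beats even the int32 one because, with only four distinct values, pandas stores each category code as a single int8 byte.)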
MaxU - stand with Ukraine

You can use factorize if you want an array of ids:

import numpy as np
import pandas as pd

df = pd.read_csv(data, header=None, prefix='Col_')  # data is the CSV path or buffer
print(pd.factorize(np.hstack(df.values)))

(array([0, 1, 1, 2, 3, 0]), array(['stringa', 'stringb', 'stringc', 'stringd'], dtype=object))

EDIT (as per the comment):

You can take the two arrays returned by factorize and use them to replace the strings across the entire dataframe, as shown:

num, letter = pd.factorize(np.hstack(df.values))

# replace each sorted unique string with the corresponding sorted code
df.replace(to_replace=sorted(list(set(letter))), value=sorted(list(set(num))))

   Col_0  Col_1
0      0      1
1      1      2
2      3      0
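If the replace call is slow on a big frame, a plain dict built from the factorize output should give the same kind of consecutive numbering, and the result can be written straight back out in the input's CSV format (a sketch; 'out.csv' is a placeholder):

num, letter = pd.factorize(np.hstack(df.values))
mapping = dict(zip(letter, range(len(letter))))  # string -> id, in order of first appearance

df.applymap(mapping.get).to_csv('out.csv', header=False, index=False)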
Nickil Maveli
  • I really need the output to be in the same format as the input just with the ids renamed. – Simd Sep 06 '16 at 19:32