
I want to use https://github.com/datamade/dedupe to deduplicate some records in python. Looking at their examples

data_d = {}
for row in data:
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
    row_id = int(row['id'])
    data_d[row_id] = dict(clean_row)

the dictionary consumes quite a lot of memory compared to, e.g., a dictionary created by pandas out of a pd.DataFrame, or even a plain pd.DataFrame itself.

If this format is required, how can I efficiently convert a pd.DataFrame to such a dictionary?

edit

Example of what pandas generates:

{'column1': {0: 1389225600000000000,
  1: 1388707200000000000,
  2: 1388707200000000000,
  3: 1389657600000000000,....

Example of what dedupe expects:

{'1': {'column1': 1389225600000000000, 'column2': "ddd"},
 '2': {'column1': 1111, 'column2': "ddd"}, ...}
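For concreteness, the column-oriented structure above is what the default df.to_dict() produces (a small hypothetical frame is used here for illustration):

```python
import pandas as pd

# hypothetical frame; the real data has timestamp-like integers
df = pd.DataFrame({'column1': [1389225600000000000, 1111],
                   'column2': ['ddd', 'ddd']})

# the default orientation is by column, not by row
print(df.to_dict())
# {'column1': {0: 1389225600000000000, 1: 1111},
#  'column2': {0: 'ddd', 1: 'ddd'}}
```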
Georg Heiler

3 Answers


It appears that df.to_dict(orient='index') will produce the representation you are looking for:

import pandas

data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']

df = pandas.DataFrame(data, columns=columns)

df.to_dict(orient='index')

results in

{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
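If you specifically need the string keys shown in the question, one option (a sketch, assuming the row IDs live in the DataFrame's index) is to cast the index to str before converting:

```python
import pandas as pd

# hypothetical data mirroring the question's example
df = pd.DataFrame({'column1': [1389225600000000000, 1111],
                   'column2': ['ddd', 'ddd']})

# cast the integer index to strings so the keys match dedupe's example
df.index = df.index.astype(str)
data_d = df.to_dict(orient='index')
print(data_d)
# {'0': {'column1': 1389225600000000000, 'column2': 'ddd'},
#  '1': {'column1': 1111, 'column2': 'ddd'}}
```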
chthonicdaemon

You can try something like this:

import pandas as pd

df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [6,7,8,9,10]})

   A   B
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10

print(df.T.to_dict())
{0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}

This is the same output as in @chthonicdaemon's answer, so his answer is probably better. I am using pandas.DataFrame.T to transpose the index and columns.

Joe T. Boka

A Python dictionary is not required; you just need an object that allows indexing by column name, i.e. row['col_name'].

So, assuming data is a pandas DataFrame, you should just be able to do something like:

data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row

That said, the memory overhead of Python dicts is not going to be where you hit memory bottlenecks in dedupe.
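To illustrate, a minimal sketch of the iterrows approach with hypothetical data: each stored value is a pandas Series, which still supports indexing by column name the way dedupe uses it.

```python
import pandas as pd

# hypothetical frame with the IDs as the index
data = pd.DataFrame({'col_name': ['foo', 'bar']}, index=[1, 2])

data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row

# each value is a pandas Series, which supports row['col_name'] indexing
print(data_d[1]['col_name'])  # foo
```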

fgregg