0

I have a dataset that looks like this:

postcode house_number  col2  col3
xxx      xxx           xxx   xxx
xxx      xxx           xxx   xxx

I want to group the data by postcode and house_number, if two rows have the same postcode and house_number, it means they are the same property, then I want to construct a unique_id for each property (in other words, for a unique_id, the postcode / house_number must be the same, but the value for col2 / col3 might be different), something like:

unique_id postcode house_number  col2  col3
0         111      222           xxx   xxx 
0         111      222           xxx   xxx
1         xxx      xxx           xxx   xxx
.....

I tried new_df = ppd_df.groupby(['postcode','house_number']).reset_index() but it gave me error AttributeError: 'DataFrameGroupBy' object has no attribute 'reset_index', also I'm not sure how to construct the column unique_id. Can someone help please? Thanks.

wawawa
  • 2,835
  • 6
  • 44
  • 105

2 Answers2

1

Use groupby ngroup to "[n]umber each group from 0 to the number of groups - 1":

df['unique_id'] = df.groupby(['postcode', 'house_number']).ngroup()

df:

   postcode  house_number  col2  col3  unique_id
0       111           222     1     5          0
1       111           222     2     6          0
2       111           444     3     7          1
3       333           333     4     8          2

insert can be used to make it the first column if needed:

df.insert(0, 'unique_id', df.groupby(['postcode', 'house_number']).ngroup())

df:

   unique_id  postcode  house_number  col2  col3
0          0       111           222     1     5
1          0       111           222     2     6
2          1       111           444     3     7
3          2       333           333     4     8

*Note: sort=False can be used to ensure that groups are enumerated in the order in which they appear in the DataFrame:

df['unique_id'] = df.groupby(['postcode', 'house_number'], sort=False).ngroup()

Or

df.insert(0, 'unique_id',
          df.groupby(['postcode', 'house_number'], sort=False).ngroup())

Otherwise groups will be enumerated in the "order in which the groups would be seen when iterating over the groupby object, not the order they are first observed."


DataFrame and imports:

import pandas as pd

df = pd.DataFrame({
    'postcode': [111, 111, 111, 333],
    'house_number': [222, 222, 444, 333],
    'col2': [1, 2, 3, 4],
    'col3': [5, 6, 7, 8],
})
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
  • Hi I tried your approach, but the value for `unique_id` starts from -1 not 0 – wawawa Aug 15 '21 at 20:08
  • That is not possible. `ngroup` cannot generate negative numbers. – Henry Ecker Aug 15 '21 at 20:08
  • I forgot the handy ngroup that is actually fetching the grouper.group_info[0] value, thanks for the reminder ;) – mozway Aug 15 '21 at 20:10
  • Ah right, one of the columns I used to groupby contains multiple `NaN` values, I removed that column, looks like it works now – wawawa Aug 15 '21 at 20:13
  • 1
    Ah very interesting. If both postcode or house_number are `NaN` groups do appear to give back -1. Very weird. I don't think that is the intended behaviour... I assume since `NaN != NaN` it can't determine how to group it. – Henry Ecker Aug 15 '21 at 20:15
  • Thanks, this is really helpful! Just wondering if you know how to achieve this by using Apache Beam? I've got another question here: https://stackoverflow.com/questions/68795096/use-apache-beam-to-groupby-and-construct-a-new-column-python Any help would be appreciated. – wawawa Aug 15 '21 at 20:39
  • Sorry. Not an Apache guy. But glad I could help here :) – Henry Ecker Aug 15 '21 at 20:41
0

Create tuple from postcode and house_number columns and use pd.factorize to get a unique identifier:

df['unique_id'] = \
    pd.factorize(df[['postcode', 'house_number']].apply(tuple, axis=1))[0]
>>> df
   postcode  house_number col2 col3  unique_id
0       111           222  xxx  xxx          0
1       111           333  xxx  xxx          1
2       111           222  xxx  xxx          0
3       111           222  xxx  xxx          0
4       222           444  xxx  xxx          2
Corralien
  • 109,409
  • 8
  • 28
  • 52