My dataset has two columns named location-id and location-name. Each location name is given a unique location id.

location-id   location-name
 234            SL
 456            IN
 234            SL
 123            EN

As each location has a unique id, the number of unique values in the location-id column and in the location-name column should be equal. But there seems to be an error in the df: location-id has 1863 unique values while location-name has only 1800.
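
This is how the counts can be checked (a minimal sketch, assuming the DataFrame is named df):

# number of distinct ids vs. distinct names; they should match
print(df['location-id'].nunique())    # 1863
print(df['location-name'].nunique())  # 1800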

Is there a way to spot the entries where the error was made?

I thought of one way: iterate through these two columns and build a dictionary of key-value pairs.

dict_a = {234:"SL", 456:"IN", 123:"EN"}

For each sample, get the location-id and check whether it is already a key in the dict. If it is, check the value stored for it. If that value == location-name for the current sample, go to the next sample. If the stored value and the location-name differ, add the new name as another value under the same key. After iterating through the complete dataset, collect the keys with more than one value to spot the errors in the dataset.
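
A minimal sketch of that idea, assuming the data is in a pandas DataFrame named df (it keeps a set of names per key, since dict keys themselves must be unique):

from collections import defaultdict

names_per_id = defaultdict(set)

# collect every name that occurs for each id
for loc_id, loc_name in zip(df['location-id'], df['location-name']):
    names_per_id[loc_id].add(loc_name)

# ids that collected more than one name are the suspect entries
errors = {k: v for k, v in names_per_id.items() if len(v) > 1}
print(errors)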

Is there a more efficient way to do this?

  • "add that new name as another value to the same key" python dictionary keys _must_ be unique. You cannot have multiple names with the same key. In what form do you expect your output to be when there are duplicate keys? – Henry Ecker Jul 02 '21 at 20:21
  • I guess you are using pandas; you could calculate the row number for each group (location, location_id) and filter those with a row number greater than 1. The indices of those rows are indeed the errors in the dataset – Gonzalo Donoso Jul 02 '21 at 20:34

2 Answers


Use boolean masks to find where several location names map to the same location id, and vice versa.

Input data:

>>> df
   location-id location-name
0          234            SL
1          456            IN  # IN & BE
2          234            SL
3          123            EN
4          456            BE  # IN & BE
5          789            SP  # 789 & 987
6          987            SP  # 789 & 987

is_non_unique = lambda x: len(x.unique()) > 1

# m1: rows whose location-id maps to more than one name
m1 = df.groupby('location-id')['location-name'].transform(is_non_unique)
# m2: rows whose location-name maps to more than one id
m2 = df.groupby('location-name')['location-id'].transform(is_non_unique)

# a row is suspect if either mask flags it
err = df[m1 | m2]

Output result:

>>> err
   location-id location-name
1          456            IN
4          456            BE
5          789            SP
6          987            SP
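
As a side note, the helper lambda can be replaced by pandas' built-in nunique aggregation; this variant (not part of the original answer) is equivalent:

m1 = df.groupby('location-id')['location-name'].transform('nunique') > 1
m2 = df.groupby('location-name')['location-id'].transform('nunique') > 1
err = df[m1 | m2]
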
– Corralien

Use groupby and then filter on those location names, something like below:

# count distinct ids per location name; more than one means a conflict
t = df.groupby('location-name')['location-id'].nunique().reset_index()
df_filtr = df[df['location-name'].isin(t[t['location-id'] > 1]['location-name'])]

You can sort df_filtr by location name to inspect the conflicting entries side by side and find the correct one.
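
Note that this only catches location names mapped to several ids; the symmetric check, ids mapped to several names, is analogous (a sketch mirroring the code above):

t2 = df.groupby('location-id')['location-name'].nunique().reset_index()
df_filtr2 = df[df['location-id'].isin(t2[t2['location-name'] > 1]['location-id'])]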

– Vishwa