Replace values in a pandas dataframe

Question

I have a pandas dataframe that is generated from events. Each event has a unique ID and generates repeated rows in the dataframe.

The problem is that some of these repeated rows contain random values that differ from each other.

I need to replace the values in the columns (Name, Age, Occupation) with the most frequent value per event_id.

Also, the Salary column has a trailing hyphen that needs to be removed as well.

Thanks in advance

Input data:

print(df)

ID  event_id   Month    Name    Age Occupation Salary  
1   1_a        Jan      andrew  23             13414.12
2   1_a        Feb              NaN teacher    13414.12
3   1_a        Mar       ___                   13414.12
4   1_a        Apr      andrew  23  teacher    13414.12
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42_
7   1_b        Feb      #$%6        scientist  1975.42
8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

Desired output:

print(df)

ID  event_id   Month    Name    Age Occupation Salary
1   1_a        Jan      andrew  24  principle  25000
2   1_a        Feb      andrew  24  principle  25000
3   1_a        Mar      andrew  24  principle  25000
4   1_a        Apr      andrew  24  principle  25000
5   1_a        May      andrew  24  principle  25000
6   1_b        Jan      Ash     45  scientist  1975.42
7   1_b        Feb      Ash     45  scientist  1975.42
8   1_b        Mar      Ash     45  scientist  1975.42
9   1_b        Apr      Ash     45  scientist  1975.42

Can you share the raw data where you are getting the dataframe? — Cesar Lopes, Jun 12 '22 at 18:45
@CesarLopes didn't understand brother, raw data is mentioned as Input, — Hemant Sain, Jun 12 '22 at 18:49
I was mentioning the whole logic until you got this final df, sorry, could have explained better — Cesar Lopes, Jun 12 '22 at 18:50
@mozway junk as in absurd values, I want to replace them by the most frequent values present in their respective event_id — Hemant Sain, Jun 12 '22 at 19:03
@mozway anything which does not match the mode of the values present in the column, based on event_id — Hemant Sain, Jun 12 '22 at 19:36
Well, what is doable is getting the most frequent one. Other than that, you would need some complex regex to differentiate a name from a random string, and the same goes for the other column values. — Cesar Lopes, Jun 12 '22 at 20:12

Cesar Lopes · Accepted Answer · 2022-06-12T19:45:13.600

First I had to create the DataFrame. Unfortunately, I couldn't split a raw string that contains blank cells, so I filled them with placeholder values; with your actual dataframe that shouldn't be a problem.

Ok, now the logic:

The code collects the unique event IDs, then iterates over the columns for each event. With collections.Counter I get a dictionary that counts how often each value occurs in the column filtered by event, and I overwrite every row of that event with the most frequent value.

This only fails if a junk value is repeated more often than the good value. For example, if a column filtered by event has 30 different junk values but the good value appears twice, the good value is still the one used as the replacement.

If the column has 30 junk values and the good value appears only once, one of the junk values may become the replacement.
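The caveat above can be checked with a minimal sketch. The value lists are hypothetical, not taken from the real data, and the tie behaviour shown relies on Counter keeping insertion order (Python 3.7+) and on max() returning the first maximal key:

```python
import collections

# Hypothetical column values for one event: every value appears exactly once.
values = ["#$%6", "Ash", "^#3a"]
counter = collections.Counter(values)

# max() returns the first key with the highest count, so on a tie the
# earliest value in the column wins, whether or not it is junk.
tie_winner = max(counter, key=counter.get)

# As soon as the good value repeats, it beats any number of distinct junk values.
values_with_repeat = ["#$%6", "Ash", "^#3a", "Ash"]
counter_repeat = collections.Counter(values_with_repeat)
repeat_winner = max(counter_repeat, key=counter_repeat.get)

print(tie_winner, repeat_winner)
```

So when every value is unique, the replacement is whatever happens to come first in the event's rows.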

Here is the code:

import pandas as pd
import collections

data =   """ID  event_id   Month    Name    Age Occupation Salary  
            1   1_a        Jan      andrew  23     -       13414.12
            2   1_a        Feb        -     NA  teacher    13414.12
            3   1_a        Mar       ___     -     z       13414.12
            4   1_a        Apr      andrew  23  teacher    13414.12
            5   1_a        May      andrew  24  principle  25000
            6   1_b        Jan      Ash     45  scientist  1975.42_
            7   1_b        Feb      #$%6     -  scientist  1975.42
            8   1_b        Mar      Ash     45  ^#3a2g4    1975.42
            9   1_b        Apr      Ash     45  scientist  1975.42"""

data = data.split('\n')[1:]

for i in range(len(data)):
    data[i] = data[i].split()

df = pd.DataFrame(data, columns=['ID', 'event_id','Month', 'Name', 'Age', 'Occupation', 'Salary'])

print(df)
print('\n')
# Unique event IDs, in order of first appearance.
events = df['event_id'].unique()
columns = ['Name', 'Age', 'Occupation', 'Salary']
for event in events:
    # Boolean mask selecting the rows of the current event.
    mask = df['event_id'] == event
    print(df.loc[mask])
    for column in columns:
        # Count how often each value occurs in this column for this event.
        counter = collections.Counter(df.loc[mask, column])
        print(df.loc[mask, column])
        print()
        # Most frequent value; on a tie, the first one encountered wins.
        new_value = max(counter, key=counter.get)
        # Assign with .loc to avoid chained indexing (df[column][i] = ...).
        df.loc[mask, column] = new_value

print(df)

Output:

  ID event_id Month    Name Age Occupation    Salary
0  1      1_a   Jan  andrew  23    teacher  13414.12
1  2      1_a   Feb  andrew  23    teacher  13414.12
2  3      1_a   Mar  andrew  23    teacher  13414.12
3  4      1_a   Apr  andrew  23    teacher  13414.12
4  5      1_a   May  andrew  23    teacher  13414.12
5  6      1_b   Jan     Ash  45  scientist   1975.42
6  7      1_b   Feb     Ash  45  scientist   1975.42
7  8      1_b   Mar     Ash  45  scientist   1975.42
8  9      1_b   Apr     Ash  45  scientist   1975.42

Process finished with exit code 0
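
The same idea can probably be written without explicit loops over the rows. The following is only a sketch of an alternative, not the code used above: it runs on a small hypothetical frame, the helper name most_frequent is made up for illustration, and it also strips a trailing underscore or hyphen from Salary before counting.

```python
import pandas as pd

# Small hypothetical frame with the same kind of columns as the example above.
df = pd.DataFrame({
    "event_id": ["1_a", "1_a", "1_a", "1_b", "1_b", "1_b"],
    "Name": ["andrew", "___", "andrew", "Ash", "#$%6", "Ash"],
    "Salary": ["13414.12", "13414.12", "25000", "1975.42_", "1975.42", "1975.42"],
})

# Remove trailing underscores or hyphens from Salary before counting.
df["Salary"] = df["Salary"].str.rstrip("_-")


def most_frequent(series):
    """Return the most frequent value of a group (hypothetical helper)."""
    # mode() returns the most common value(s), sorted; take the first one.
    return series.mode().iloc[0]


for column in ["Name", "Salary"]:
    # transform() broadcasts the per-event result back to every row of the event.
    df[column] = df.groupby("event_id")[column].transform(most_frequent)

print(df)
```

As with the loop version, a junk value still wins if it is the most frequent one in its event.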
