Remove common word from headers in pandas data frame

Question

Let's say I have the following dataframe:

import pandas as pd

data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]

data_frame = pd.DataFrame(data, columns=['Student.first.name.word', 'Student.Current.Age.word', 'Student.Current.Profession.word'])

  Student.first.name.word  Student.Current.Age.word Student.Current.Profession.word
0                 Mallika                        23                         Student
1                    Yash                        25                           Tutor
2                     Abc                        14                           Clerk

How would I remove the common column header words "Student" and "word" so that I get the following dataframe?

  first.name  Current.Age Current.Profession
0    Mallika           23            Student
1       Yash           25              Tutor
2        Abc           14              Clerk

Is the list of common words unknown in advance, or is it just the words "Student" and "word"? — Yefet, May 11 '21 at 10:54
See the linked answer, it is a general way of detecting common prefixes, without having to hardcode the pattern. — Erfan, May 11 '21 at 10:54
@Erfan `os.path.commonprefix` would only get rid of prefix `Student` and not `word` — Yefet, May 11 '21 at 11:12

Mustafa Aydın · Answer 1 · 2021-05-11T11:36:54.310

3

You can remove those words and the dots from the column names with a regex and assign the result back:

data_frame.columns = data_frame.columns.str.replace(r"(Student|word|\.)", "", regex=True)

to get

>>> data_frame

      name  Age Profession
0  Mallika   23    Student
1     Yash   25      Tutor
2      Abc   14      Clerk

After the question was updated:

You can split - slice - join:

data_frame.columns = data_frame.columns.str.split(r"\.").str[1:-1].str.join(".")

i.e. split on the literal dot, drop the first and last elements, and join the rest back together with a dot

to get

  first.name  Current.Age Current.Profession
0    Mallika           23            Student
1       Yash           25              Tutor
2        Abc           14              Clerk
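
The three steps can also be inspected one at a time. This is only a minimal sketch, assuming the `data_frame` from the question; the intermediate names `parts`, `middle` and `new_columns` are made up for illustration:

```python
import pandas as pd

data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
data_frame = pd.DataFrame(
    data,
    columns=['Student.first.name.word',
             'Student.Current.Age.word',
             'Student.Current.Profession.word'],
)

# Step 1: split each header on the dot (a single-character
# pattern is treated as a literal string, not a regex)
parts = data_frame.columns.str.split(".")

# Step 2: drop the first and last piece of every header
middle = parts.str[1:-1]

# Step 3: join the remaining pieces back together with a dot
new_columns = middle.str.join(".")

data_frame.columns = new_columns
print(list(data_frame.columns))
```

Note that this removes the first and last dot-separated piece of every header regardless of what it is, so it only fits when every column really carries the unwanted word at both ends.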

edited May 11 '21 at 11:36

answered May 11 '21 at 10:49

Mustafa Aydın


Great I've edited my post to get a slightly different answer, I wonder if you could help. I now have multiple words but want to keep a full stop in between the remaining words – British Bioinformatician May 11 '21 at 11:23
@BritishBioinformatician i edited with a version for that case, hope it helps. – Mustafa Aydın May 11 '21 at 11:37
Thanks it works great, is there any way I could apply this to only certain columns, say the 2nd and 3rd but not the first? – British Bioinformatician May 13 '21 at 07:09
@BritishBioinformatician Might be a bit involved but if you slice, you can: `data_frame.columns = [data_frame.columns[0], *data_frame.columns[1:].str.split(r"\.").str[1:-1].str.join(".")]`. – Mustafa Aydın May 13 '21 at 07:58

Erfan · Answer 2 · 2021-05-11T11:30:09.843

Here is an extension of my answer on removing common prefixes. The benefit of this method is that it finds the prefix and suffix in a general way, so there is no need to hardcode any patterns.

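import os
import re

cols = data_frame.columns

common_prefix = os.path.commonprefix(cols.tolist())
common_suffix = os.path.commonprefix([col[::-1] for col in cols])[::-1]

# escape the literal text (so "." is not a wildcard) and anchor it to the ends
pattern = f"^{re.escape(common_prefix)}|{re.escape(common_suffix)}$"
data_frame.columns = cols.str.replace(pattern, "", regex=True)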

      name  Age Profession
0  Mallika   23    Student
1     Yash   25      Tutor
2      Abc   14      Clerk

Update: the same solution works unchanged for the updated question:

  first.name  Current.Age Current.Profession
0    Mallika           23            Student
1       Yash           25              Tutor
2        Abc           14              Clerk
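
For reuse, the idea could be wrapped in a small function. This is a rough sketch rather than a definitive implementation; `strip_common_affixes` is a hypothetical helper name, not something provided by pandas or `os`:

```python
import os
import re

import pandas as pd


def strip_common_affixes(names):
    """Remove the longest shared leading and trailing text from each name.

    Hypothetical helper for illustration. The comparison is done
    character by character, exactly like os.path.commonprefix.
    """
    names = list(names)
    prefix = os.path.commonprefix(names)
    suffix = os.path.commonprefix([n[::-1] for n in names])[::-1]
    # Escape so "." is literal, and anchor so only the two ends are touched
    pattern = re.compile(f"^{re.escape(prefix)}|{re.escape(suffix)}$")
    return [pattern.sub("", n) for n in names]


cols = ['Student.first.name.word',
        'Student.Current.Age.word',
        'Student.Current.Profession.word']
data_frame = pd.DataFrame([['Mallika', 23, 'Student']], columns=cols)

data_frame.columns = strip_common_affixes(data_frame.columns)
print(list(data_frame.columns))
```

One caveat: because `os.path.commonprefix` works on characters rather than on whole words, headers that happen to share letters next to the common word lose those letters as well. With only the first two columns, for example, the shared suffix is `e.word` (both `name` and `Age` end in `e`), so the result would be `first.nam` and `Current.Ag`.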

Thanks I like this solution however how would I select specific columns to use this on? say from the 2nd column to the last column? — British Bioinformatician, May 12 '21 at 11:56

Yefet · Answer 3 · 2021-05-11T11:33:11.077

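To remove every word that is common to all headers, and not just hardcoded ones, you can try:

from functools import reduce

df = data_frame
# split every header on "." and keep only the words present in all of them
common_words = [i.split(".") for i in df.columns.tolist()]
common_words = reduce(lambda x, y: set(x).intersection(y), common_words)
pat = r'\b(?:{})\b'.format('|'.join(common_words))

# drop the common words, then strip the leading and trailing dot left behind
df.columns = df.columns.str.replace(pat, "", regex=True).str[1:-1]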

Output:

print(df)

  first.name  Current.Age Current.Profession
0    Mallika           23            Student
1       Yash           25              Tutor
2        Abc           14              Clerk
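
The `.str[1:-1]` step relies on the common words sitting at the very start and end of every header. A possible alternative, sketched here on the assumption that the headers are always dot-separated, is to filter the words directly instead of going through a regex; `drop_common_words` is a hypothetical name, not part of pandas:

```python
import pandas as pd


def drop_common_words(columns, sep="."):
    """Return the headers with every word shared by all headers removed.

    Hypothetical helper for illustration, not part of pandas.
    """
    tokens = [str(col).split(sep) for col in columns]
    # Words that occur in every single header
    common = set(tokens[0]).intersection(*tokens[1:])
    return [sep.join(t for t in toks if t not in common) for toks in tokens]


data = [['Mallika', 23, 'Student'], ['Yash', 25, 'Tutor'], ['Abc', 14, 'Clerk']]
df = pd.DataFrame(
    data,
    columns=['Student.first.name.word',
             'Student.Current.Age.word',
             'Student.Current.Profession.word'],
)

df.columns = drop_common_words(df.columns)
print(list(df.columns))
```

`Current` is kept because it appears in only two of the three headers. With a single column every word counts as common, so this sketch would leave an empty header in that case.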
