2

When importing data from a financial system, the columns have this unnecessary common prefix. This prefix changes from table to table.

Question: How can I automatically find the common prefix and remove it from my column names, keeping in mind that this prefix changes?

Example:

import pandas as pd
import numpy as np

cols = ["abc_W1", "abc_X2", "abc_Y3", "abc_Z4"]
data = np.random.randint(low=1, high=10, size=(4,4))
df = pd.DataFrame(data=data, columns=cols)

   abc_W1  abc_X2  abc_Y3  abc_Z4
0       2       2       3       5
1       8       4       2       7
2       1       5       1       4
3       8       7       4       1

In this case the common prefix is _abc

Also, what would be the solution when we have both common prefixes and suffixes, like below:

   abc_W1_foo  abc_X2_foo  abc_Y3_foo  abc_Z4_foo
0           1           5           3           2
1           2           1           7           1
2           2           4           7           1
3           7           9           9           3

The expected output would be:

   W1  X2  Y3  Z4
0   2   2   3   5
1   8   4   2   7
2   1   5   1   4
3   8   7   4   1
Erfan
  • 40,971
  • 8
  • 66
  • 78

1 Answers1

3

Using os.path.commonprefix with Series.str.replace

We can use os.path.commonprefix to automatically find the common prefix and use Series.str.replace to replace these by an empty string:

common_prefix = os.path.commonprefix(df.columns.tolist())
df.columns = df.columns.str.replace(common_prefix, "")

Output

   W1  X2  Y3  Z4
0   2   2   3   5
1   8   4   2   7
2   1   5   1   4
3   8   7   4   1

To remove both common prefixes and common suffixes, we can leverage os.path.commonprefix again, only this time use the column names reversed:

cols = ["abc_W1_foo", "abc_X2_foo", "abc_Y3_foo", "abc_Z4_foo"]
data = np.random.randint(low=1, high=10, size=(4,4))
df = pd.DataFrame(data=data, columns=cols)

   abc_W1_foo  abc_X2_foo  abc_Y3_foo  abc_Z4_foo
0           1           5           3           2
1           2           1           7           1
2           2           4           7           1
3           7           9           9           3
cols = df.columns

common_prefix = os.path.commonprefix(cols.tolist())
common_suffix = os.path.commonprefix([col[::-1] for col in cols])[::-1]

df.columns = cols.str.replace(f"{common_prefix}|{common_suffix}", "", regex=True)

   W1  X2  Y3  Z4
0   9   2   1   5
1   8   6   4   5
2   5   1   3   2
3   4   4   5   9

Related question and answers.

Erfan
  • 40,971
  • 8
  • 66
  • 78