How to handle duplicated or blank columns and add number in duplicated ones in python dataframe?

Question

Dataframe has 200+columns with repeated dates and empty columns

         weight height total  weight height total
          2019   2019   2019  2020   2020   2020
Species  jan1   jan1    ''    jan1   jan1    ''
cat      1.0    2.0     3     4.0    3.0     7
dog      3.0    4.0     9     4.0    5.0     9

I tried:

[x for x in df.columns if df.columns.count(x) >1]

#error: 'MultiIndex' object has no attribute 'count'

df.stack(dropna=False)

#error: cannot reindex from a duplicate axis

Objective: To add any string value like 'a.jan1' repeated column and blank columns renamed as a, b.... so on.

Output required in tabular form for further processing and storage ..


 class    year    Month    cat    dog   
 weight   2019    jan1     1       3    
 height   2019    jan1     2       4
 weight   2020    jan1     4       4
 height   2020    jan1     3       5

As I see your columns are a multindex object. That is, a list of unique tuples, of three elements on this concrete case. I don't quite understand why you want to modify the third element of the tuples, given the fact that the tuples are unique. Anyway, I think you will find easier to work with this transformation of your columns: arrays= [list(t) for t in zip(*df.columns)] Then the list of the third elements of each tuple is arrays[2] You can do the transformations that you want on this list, reconstruct the MultiIndex object and assign it back to df.columns — migjimen, Dec 29 '21 at 18:15

Laurent · Accepted Answer · 2022-01-01T19:23:02.577

So, given the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        ("weight", 2019, "jan1"): {"cat": 1, "dog": 3},
        ("height", 2019, "jan1"): {"cat": 2, "dog": 4},
        ("total", 2019, ""): {"cat": 3, "dog": 9},
        ("weight", 2020, "jan1"): {"cat": 4, "dog": 4},
        ("height", 2020, "jan1"): {"cat": 3, "dog": 5},
        ("total", 2020, ""): {"cat": 7, "dog": 9},
    }
)
print(df)
# Outputs
    weight height total weight height total
      2019   2019  2019   2020   2020  2020
      jan1   jan1         jan1   jan1      
cat      1      2     3      4      3     7
dog      3      4     9      4      5     9

You could try this:

# UNpivot the dataframe
new_df = df.reset_index().melt(
    id_vars=[("index", "", "")],
    value_vars=[
        ("weight", 2019, "jan1"),
        ("height", 2019, "jan1"),
        ("weight", 2020, "jan1"),
        ("height", 2020, "jan1"),
    ],
)
new_df.columns = ["species", "class", "year", "month", "value"]

# Make separate dataframes for "cats" and "dogs" and store them in a list
temp_dfs = []
for species in new_df["species"].unique():
    temp_df = new_df.loc[new_df["species"] == species, :]
    temp_df = temp_df.rename(columns={"value": species}).drop(columns="species")
    temp_dfs.append(temp_df)

# Merge "cats" and "dogs"
final_df = temp_dfs[0]
for temp_df in temp_dfs[1:]:
    final_df = pd.merge(final_df, temp_df, on=["class", "year", "month"], how="outer")

And so:

print(final_df)
# Output
    class  year month  cat  dog
0  weight  2019  jan1    1    3
1  height  2019  jan1    2    4
2  weight  2020  jan1    4    4
3  height  2020  jan1    3    5

How to handle duplicated or blank columns and add number in duplicated ones in python dataframe?

1 Answers1