1

BACKGROUND: Large excel mapping file with about 100 columns and 200 rows converted to .csv. Then stored as dataframe. General format of df as below.

Starts with a named column (e.g. Sales) and following two columns need to be renamed. This pattern needs to be repeated for all columns in excel file.

Essentially: Link the subsequent 2 columns to the "parent" one preceding them.

 Sales Unnamed: 2  Unnamed: 3  Validation Unnamed: 5 Unnamed: 6
0       Commented  No comment             Commented  No comment                                   
1     x                                             x                        
2                            x          x                                                
3                x                                             x 

APPROACH FOR SOLUTION: I assume it would be possible to begin with an index (e.g. index of Sales column 1 = x) and then rename the following two columns as (x+1) and (x+2). Then take in the text for the next named column (e.g. Validation) and so on.

I know the rename() function for dataframes.

BUT, not sure how to apply the iteratively for changing column titles.

EXPECTED OUTPUT: Unnamed 2 & 3 changed to Sales_Commented and Sales_No_Comment, respectively.

Similarly Unnamed 5 & 6 change to Validation_Commented and Validation_No_Comment.

Again, repeated for all 100 columns of file.

EDIT: Due to the large number of cols in the file, creating a manual list to store column names is not a viable solution. I have already seen this elsewhere on SO. Also, the amount of columns and departments (Sales, Validation) changes in different excel files with the mapping. So a dynamic solution is required.

  Sales Sales_Commented Sales_No_Comment Validation Validation_Commented Validation_No_Comment
0             Commented       No comment                       Commented            No comment
1     x                                                                x                      
2                                      x                                                      
3                     x                           x                                          x

As a python novice, I considered a possible approach for the solution using the limited knowledge I have, but not sure what this would look like as a workable code.

I would appreciate all help and guidance.

mvx
  • 320
  • 4
  • 15

1 Answers1

2

1.You need is to make a list with the column names that you would want.
2.Make it a dict with the old column names as the keys and new column name as the values.
3. Use df.rename(columns = your_dictionary).

import numpy as np
import pandas as pd
df = pd.read_excel("name of the excel file",sheet_name = "name of sheet")


print(df.head()) 
Output>>>
    Sales   Unnamed : 2     Unnamed : 3     Validation  Unnamed : 5     Unnamed : 6     Unnamed :7
0   NaN     Commented   No comment  NaN     Comment     No comment  Extra
1   1.0     2   1   1.0     1   1   1
2   3.0     1   1   1.0     1   1   1
3   4.0     3   4   5.0     5   6   6
4   5.0     1   1   1.0     21  3   6

# get new names based on the values of a previous named column
new_column_names = []
counter = 0
for col_name in df.columns:

    if (col_name[:7].strip()=="Unnamed"):

        new_column_names.append(base_name+"_"+df.iloc[0,counter].replace(" ", "_"))
    else:
        base_name = col_name
        new_column_names.append(base_name)

    counter +=1


# convert to dict key pair
dictionary = dict(zip(df.columns.tolist(),new_column_names))

# rename columns
df = df.rename(columns=dictionary)

# drop first column
df = df.iloc[1:].reset_index(drop=True)

print(df.head())
Output>>
    Sales   Sales_Commented     Sales_No_comment    Validation  Validation_Comment  Validation_No_comment   Validation_Extra
0   1.0     2   1   1.0     1   1   1
1   3.0     1   1   1.0     1   1   1
2   4.0     3   4   5.0     5   6   6
3   5.0     1   1   1.0     21  3   6
aunsid
  • 397
  • 2
  • 10
  • The actual excel file has way more columns than the 2 examples here. And due the nature of the data, it is not possible to predict which column titles will appear for different files. I would need a dynamic solution and not manual (storing them in list/dict). I saw a replace Old_cols with New_cols solution somewhere on SO, but it didn't work because, again, I have a large file and not a specified number of columns every time. – mvx Aug 26 '19 at 14:41
  • Well then change if condition and the append statement such that every time you get "Unnamed " you will have to replace it with the previously found name+string. May be a counter. So it will be new_column_names.append(base_name+"_"+str(counter)) Or is commented or uncommented below in another row that I read it wrong? – aunsid Aug 26 '19 at 14:45
  • I tried the code. Two questions. **(1)** The dataframe has some empty columns at the end, which are also renamed to the last `Titled Column` in the list. This creates about 30 repeated copies of the `LastColumnName_No_Comment`. Is there a way around this? **(2)** When I apply the `.rename()` function, some of the renamed columns from the new_list are simply deleted. The output of the new dataframe starts from about midway through the new_list. What could be the cause for this? Would you know how to fix it? – mvx Aug 26 '19 at 15:17
  • For 1. You can stop till where you want the loop to end by putting a break on the counter value with an if statement. And then make another loop with whatever you want to name the empty columns (Assuming empty columns have values but no column name) 2. I think that there is ```Named Column`` and more than two unnamed and it goes wrong. I have edited the code for merging row 0 with the column name tell me if that helps. – aunsid Aug 26 '19 at 15:26
  • **(1)** empty columns are just extra columns with no title and no values. Could also be deleted, but I haven't figured out a way how to. since there is already an if-else loop in the code, how do I add another if statement? Do I use elif? Would the break statement be after the else in the code? **(2)** Some error on my part yesterday. Output as expected today with the old code you wrote. Edited your answer to include the old code - waiting for review. – mvx Aug 27 '19 at 09:04
  • 1
    **(1)** you can add another for loop like the one in the code with if else, and use df.empty to see if the column series is empty [link] https://stackoverflow.com/questions/24652417/how-to-check-if-pandas-series-is-empty. **(2)** the new code should work as well. Plus the new code can take in multiple unnamed columns. – aunsid Aug 27 '19 at 14:07