1

Im trying to write a script for few ETL transformations. I have 34 fixed columns i.e. df1, according to which I have to map the column name of different input files containing different columns i.e. df2.

df1(Standard Columns):

df1 : Standard Columns

df2:

df2: input file

I have tried df.merge but that does not seem to solve my problem.

The expected result is the columns in the input file df2 to be mapped with same column name as df1 and same order as they appaer in df2with its original value intact.

Expected Result :

Result

any help will be greatly appreciated !!

K.S
  • 113
  • 13

1 Answers1

1

A way to do this would be to have an intermediate step of mapping the columns. For instance:

df2.rename(columns = {'Department Code':'Field 1 Dept Number','Column2':'2_column', .....})

And then you can merge the two dataframes on the columns of interest.

asimo
  • 2,340
  • 11
  • 29
  • Thanks, this could be helpful, but there are 34 standard columns for df1, and columns in df2 might vary in name as well as number and order they appear. Is there any dynamic way of doing this. – K.S May 07 '19 at 06:30
  • 1
    I understand what you are saying..perhaps one faster way would be to match regular expression in the columns..But ultimately you are going to need to verify the mapping is happening correctly to get your desired output(which is merging the df's rightly) – asimo May 07 '19 at 06:35
  • 1
    @K.S Assuming your `df1` has all labels in column 'A', then you could create the mapping dict like `head_map={k : df1['A'].values[i] for i,k in enumerate(df2.columns.values)}` and use this as `df2.rename(columns=head_map)`, but only if your `df1` contains all required labels, and if they are in the correct order already. – Asmus May 07 '19 at 06:36
  • @K.S Please upvote if my answer has been worthwhile – asimo May 07 '19 at 06:42
  • @Asmus I got a `KeyError: 'A'`, i tried doing `head_map={k : df1[1:].values[i] for i,k in enumerate(df2.columns.values)}`, this seems to work, but `df.rename` is not working. The output file is same as the input. – K.S May 07 '19 at 08:08
  • @asimo Will your approach keep the original value same ? – K.S May 07 '19 at 08:08
  • @K.S well you will have to use the actual name of the respective column in df1; if you're unsure, try `print(df.head(5))` and pick the actual column label which contains all your `FIELD …` labels. Did you remember to actually *store* the renamed `df2`, i.e. `df2=df2.rename(columns=head_map)`? – Asmus May 07 '19 at 08:23
  • @Asmus The column `A` contains the actual name. That's the only column in that dataframe and csv file. Also I tried `df2=df2.rename(columns=head_map)` and it changes the `column name` but value is not correct for that data. The result im getting is `['FIELD 3\xa0 SubClass number\xa0\xa0\xa0\xa0 '] ['FIELD 4\xa0 SubSubClass number\xa0 '] ['FIELD 5\xa0 SKU\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 '] ['FIELD 6\xa0 Desc\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 ']` something like this with value for earlier column. – K.S May 07 '19 at 08:33
  • Try removing `\xa0` (the non-breaking whitespace) via `df1['A']=df1['A'].astype(str).str.replace(r'\\xa0', '')` *before declaring* the `head_map`. – Asmus May 07 '19 at 08:46
  • @Asmus getting error `AttributeError: 'DataFrame' object has no attribute 'str'` – K.S May 07 '19 at 09:11
  • @K.S You could have a look [at this](https://stackoverflow.com/q/55643467/565489), for example. – Asmus May 07 '19 at 09:54
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/192955/discussion-between-k-s-and-asmus). – K.S May 07 '19 at 10:01