0

I have a dataframe that looks like this:

Col1  | Col2  | Col1  | Col3  | Col1  | Col4
  a   |   d   |       |   h   |   a   |   p
  b   |   e   |   b   |   i   |   b   |   l
      |   l   |   a   |   l   |       |   a
  l   |   r   |   l   |   a   |   l   |   x
  a   |   i   |   a   |   w   |       |   i
      |   c   |       |   i   |   r   |   c
  d   |   o   |   d   |   e   |   d   |   o

Col1 is repeated multiple times in the dataframe. In each Col1, there is missing information. I need to create a new column that has all of the information from each Col1 occurrence.

How can I create a column with the complete information and then delete the previous duplicate columns?

Some information may be missing from multiple columns. This script is also meant to be used in the future when there could be one, three, five, or any number of duplicated Col1 columns.

The desired output looks like this:

Col2  | Col3  | Col4  | Col5
  d   |   h   |   p   |   a
  e   |   i   |   l   |   b
  l   |   l   |   a   |   a
  r   |   a   |   x   |   l
  i   |   w   |   i   |   a
  c   |   i   |   c   |   r
  o   |   e   |   o   |   d

I have been looking over this question but it is not clear to me how I could keep the desired Col1 with complete values. I could delete multiple columns of the same name but I need to first create a column with complete information.

adin
  • 783
  • 3
  • 13
  • 27
  • They all have the same name? – AMC Dec 18 '19 at 01:29
  • 2
    You need to give more info. How do you come up with values of `Col1` and `Col5` of the desired output? why is `Col1` of the output the same as `Col4` of the sample data? – Andy L. Dec 18 '19 at 01:42
  • 2
    Are you **certain** that when the column is duplicated the values in each row are always the same when not missing? It's `groupby` + `first` in that case. – ALollz Dec 18 '19 at 01:45
  • I can't get a dataframe to have the same name columns, it renames the to Col1.1 and Col1.2 , etc – oppressionslayer Dec 18 '19 at 01:49
  • @oppressionslayer that because pandas really doesn't want them to be duplicated and in general it's not great when they are (though in this case it seems useful). You can get around that by manually providing a list to set columns: `df.columns = ['Col1', 'Col2', 'Col1', 'Col3', 'Col1', 'Col4']` – ALollz Dec 18 '19 at 01:50
  • Your expected output is wrong. Col1 does not contain `p` or `x` or `i` in the input but you show in the output – moys Dec 18 '19 at 01:58
  • 1
    @ALollz that worked, thx! – oppressionslayer Dec 18 '19 at 02:02

2 Answers2

2

First replace empty values in your columns with nan as below:

import numpy as np
df = df.replace(r'^\s*$', np.nan, regex=True)

Then, you could use groupby and then first()

df.groupby(level = 0, axis = 1).first() 
Santosh M.
  • 2,356
  • 1
  • 17
  • 29
  • This seemed to work but I want to note it shifted reordered the columns. I am accepting this answer because in the original question I did not specify that column order must remain the same. – adin Dec 18 '19 at 13:15
0

May be something like this is what you are looking for.

col_list = list(set(df.columns))
dicts={}
for col in col_list:
    val = list(filter(None,set(df.filter(like=col).stack().reset_index()[0].str.strip(' ').tolist())))
    dicts[col]= val
max_len=max([len(k) for k in dicts.values()])
pd.DataFrame({k:pd.Series(v[:max_len]) for k,v in dicts.items()})

output

   Col3     Col4    Col1    Col2
0   h          i    d       d
1   w          l    b       r
2   i          c    r       i
3   l          x    l       l
4   a          p    a       o
5   e          o    NaN     c
6   NaN        a    NaN     e
moys
  • 7,747
  • 2
  • 11
  • 42