4

I have a dataframe which looks as follows:

# df

colA     colB          colC

rqp      129            a   
pot      217;345        u
ghay     716            b
rbba     217;345        d
tary     612;811;760    a
kals     716            t

The ColB (any component out of two shown) & ColC combination make a unique combination. I want to create a dataframe from this dataframe which will look like the following

# newdf:

colAA      coLBB      

(129,a)    (a,rqp)
(217,u)    (u,pot)
(345,u)    (u,pot)
(716,b)    (b,ghay)
(217,d)    (d,rbba)
(345,d)    (d,rbba)
(612,a)    (a,tary)
(811,a)    (a,tary)
(760,a)    (a,tary)
(716,t)    (t,kals)

I have tried creating new columns if there is a single element in colB, but cannot get how to do using the semicolon splitter and then how to create columns of tuples.

If I do not have any semicolon in ColB, then I can use

df['AA'] = list(zip(df[colB], df[colC]))
df['AB'] = list(zip(df[colC], df[colA]))

However the presence of semicolon in ColB I am getting issues, as how to split that string into multiple ones and assign the tuples. Any help will be very much appreciated.

Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
Stan
  • 786
  • 1
  • 9
  • 25
  • post some code which you already tried – EXODIA Jun 03 '20 at 04:29
  • Thanks Exodia for pointing it out. I have added the code which I tried without the semicolon in colB which is working. However, with the semicolon, I have not yet been able to. – Stan Jun 03 '20 at 04:33
  • You want new columns if there are more element splitted by ```';'``` or is it you want multiple tuples in same column – EXODIA Jun 03 '20 at 04:49

3 Answers3

2

Use, Series.str.split to split the strings of column colB around the delimiter ; then using DataFrame.explode transform each element of a column colB to a row, replicating index values. Then use DataFrame.agg to aggregate the required columns as tuple along axis=1:

df['colB'] = df['colB'].str.split(';')
df = df.explode('colB')
df['ColAA'] = df[['colB', 'colC']].agg(tuple, axis=1)
df['ColBB'] = df[['colC', 'colA']].agg(tuple, axis=1)
df = df[['ColAA', 'ColBB']].reset_index(drop=True)

Result:

# print(df)
     ColAA      ColBB
0  (129, a)   (a, rqp)
1  (217, u)   (u, pot)
2  (345, u)   (u, pot)
3  (716, b)  (b, ghay)
4  (217, d)  (d, rbba)
5  (345, d)  (d, rbba)
6  (612, a)  (a, tary)
7  (811, a)  (a, tary)
8  (760, a)  (a, tary)
9  (716, t)  (t, kals)
Shubham Sharma
  • 68,127
  • 6
  • 24
  • 53
  • 1
    Wonderful. Thank you Shubham. I learnt a new method in explode from you! Sweet man! – Stan Jun 03 '20 at 05:44
0

You can simply use str.split() with expand parameters to split the list and get new columns based on split.

df['colB'].str.split(pat=';',expand=True)

Read more here

EXODIA
  • 908
  • 3
  • 10
  • 28
0

Code

def merge(row):
    return pd.Series({
            "colAA": (row.colB, row.colC),
            "colBB": (row.colC, row.colA),
        })

df['colB'] = df['colB'].str.split(';')
df = df.explode('colB')
newDf = df.apply(merge, axis=1).reset_index(drop=True)

Explanation

You can split colB to get list of values, Then apply explode function to get multiple rows

df['colB'] = df['colB'].str.split(';')
df = df.explode('colB')

# output
    colA    colB    colC
0   rqp 129 a
1   pot 217 u
1   pot 345 u
2   ghay    716 b
3   rbba    217 d

Then apply merge function below to create new data frame

def merge(row):
    for b in row.colB.split(";"):
         return pd.Series({
            "colAA": (b, row.colC),
            "colBB": (row.colC, row.colA),

        })

Then apply this function on Df

newDf = df.apply(merge, axis=1).reset_index(drop=True)

# output
    colAA        colBB
0   (129, a)    (a, rqp)
1   (217, u)    (u, pot)
2   (345, u)    (u, pot)
3   (716, b)    (b, ghay)
4   (217, d)    (d, rbba)
5   (345, d)    (d, rbba)
6   (612, a)    (a, tary)
7   (811, a)    (a, tary)
8   (760, a)    (a, tary)
9   (716, t)    (t, kals)
Anurag Wagh
  • 1,086
  • 6
  • 16