11

I have a question to merge two columns into one in the same dataframe(start_end), also remove null value. I intend to merge 'Start station' and 'End station' into 'station', and keep 'duration' according to the new column 'station'. I have tried pd.merge, pd.concat, pd.append, but I cannot work it out.

dataFrame of Start_end:

    Duration    End station     Start station
14  1407        NaN             14th & V St NW
19  509         NaN             21st & I St NW
20  638         15th & P St NW.  NaN
27  1532        NaN              Massachusetts Ave & Dupont Circle NW
28  759         NaN              Adams Mill & Columbia Rd NW

Expected output:

    Duration    stations
14  1407        14th & V St NW
19  509         21st & I St NW
20  638         15th & P St NW
27  1532        Massachusetts Ave & Dupont Circle NW
28  759         Adams Mill & Columbia Rd NW

Code i have so far:

#start_end is the dataframe, 'start station', 'end station', 'duration'
start_end = pd.concat([df_start, df_end])

This is what I attempted to:

station = pd.merge([start_end['Start station'],start_end['End station']])
Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
BCKN
  • 191
  • 1
  • 2
  • 12
  • Define "merge" in the context of a column. Please provide a [mcve] with data, input, and expected output. Your image isn't really helping. – cs95 Jun 03 '18 at 01:49
  • What do you want the result to look like? Does `End_Station` have values other than `NaN`, if so how are they combined with `Start_Station` values? – wwii Jun 03 '18 at 01:53
  • What's the better way (what code) for me to extract/describe the dataframe structure? – BCKN Jun 03 '18 at 01:56
  • @wwii, if there is NaN, just remove the NaN value. If 'Start station' has NaN, just keep, 'End station' value, vice versa. – BCKN Jun 03 '18 at 02:00
  • What if both columns have a value other than `NaN` for a row? – wwii Jun 03 '18 at 02:05
  • @wwii, then remove the whole row. – BCKN Jun 03 '18 at 02:12
  • @wwii, I have checked, there always will be a NaN for either. 'Start station' or 'End station' . But not both. – BCKN Jun 03 '18 at 02:35

4 Answers4

14

fillna

If NaN are truly nulls

df.assign(**{
    'Start station': df['Start station'].fillna(df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW

mask

If NaN are strings

df.assign(**{
    'Start station': df['Start station'].mask(
        lambda x: x == 'NaN', df['End station'])})

    Duration      End station                         Start station
14      1407              NaN                        14th & V St NW
19       509              NaN                        21st & I St NW
20       638  15th & P St NW.                       15th & P St NW.
27      1532              NaN  Massachusetts Ave & Dupont Circle NW
28       759              NaN           Adams Mill & Columbia Rd NW
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • `NaN` are truly nulls. This approach works for the `'Start station'`, with `'End station'` value copied over to `'Start station '`. However, both columns do not merge into one. The expected output is, `'Start station'` and `'End station'` into one column `'stations'`, and keep `'Duration'` accordingly. – BCKN Jun 03 '18 at 03:53
  • 2
    @BCKN my apologies but I'm taken aback by the comment. This post contains all the necessary information to accomplish your task. I may not have presented it in a fashion that looks like a silver platter but that's because I expected you to lift from my answer what you needed. What bothers me is that you seem to expect that I put this on the platter you wanted. It may serve to remind you that everyone who posts answers on SO are volunteers who could very well be spending their time doing something else and not helping you. If I misunderstood you, please let me know. – piRSquared Jun 03 '18 at 04:01
  • sorry about that, I didn't think hard enough. Actually, I can just reuse what you suggested, `newEnd.assign(**{'End station': newEnd['End station'].fillna(newEnd['Start station'])})`, and copy `Start station` to `End Station`. Then `.drop()` either one of them. – BCKN Jun 03 '18 at 04:40
6

Using combine_first. replaces null values in col1 with col2

df["station"] = df["End station"].combine_first(df["Start station"])
df.drop(["End station", "Start station"], 1, inplace=True)
sjd
  • 1,329
  • 4
  • 28
  • 48
5
>>> df
   Duration      End station                         Start station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Give the two columns the same name

>>> df.columns = df.columns.str.replace('.*?station', 'station')
>>> df
   Duration          station                               station
0      1407              NaN                        14th & V St NW
1       509              NaN                        21st & I St NW
2       638  15th & P St NW.                                   NaN
3      1532              NaN  Massachusetts Ave & Dupont Circle NW
4       759              NaN           Adams Mill & Columbia Rd NW

Stack then unstack.

>>> s = df.stack()
>>> s
0  Duration                                    1407
   station                           14th & V St NW
1  Duration                                     509
   station                           21st & I St NW
2  Duration                                     638
   station                          15th & P St NW.
3  Duration                                    1532
   station     Massachusetts Ave & Dupont Circle NW
4  Duration                                     759
   station              Adams Mill & Columbia Rd NW
dtype: object
>>> df = s.unstack()
>>> df
  Duration                               station
0     1407                        14th & V St NW
1      509                        21st & I St NW
2      638                       15th & P St NW.
3     1532  Massachusetts Ave & Dupont Circle NW
4      759           Adams Mill & Columbia Rd NW
>>> 

This is how I think this works:

.stack creates a series with a MultiIndex and takes care of the null values for you. It aligns the second level on the column names and because the column names are the same there is only one - unstacking just produces a single column.

That's really just a guess based on the differences between Index's if you don't change the column names.

>>> # without changing column names
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'End station', 'Start station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 2, 0, 2, 0, 1, 0, 2, 0, 2]])

>>> # column names the same
>>> s.index
MultiIndex(levels=[[0, 1, 2, 3, 4], ['Duration', 'station']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]])

Seems a bit tricky, maybe someone will comment on it.


Alternative - Using pd.concat and .dropna

>>> stations = pd.concat([df.iloc[:,1],df.iloc[:,2]]).dropna()
>>> stations.name = 'stations'
>>> stations
2                         15th & P St NW.
0                          14th & V St NW
1                          21st & I St NW
3    Massachusetts Ave & Dupont Circle NW
4             Adams Mill & Columbia Rd NW
Name: stations, dtype: object

>>> df2 = pd.concat([df['Duration'], stations], axis=1)
>>> df2
   Duration                              stations
0      1407                        14th & V St NW
1       509                        21st & I St NW
2       638                       15th & P St NW.
3      1532  Massachusetts Ave & Dupont Circle NW
4       759           Adams Mill & Columbia Rd NW
wwii
  • 23,232
  • 7
  • 37
  • 77
  • Found an easy answer here: https://stackoverflow.com/questions/71362967/concatenate-two-columns-values-pandas However, I managed to use pd.concat( ) from your code. Thanks ! – Hyder khan Mar 04 '23 at 06:00
1

Using ffill

df.iloc[:,2:4]=df.iloc[:,2:4].ffill(1)
BENY
  • 317,841
  • 20
  • 164
  • 234
  • This approach only works for the 'End station', and 'Duration' is missing. The expected output is, 'Start station' and 'End station' into one column 'station', and keep 'Duration' accordingly. – BCKN Jun 03 '18 at 03:37