Sort/Align dataframe columns so that row string values match master list column, if no match in column print 0

Question

I want to sort a dataframe (3106 rows x 24 columns) based on a master list of strings in the first column ('MASTER') so that the strings in each row of the dataframe are aligned if there is a match and if there is no match then print 0 for that cell. The master list contains every possible string in any column but not every string will appear in each column.

This SO post is pretty close to what I'm trying to do:

Align DataFrames in pandas so that data matches in rows

I tried using this approach and realized it just looks to see if each string is a match based on the order in the master column but doesn't search the whole column for a match or do any reordering.

new_data = data[data.apply(lambda x: x['MASTER'] == x, axis=1)]

Here is what I'm trying to accomplish with my data.

BEFORE

MASTER         CENTRAL_NERVOUS  NERVOUS_SYSTEM  NEUROGENESIS
12AEX          9630013A20RIK    12AEX           2610301B20RIK
2610042L04RIK  AARS             2610042L04RIK   9630013A20RIK
2610301B20RIK  AATK             2610301B20RIK   A830010M20RIK
2700046G09RIK  ABCA2            2700046G09RIK   AU040320
31BEX          ABCB1B           31BEX           AATK
38DLP          ABCB6            7A6EX           ABCA2
7A6EX          ABL1             9630013A20RIK   ABCC8
9630013A20RIK  ABL2             97C2            ABI1
97C2           ABR              A830010M20RIK   ABI2
...            ...              ...             ...

AFTER

MASTER         CENTRAL_NERVOUS  NERVOUS_SYSTEM  NEUROGENESIS
12AEX          0                12AEX           0
2610042L04RIK  0                2610042L04RIK   0
2610301B20RIK  0                2610301B20RIK   2610301B20RIK
2700046G09RIK  0                2700046G09RIK   0
31BEX          0                31BEX           0
38DLP          0                0               0
7A6EX          0                7A6EX           0
9630013A20RIK  9630013A20RIK    9630013A20RIK   9630013A20RIK
97C2           0                97C2            0

I've been trying to do this with pandas/python but would be open to solutions w/ awk/bash or R. Any hints/future direction appreciated since I'm stuck at this point.

Ronak Shah · Accepted Answer · 2019-10-11T02:07:03.560

1

We can use lapply and match in R

df[-1] <- lapply(df[-1], function(x) x[match(df$MASTER, x)])
#If need to change to 0 or by default it is NA
df[is.na(df)] <- 0


#         MASTER CENTRAL_NERVOUS NERVOUS_SYSTEM  NEUROGENESIS
#1         12AEX               0          12AEX             0
#2 2610042L04RIK               0  2610042L04RIK             0
#3 2610301B20RIK               0  2610301B20RIK 2610301B20RIK
#4 2700046G09RIK               0  2700046G09RIK             0
#5         31BEX               0          31BEX             0
#6         38DLP               0              0             0
#7         7A6EX               0          7A6EX             0
#8 9630013A20RIK   9630013A20RIK  9630013A20RIK 9630013A20RIK
#9          97C2               0           97C2             0

and similarly using dplyr

library(dplyr)
df %>% mutate_at(-1, ~.[match(MASTER, .)])

data

If you want to convert the values to 0 make sure that the columns are characters and not factors.

df <- structure(list(MASTER = c("12AEX", "2610042L04RIK", "2610301B20RIK", 
"2700046G09RIK", "31BEX", "38DLP", "7A6EX", "9630013A20RIK", 
"97C2"), CENTRAL_NERVOUS = c("9630013A20RIK", "AARS", "AATK", 
"ABCA2", "ABCB1B", "ABCB6", "ABL1", "ABL2", "ABR"), NERVOUS_SYSTEM = c("12AEX", 
"2610042L04RIK", "2610301B20RIK", "2700046G09RIK", "31BEX", "7A6EX", 
"9630013A20RIK", "97C2", "A830010M20RIK"), NEUROGENESIS = c("2610301B20RIK", 
"9630013A20RIK", "A830010M20RIK", "AU040320", "AATK", "ABCA2", 
"ABCC8", "ABI1", "ABI2")), class = "data.frame", row.names = c(NA, -9L))

edited Oct 11 '19 at 02:07

answered Oct 11 '19 at 01:42

Ronak Shah

377,200
20
156
213

if at all you are using matching, you can do `match(df$MASTER, x,0)` that will take care of the `NA` – Onyambu Oct 11 '19 at 01:43
The only problem with the answer is that you have the result in the 8th row for the second column but OP wants it in the 3rd row – Onyambu Oct 11 '19 at 01:45
Why would it be 3rd row? – Ronak Shah Oct 11 '19 at 01:46
I still dont know Check OP's solution. – Onyambu Oct 11 '19 at 01:47
I thought that was a mistake from OP's side since rest all of the columns are in sync with `MASTER`. I'll clarify. – Ronak Shah Oct 11 '19 at 01:48
check the last column too – Onyambu Oct 11 '19 at 01:50
Sorry I made a mistake with the expected output. I fixed above but @RonakShah output was correct. – cutts Oct 11 '19 at 06:39

Michael Gardner · Answer 2 · 2019-10-11T03:41:28.930

IIUC you can use applymap or apply in Pandas. I'm unsure about the sort order in your example.

df[df.applymap(lambda x: x in df.MASTER.tolist())].fillna(0)

or

df[df.apply(lambda col: col.isin(df.MASTER))].fillna(0)

Result:

|   | MASTER        | CENTRAL_NERVOUS | NERVOUS_SYSTEM | NEUROGENESIS  |
|---|---------------|-----------------|----------------|---------------|
| 0 | 12AEX         | 9630013A20RIK   | 12AEX          | 2610301B20RIK |
| 1 | 2610042L04RIK | 0               | 2610042L04RIK  | 9630013A20RIK |
| 2 | 2610301B20RIK | 0               | 2610301B20RIK  | 0             |
| 3 | 2700046G09RIK | 0               | 2700046G09RIK  | 0             |
| 4 | 31BEX         | 0               | 31BEX          | 0             |
| 5 | 38DLP         | 0               | 7A6EX          | 0             |
| 6 | 7A6EX         | 0               | 9630013A20RIK  | 0             |
| 7 | 9630013A20RIK | 0               | 97C2           | 0             |
| 8 | 97C2          | 0               | 0              | 0             |

And if you want the rows sorted by MASTER then use below instead of using apply.

dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)

Result:

|   | MASTER        | CENTRAL_NERVOUS | NERVOUS_SYSTEM | NEUROGENESIS  |
|---|---------------|-----------------|----------------|---------------|
| 0 | 12AEX         | 0               | 12AEX          | 0             |
| 1 | 2610042L04RIK | 0               | 2610042L04RIK  | 0             |
| 2 | 2610301B20RIK | 0               | 2610301B20RIK  | 2610301B20RIK |
| 3 | 2700046G09RIK | 0               | 2700046G09RIK  | 0             |
| 4 | 31BEX         | 0               | 31BEX          | 0             |
| 5 | 38DLP         | 0               | 0              | 0             |
| 6 | 7A6EX         | 0               | 7A6EX          | 0             |
| 7 | 9630013A20RIK | 9630013A20RIK   | 9630013A20RIK  | 9630013A20RIK |
| 8 | 97C2          | 0               | 97C2           | 0             |

I mixed up the order a bit in my expected output but I've corrected it. The second expected output is what I was shooting for. My machine ran out of memory to implement the solution though. — cutts, Oct 11 '19 at 06:40

Sort/Align dataframe columns so that row string values match master list column, if no match in column print 0

2 Answers2

Linked