3

I want to sort a dataframe (3106 rows x 24 columns) based on a master list of strings in the first column ('MASTER') so that the strings in each row of the dataframe are aligned if there is a match and if there is no match then print 0 for that cell. The master list contains every possible string in any column but not every string will appear in each column.

This SO post is pretty close to what I'm trying to do:

Align DataFrames in pandas so that data matches in rows

I tried using this approach and realized it just looks to see if each string is a match based on the order in the master column but doesn't search the whole column for a match or do any reordering.

new_data = data[data.apply(lambda x: x['MASTER'] == x, axis=1)]

Here is what I'm trying to accomplish with my data.

BEFORE

MASTER         CENTRAL_NERVOUS  NERVOUS_SYSTEM  NEUROGENESIS
12AEX          9630013A20RIK    12AEX           2610301B20RIK
2610042L04RIK  AARS             2610042L04RIK   9630013A20RIK
2610301B20RIK  AATK             2610301B20RIK   A830010M20RIK
2700046G09RIK  ABCA2            2700046G09RIK   AU040320
31BEX          ABCB1B           31BEX           AATK
38DLP          ABCB6            7A6EX           ABCA2
7A6EX          ABL1             9630013A20RIK   ABCC8
9630013A20RIK  ABL2             97C2            ABI1
97C2           ABR              A830010M20RIK   ABI2
...            ...              ...             ...

AFTER

MASTER         CENTRAL_NERVOUS  NERVOUS_SYSTEM  NEUROGENESIS
12AEX          0                12AEX           0
2610042L04RIK  0                2610042L04RIK   0
2610301B20RIK  0                2610301B20RIK   2610301B20RIK
2700046G09RIK  0                2700046G09RIK   0
31BEX          0                31BEX           0
38DLP          0                0               0
7A6EX          0                7A6EX           0
9630013A20RIK  9630013A20RIK    9630013A20RIK   9630013A20RIK
97C2           0                97C2            0

I've been trying to do this with pandas/python but would be open to solutions w/ awk/bash or R. Any hints/future direction appreciated since I'm stuck at this point.

cutts
  • 105
  • 1
  • 6

2 Answers2

1

We can use lapply and match in R

df[-1] <- lapply(df[-1], function(x) x[match(df$MASTER, x)])
#If need to change to 0 or by default it is NA
df[is.na(df)] <- 0


#         MASTER CENTRAL_NERVOUS NERVOUS_SYSTEM  NEUROGENESIS
#1         12AEX               0          12AEX             0
#2 2610042L04RIK               0  2610042L04RIK             0
#3 2610301B20RIK               0  2610301B20RIK 2610301B20RIK
#4 2700046G09RIK               0  2700046G09RIK             0
#5         31BEX               0          31BEX             0
#6         38DLP               0              0             0
#7         7A6EX               0          7A6EX             0
#8 9630013A20RIK   9630013A20RIK  9630013A20RIK 9630013A20RIK
#9          97C2               0           97C2             0

and similarly using dplyr

library(dplyr)
df %>% mutate_at(-1, ~.[match(MASTER, .)]) 

data

If you want to convert the values to 0 make sure that the columns are characters and not factors.

df <- structure(list(MASTER = c("12AEX", "2610042L04RIK", "2610301B20RIK", 
"2700046G09RIK", "31BEX", "38DLP", "7A6EX", "9630013A20RIK", 
"97C2"), CENTRAL_NERVOUS = c("9630013A20RIK", "AARS", "AATK", 
"ABCA2", "ABCB1B", "ABCB6", "ABL1", "ABL2", "ABR"), NERVOUS_SYSTEM = c("12AEX", 
"2610042L04RIK", "2610301B20RIK", "2700046G09RIK", "31BEX", "7A6EX", 
"9630013A20RIK", "97C2", "A830010M20RIK"), NEUROGENESIS = c("2610301B20RIK", 
"9630013A20RIK", "A830010M20RIK", "AU040320", "AATK", "ABCA2", 
"ABCC8", "ABI1", "ABI2")), class = "data.frame", row.names = c(NA, -9L))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

IIUC you can use applymap or apply in Pandas. I'm unsure about the sort order in your example.

df[df.applymap(lambda x: x in df.MASTER.tolist())].fillna(0)

or

df[df.apply(lambda col: col.isin(df.MASTER))].fillna(0)

Result:

|   | MASTER        | CENTRAL_NERVOUS | NERVOUS_SYSTEM | NEUROGENESIS  |
|---|---------------|-----------------|----------------|---------------|
| 0 | 12AEX         | 9630013A20RIK   | 12AEX          | 2610301B20RIK |
| 1 | 2610042L04RIK | 0               | 2610042L04RIK  | 9630013A20RIK |
| 2 | 2610301B20RIK | 0               | 2610301B20RIK  | 0             |
| 3 | 2700046G09RIK | 0               | 2700046G09RIK  | 0             |
| 4 | 31BEX         | 0               | 31BEX          | 0             |
| 5 | 38DLP         | 0               | 7A6EX          | 0             |
| 6 | 7A6EX         | 0               | 9630013A20RIK  | 0             |
| 7 | 9630013A20RIK | 0               | 97C2           | 0             |
| 8 | 97C2          | 0               | 0              | 0             |

And if you want the rows sorted by MASTER then use below instead of using apply.

dfs = np.split(df,len(df.columns), axis=1)
dfs = [df.set_index(df.columns[0], drop=False) for df in dfs]
dfs[0].join(dfs[1:]).reset_index(drop=True).fillna(0)

Result:

|   | MASTER        | CENTRAL_NERVOUS | NERVOUS_SYSTEM | NEUROGENESIS  |
|---|---------------|-----------------|----------------|---------------|
| 0 | 12AEX         | 0               | 12AEX          | 0             |
| 1 | 2610042L04RIK | 0               | 2610042L04RIK  | 0             |
| 2 | 2610301B20RIK | 0               | 2610301B20RIK  | 2610301B20RIK |
| 3 | 2700046G09RIK | 0               | 2700046G09RIK  | 0             |
| 4 | 31BEX         | 0               | 31BEX          | 0             |
| 5 | 38DLP         | 0               | 0              | 0             |
| 6 | 7A6EX         | 0               | 7A6EX          | 0             |
| 7 | 9630013A20RIK | 9630013A20RIK   | 9630013A20RIK  | 9630013A20RIK |
| 8 | 97C2          | 0               | 97C2           | 0             |
Michael Gardner
  • 1,693
  • 1
  • 11
  • 13
  • I mixed up the order a bit in my expected output but I've corrected it. The second expected output is what I was shooting for. My machine ran out of memory to implement the solution though. – cutts Oct 11 '19 at 06:40