0

I would like to merge two dataframes based on similar patterns in the chromosome column. I made various attempts with R & BASH such as with "data.table" "tidyverse", & merge(). Could someone help me by providing alternative solutions in R, BASH, Python, Perl, etc. for solving this solution? I would like to merge based on the chromosome information and retain both counts/RXNs.

NOTE: These two DFs are not aligned and I am also curious what happens if some values are missing.

Thanks and Cheers:

DF1:

Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"

DF2:

Chromosome;Count1;Count2;Count3;Count4;Count5
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0

Expected Result:

Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
1009250;q9hxn4;5;0;0;17;0
1010820;p16256;152;7;0;11;4
31783;p16588;1;0;0;0;0
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
brian d foy
  • 129,424
  • 31
  • 207
  • 592
  • so `Chromosome1` in `df1` should match `Chromosome2` in `df2`? – Vivek Kalyanarangan May 16 '21 at 18:15
  • 2
    ```Chromosome1``` & ```Chromosome2``` in your example have no match. Do you want to just join the two tables? – justgoodin May 16 '21 at 18:16
  • Can you articulate more what "a matching pattern that is similar" means for your use case? It sounds subjective, and programming solutions work best with objective definitions. – Jon Spring May 16 '21 at 18:56
  • It seems that the expected result here relies on different DF2 data that what you have provided, which will make it impossible for us to test any potential solution. Can you please fix that? – Jon Spring May 16 '21 at 19:01
  • Thank you all for your feedback so far. I adjusted both datasets and clarified the question. I would like to match the 'Chromosome' column head found in both files – Jonathan Josephs-Spaulding May 17 '21 at 04:30

2 Answers2

0

If I understand your request correctly, this should do it in Python. I've made the Chromosome column into the index of each DataFrame.

from io import StringIO

txt1 = '''Chromosome;RXN;ID
1009250;q9hxn4;NA
1010820;p16256;NA
31783;p16588;"PNTOt4;PNTOt4pp"
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;"DHQTi;DQDH"'''

txt2 = """Chromosome;Count1;Count2;Count3;Count4;Count5;Count6
203;1;31;1;0;0;0
1010820;152;7;0;11;4
1009250;5;0;0;17;0
31783;1;0;0;0;0;0"""

df1 = pd.read_csv(
    StringIO(txt1),
    sep=';',
    index_col=0,
    header=0
)

df2 = pd.read_csv(
    StringIO(txt2),
    sep=';',
    index_col=0,
    header=0
)
DF1:
                                         RXN               ID
Chromosome
1009250                               q9hxn4              NaN
1010820                               p16256              NaN
31783                                 p16588  PNTOt4;PNTOt4pp
203         3-DEHYDROQUINATE-DEHYDRATASE-RXN       DHQTi;DQDH

DF2:
            Count1  Count2  Count3  Count4  Count5  Count6
Chromosome
203              1      31       1       0       0     0.0
1010820        152       7       0      11       4     NaN
1009250          5       0       0      17       0     NaN
31783            1       0       0       0       0     0.0
result = pd.concat(
    [df1.sort_index(), df2.sort_index()],
    axis=1
)
print(result)
                                         RXN               ID  Count1  Count2  Count3  Count4  Count5  Count6
Chromosome
203         3-DEHYDROQUINATE-DEHYDRATASE-RXN       DHQTi;DQDH       1      31       1       0       0     0.0
31783                                 p16588  PNTOt4;PNTOt4pp       1       0       0       0       0     0.0
1009250                               q9hxn4              NaN       5       0       0      17       0     NaN
1010820                               p16256              NaN     152       7       0      11       4     NaN

The concat command also handles mismatched indices by simply filling in NaN values for columns in e.g. df1 if df2 doesn't have have the same index, and vice versa.

Nick W
  • 146
  • 9
  • Hey @Nick, that was very useful as a start for me. What does it mean that you made the 'Chromosome' column into an index of each DF? How did you do it in this case? I tried: `df1.set_index('Chromosome',inplace=True) ` But it did not provide the results that I expected. I still have two unique columns with 'Chromosome'. Thanks again ! – Jonathan Josephs-Spaulding May 17 '21 at 10:01
  • Thank you. I've now added the code where I load the data and set the Chromosome column as the index. I've also added a column 'Count6', since it seemed to be missing to match with the number of columns present. – Nick W May 17 '21 at 10:14
  • Thank you for specifying how to set a column as an index However, I am using two large files, not strings. I am continuing to hit the error: `raise InvalidIndexError( pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects` I thought the issue would be that I have duplicates in the row that was indexed, but it has not changed much – Jonathan Josephs-Spaulding May 17 '21 at 12:36
  • `df1 = pd.read_csv('DF1.csv', sep=';',index_col=0,header=0) print (df1) df1.index df2 = pd.read_csv ('DF1.csv', sep=';',index_col=0,header=0) print (df2) df2.index result = pd.concat( [df1.sort_index(), df2.sort_index()], axis=1) #Error arises here print(result) result.to_csv('test.csv', index=False)` – Jonathan Josephs-Spaulding May 17 '21 at 12:38
  • It sounds like there are duplicate Chromosome entries. I'd suggest checking with one of the answers here: https://stackoverflow.com/questions/50242968/check-for-duplicate-values-in-pandas-dataframe-column – Nick W May 17 '21 at 12:45
  • Otherwise, can you share the data files? The error is you get is hard to troubleshoot without all the data. – Nick W May 18 '21 at 11:20
  • Thanks Nick, the issue was with the duplications which I cleared up today. Now I am trying to figure out what the 'Chromosome' columns from both files are not matching with the remaining row data. – Jonathan Josephs-Spaulding May 18 '21 at 12:53
0

As bash was mentioned in the text body, I offer you an awk solution. The dataframes are in files df1 and df2:

$ awk '
BEGIN {
    FS=OFS=";"         # input and output field delimiters
}
NR==FNR {              # process df1
    a[$1]=$2           # hash to an array, 1st is the key, 2nd the value
    next               # process next record
}
{                      # process df2
    $2=(a[$1] OFS $2)  # prepend RXN field to 2nd field of df2
}1' df1 df2            # 1 is output command, mind the file order

The 2 last lines could be written perhaps more clearly:

...
{
    print $1,a[$1],$2,$3,$4,$5,$6
}' df1 df2

Output:

Chromosome;RXN;Count1;Count2;Count3;Count4;Count5
203;3-DEHYDROQUINATE-DEHYDRATASE-RXN;1;31;1;0;0;0
1010820;p16256;152;7;0;11;4
1009250;q9hxn4;5;0;0;17;0
31783;p16588;1;0;0;0;0;0

Output will be in the order of df2. Chromosome present in df1 but not in df2 will not be included. Chromosome in df2 but not in df1 will be output from df2 with empty RXN field. Also, if there are duplicate chromosomes in df1, the last one is used. This can be fixed if it is an issue.

James Brown
  • 36,089
  • 7
  • 43
  • 59
  • 1
    This solution is very elegant and clean, thank you for suggesting this solution. I spent days trying to use AWK, but could not figure it out. You are truly a BASH Wizard :D – Jonathan Josephs-Spaulding May 25 '21 at 11:57