0

I have two DataFiles, and I need to match and extract the Information

File1

Chr,Start,End,ID
6,38517417,38517437,kgp17152035
6,38517556,38517576,rs4254983
6,38517997,38518017,kgp10250023
6,38519465,38519485,kgp17245206
6,38519751,38519771,kgp8446980

File2

Gene_ID,Gene_Name,RE_Locus
ENSG00000112164.5,GLP1R,chr6:39041458-39041477
ENSG00000112164.5,GLP1R,chr6:39053087-39053106
ENSG00000112164.5,GLP1R,chr6:39049954-39049973
ENSG00000112164.5,GLP1R,chr6:39041701-39041720
ENSG00000112164.5,GLP1R,chr6:39047953-39047972

Here I would like to Match the range of the RE_Locus Column in File2 with File1 (Start & End Columns) as range and Whenever it finds a matching range it has to give their ID

Desired Output:

ID
tsp17152035
rs874983
kgp10250023
rsp17245206
ex8446980
paul raj
  • 13
  • 6

1 Answers1

1

Check below code

import pandas as pd

df = pd.read_csv('Book1 copy.csv') #file1

df1 = pd.read_csv('Book1.csv') #file2

pd.merge(df, df1, 
         left_on=(df['Start'].astype('str')+'-'+df['End'].astype('str'))
         ,right_on = df1['RE_Locus'].str.slice(start=5 ), how='inner')[['ID']]

Output:

enter image description here

Abhishek
  • 1,585
  • 2
  • 12
  • 15
  • Hello @Abhishek, small correction in File2. The above code works fine for previous files. Now in file2 like above, I have a range which lies at start & end of file1 this time it has to map and return valid ID – paul raj Jul 11 '22 at 06:53
  • I am not sure what the change. Is above code not working? – Abhishek Jul 11 '22 at 06:56
  • File1 has 1500 rows and File2 has 161 rows. In fact, I have just given the sample data above. This time I need to check the RE_Locsus range over file1 (Start & End Columns) and extract ID whenever it finds a match – paul raj Jul 11 '22 at 07:03
  • Hi, I am not really sure if I understood the problem. You output says "sp17152035" which is not available in File1 – Abhishek Jul 11 '22 at 07:26
  • Ya What I mean to say is File1 has 1500 rows, so "sp17152035" this id may lie somewhere which matches the particular range – paul raj Jul 11 '22 at 07:35
  • need python logic how to iterate over 1500 rows and find match range and gives id as output – paul raj Jul 11 '22 at 07:37
  • you dont need loop, above code will work any number of rows as in this MERGE is getting used, just read each data in respective dataframe – Abhishek Jul 11 '22 at 08:50
  • As mentioned in comment - file1 goes in df & file2 goes in df1 – Abhishek Jul 11 '22 at 08:57