I have two dataframes that I want to compare. I already know that the values in dataframe one (df1$BP) are not within the range of values in dataframe two (df2$START and df2$STOP), but I want to return the row in dataframe two where df2$START or df2$STOP is closest in value to df1$BP, where the "Chr" column matches between datasets (df1$Chr, df2$Chr).
I have managed to do this (see bottom of Q), but it seems SERIOUSLY unwieldy, and I wondered if there was a more parsimonious manner of achieving the same thing.
So for dataframe one (df1), we have:
df1=data.frame(SNP=c("rs79247094","rs13325007"),
Chr=c(2,3),
BP=c(48554955,107916058))
df1
SNP Chr BP
rs79247094 2 48554955
rs13325007 3 107916058
For dataframe two, we have:
df2=data.frame(clump=c(1,2,3,4,5,6,7,8),
Chr=c(2,2,2,2,3,3,3,3),
START=c(28033538,37576136,58143438,60389362,80814042,107379837,136288405,161777035),
STOP=c(27451538,36998607,57845065,60242162,79814042,107118837,135530405,161092491))
df2
Clump Chr START STOP
1 2 28033538 27451538
2 2 37576136 36998607
3 2 58143438 57845065
4 2 60389362 60242162
5 3 80814042 79814042
6 3 107379837 107118837
7 3 136288405 135530405
8 3 161777035 161092491
I am interested in returning which START/STOP values are closest to BP. Ideally, I could return the row, and what the difference between BP and START or STOP is (df3$Dist), like:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
I have found similar questions, for example: Return rows establishing a "closest value to" in R
But these are finding the closest values based on a fixed value, rather than a value that changes (and matching on the Chr column).
My long winded method is:
df3<-right_join(df1,df2,by="Chr")
to give me all of the combinations of df1 and df2 together.
df3$start_dist<-abs(df3$START-df3$BP)
to create a column with the absolute difference between START and BP
df3$stop_dist<-abs(df3$STOP-df3$BP)
to create a column with the absolute difference between STOP and BP
df3$dist.compare<-ifelse(df3$start_dist<df3$stop_dist,df3$start_dist,df3$stop_dist)
df3<-df3[with(df3,order(SNP,"dist.compare")),]
to create a column (dist.compare) which prints the smallest difference between BP and START or STOP (and the re-order by that column)
df3<- df3 %>% group_by(SNP) %>% mutate(Dist = first(dist.compare))
to create a column (Dist) which prints the minimum value from df3$dist.compare
df3<-df3[which(df3$dist.compare==df3$Dist),c("clump","Chr","START","STOP","SNP","BP","Dist")]
df3<-df3[order(df3$clump),]
to only print rows where dist.compare matches Dist (so the minimum value), and drop the intermediate columns, and tidy up by re-ordering by clump. Now that gets me to where I want to be:
df3
Clump Chr START STOP SNP BP Dist
3 2 58143438 57845065 rs79247094 48554955 9290110
6 3 107379837 107118837 rs13325007 107916058 536221
But I feel like it is very very convoluted, and wondered if anyone had any tips on how to refine that process?
thanks in advance