1

I have two dataframe which I'm trying to join which should be straight forward but I see some anomalous behavior.

Dataframe A

Name  Sample           Country      Path
John   S18902             UK     /Home/drive/John 
BOB    135671             USA        /Home/drive/BOB
Tim    GB12345_serum_63   UK     /Home/drive/Tim 
Wayne  12345_6789         UK     /Home/drive/Wayne

Dataframe B

 Surname    Sample            State        FILE
  Paul     S18902              NJ        John.csv 
  Gem      135671              PP        BOB.csv
  Love     GB12345_serum_63    AP        Tim.csv 
  Dave     12345_6789          MK        Wayne.csv

I am using R markdown to do a simple join using the following command Dataframec <- DataframeA %>% left_join(DataframeB ,by = "Sample",all.x=T )

All rows join apart from the row where sample== GB12345_serum_63

There should be a simple fix to this but I'm out of ideas.

Thank you

Kp_Data
  • 21
  • 3
  • Can you paste into your question the output of `dput()` for each data sample (e.g., `dput(Dataframe_A[1:4,])`, assuming the first data frame is called `Dataframe_A` and you've provided the first 4 rows in your question). This will allow us to look at the exact structure of each data frame. Also, you can remove `all.x=T`. `left_join` inherently keeps all rows of the first data frame and the `all.x` argument has no effect. – eipi10 Mar 12 '19 at 16:41
  • 1
    I suspect there's whitespace in one or both of the Sample columsn in your dataframes. Can you check if DataframeA$Sample["GB12345_serum_63"] == DataFrameB$sample["GB12345_serum_63"] if that's false, try str_trim on your Sample columns. read up `?str_trim – infominer Mar 12 '19 at 16:44
  • @infominer DataframeA$Sample["GB12345_serum_63"] == DataFrameB$sample["GB12345_serum_63"] gives me an NA which is interesting because i can see these samples in the excell sheet. What's also interesting is "GB12345_serum_63" %in% DataframeB gives me a false. – Kp_Data Mar 12 '19 at 17:06
  • 1
    @kp_data, yup extra whitespace or maybe even an apostrophe in the cell that won't be displayed unless you click on the cell and see the formula bar. Do a trim_ws= TRUE when you use read_excel. Also make sure it's not being converted to factor – infominer Mar 12 '19 at 17:19

1 Answers1

1

If you are cutting-and-pasting your data directly into your question then the reason for this is because your key values are technically different due to having different numbers of spaces.

I cut and paste from your question from the beginning of the value to the start of the adjacent column name. So to 'country' in the first case and to 'state' in the second case DataframeA: "GB12345_serum_63" DataframeB: "GB12345_serum_63 "

You can see for DataframeB there are 3 space characters after the value. This can be resolved by removing extra whitespace from your key values as follows using regular expressions: gsub("^\\s+|\\s+$", "", x)

DataframeA$Sample <- gsub("^\\s+|\\s+$", "", DataframeA$Sample)
DataframeB$Sample <- gsub("^\\s+|\\s+$", "", DataframeB$Sample)

Now your join should work

Soren
  • 1,792
  • 1
  • 13
  • 16
  • 1
    base R has a built-in function for this: `?trimws` ... (also `stringr::str_trim`) – Ben Bolker Mar 12 '19 at 17:56
  • Thanks, good to know! All this while I've been using regex. Great to learn new base functions hiding in plain sight! :) – Soren Mar 12 '19 at 18:00