Why is left_join creating NAs when values seem to match in the by="x" argument?

Question

I am trying to carry out a left_join between two dataframes, called multi_scenario and production_targets. I am trying to carry out the join based on the following code, using a left_join based on the matched column "mean_needed" :

library(dplyr)
comb <- left_join(multi_scenario, production_targets, by = "mean_needed")

Here are my two dataframes

1.multi_scenario:

structure(list(scenario_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 
5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
4L, 4L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
3L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L), yield.spect = c(3, 
13, 19, 22, 23, 14, 16, 20, 23, 1, 7, 16, 26, 35, 4, 6, 10, 33, 
47, 5, 6, 10, 28, 64, 7, 8, 10, 29, 59, 7, 12, 28, 35, 36, 12, 
13, 15, 27, 58, 13, 16, 25, 57, 4, 12, 22, 33, 54, 2, 17, 28, 
29, 50, 19, 21, 23, 28, 35, 14, 17, 25, 52, 21, 23, 24, 26, 45, 
20, 21, 23, 30, 45, 9, 21, 31, 57, 12, 14, 50, 0, 10, 37, 44, 
55, 2, 9, 34, 48, 54, 3, 28, 31, 40, 44, 13, 17, 28, 38, 50, 
14, 26, 29, 34, 49, 13, 30, 31, 37, 41, 14, 15, 17, 44, 64, 11, 
13, 15, 56, 57, 7, 25, 30, 44, 47, 2, 30, 32, 38, 58, 18, 25, 
26, 36, 55, 13, 29, 32, 34, 57, 13, 26, 38, 40, 49, 1, 4, 44, 
55, 63, 10, 21, 36, 40, 58, 23, 25, 38, 57, 6, 18, 44, 50, 54, 
17, 27, 32, 39, 63, 14, 20, 34, 51, 60, 5, 6, 55, 56, 57, 11, 
27, 50, 64, 22, 25, 40, 44, 54, 18, 24, 41, 43, 59, 24, 26, 35, 
45, 63, 7, 21, 37, 62, 64, 13, 37, 46, 48, 49, 3, 19, 46, 62, 
7, 50, 63, 65, 28, 30, 37, 49, 56, 2, 38, 42, 59, 64, 21, 30, 
47, 53, 56, 13, 31, 41, 59, 62, 8, 27, 49, 62, 65, 20, 21, 51, 
60, 31, 36, 39, 45, 62, 36, 37, 42, 45, 60, 18, 33, 52, 56, 59, 
43, 44, 51, 27, 43, 49, 53, 61, 30, 41, 47, 53, 62, 18, 49, 53, 
60, 64), number_parcel = c(1000, 1000, 1000, 1000, 1000, 2000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 2000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 2000, 1000, 1000, 1000, 2000, 2000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 2000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 2000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 2000, 2000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 2000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 2000, 2000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 
1000, 1000, 1000, 1000, 1000, 1000, 1000), mean_needed = c(15.9204, 
15.9204, 15.9204, 15.9204, 15.9204, 17.2471, 17.2471, 17.2471, 
17.2471, 17.2471, 17.2471, 17.2471, 17.2471, 17.2471, 19.9005, 
19.9005, 19.9005, 19.9005, 19.9005, 22.5539, 22.5539, 22.5539, 
22.5539, 22.5539, 22.5539, 22.5539, 22.5539, 22.5539, 22.5539, 
23.8806, 23.8806, 23.8806, 23.8806, 23.8806, 25.2073, 25.2073, 
25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 
25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 
25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 25.2073, 
25.2073, 26.534, 26.534, 26.534, 26.534, 27.8607, 27.8607, 27.8607, 
27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 
27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 27.8607, 
29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 
29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 
29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 29.1874, 30.5141, 
30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 
30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 
30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 30.5141, 
30.5141, 30.5141, 30.5141, 31.8408, 31.8408, 31.8408, 31.8408, 
31.8408, 31.8408, 31.8408, 31.8408, 31.8408, 31.8408, 33.1675, 
33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 
33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 
33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 33.1675, 
33.1675, 33.1675, 34.4942, 34.4942, 34.4942, 34.4942, 34.4942, 
35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 
35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 
35.8209, 35.8209, 35.8209, 35.8209, 35.8209, 37.1476, 37.1476, 
37.1476, 37.1476, 37.1476, 37.1476, 37.1476, 37.1476, 37.1476, 
37.1476, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 
38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 
38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 38.4743, 
38.4743, 38.4743, 38.4743, 39.801, 39.801, 39.801, 39.801, 39.801, 
41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 
41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 41.1277, 
41.1277, 42.4544, 42.4544, 42.4544, 42.4544, 42.4544, 42.4544, 
42.4544, 42.4544, 42.4544, 42.4544, 42.4544, 42.4544, 42.4544, 
42.4544, 43.7811, 43.7811, 43.7811, 43.7811, 43.7811, 43.7811, 
43.7811, 43.7811, 43.7811, 43.7811, 46.4345, 46.4345, 46.4345, 
46.4345, 46.4345, 46.4345, 46.4345, 46.4345, 46.4345, 46.4345, 
46.4345, 46.4345, 46.4345, 49.0879, 49.0879, 49.0879, 49.0879, 
49.0879)), row.names = c(NA, -277L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000000000321ef0>)

production_targets:

structure(list(rel_prod = c(0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 
0.8, 0.85, 0.9, 0.95, 1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 
1.4, 1.45, 1.5, 1.55, 1.6, 1.65, 1.7, 1.75, 1.8, 1.85, 1.9, 1.95, 
2), production_target = c(66335, 72968.5, 79602, 86235.5, 92869, 
99502.5, 106136, 112769.5, 119403, 126036.5, 132670, 139303.5, 
145937, 152570.5, 159204, 165837.5, 172471, 179104.5, 185738, 
192371.5, 199005, 205638.5, 212272, 218905.5, 225539, 232172.5, 
238806, 245439.5, 252073, 258706.5, 265340), mean_needed = c(13.267, 
14.5937, 15.9204, 17.2471, 18.5738, 19.9005, 21.2272, 22.5539, 
23.8806, 25.2073, 26.534, 27.8607, 29.1874, 30.5141, 31.8408, 
33.1675, 34.4942, 35.8209, 37.1476, 38.4743, 39.801, 41.1277, 
42.4544, 43.7811, 45.1078, 46.4345, 47.7612, 49.0879, 50.4146, 
51.7413, 53.068)), row.names = c(NA, -31L), class = "data.frame")

I was expecting the above code to add two new columns (rel_production and production_target) to my new dataframe, comb, with associated values. However, for some reason, my new dataframe has some areas where NAs are being shown instead of the expected values. This only seems to be happening for certain observations of mean_needed. But when I look at the values for mean_needed in the two dataframes I am trying to join, they look identical. There aren't any extra hidden spaces, and both columns are numeric. I include a print to show that mean_needed values seem to be the same in both columns. Is there a reason I am getting NAs instead of expected values? Thanks

comb dataframe, showing NAs for certain mean_values. This allocation of NA is repeated at other mean_values after the join

Values I am trying to left join by:

production_targets$mean_needed

13.2670 14.5937 15.9204 17.2471 18.5738 19.9005 21.2272 22.5539 23.8806 25.2073 26.5340 27.8607 29.1874 30.5141 31.8408 33.1675 34.4942 35.8209 37.1476 38.4743 39.8010 41.1277 42.4544 43.7811 45.1078 46.4345 47.7612 49.0879 50.4146 51.7413 53.0680

comb$mean.needed %>% unique()

14.5937 15.9204 17.2471 18.5738 19.9005 21.2272 22.5539 25.2073 26.5340 27.8607 29.1874 30.5141 31.8408 33.1675 34.4942 35.8209 37.1476 38.4743 39.8010 41.1277 42.4544 43.7811 45.1078 46.4345 47.7612 49.0879 27] 50.4146

Hello, and welcome! When sharing your dataset (`x`), please share the output of `dput(x)` rather than a screenshot of `View(x)`. This makes the example reproducible. — Greg, Jun 28 '21 at 17:52
Thanks! I will add the two dataframes I am trying to join in an edit — Gianluca Cerullo, Jun 28 '21 at 18:34
I think I have found the problem and solution. So by= is supposed to be a character vector. So in my code I needed to turn "mean_needed" into a character vector in both my dataframes before running left_join. The following code solved it: multi_scenario$mean_needed <- as.character(multi_scenario$mean_needed) production_targets$mean_needed <- as.character(production_targets$mean_needed) comb<- left_join(multi_scenario, production_targets, by = "mean_needed") — Gianluca Cerullo, Jun 28 '21 at 19:03
Hmmmm...I think that's not what the R [documentation](https://dplyr.tidyverse.org/reference/join.htm) means when it says _"`by` A character vector of variables to join by."_ It means that it expects the _names_ of the `join` columns to be specified in a character vector: `left_join(table_1, table_2, by = c("column_a", "column_b", "column_c")`. In your case, you have only one join field, so `by = c("mean_needed")` is equivalent to `by = "mean_needed"`, since both qualify as _"character vectors"_. — Greg, Jun 28 '21 at 19:11
While you are correct that `join` expects the join column `mean_needed` to have the same datatype in `multi_scenario` as it does in `production_targets`, that data type is **not required** to be of type `character`. I was unable to replicate your `NA` situation on my computer: RStudio threw an error for the element `.internal.selfref = ` in your `dput` output for `multi_scenario`; but as soon as I removed that element, the `left_join()` worked perfectly as expected. — Greg, Jun 28 '21 at 19:18

Why is left_join creating NAs when values seem to match in the by="x" argument?

0 Answers0