Merged differently sized dataframes but column returns 'NA'

Question

I have two dataframes I need to merge together. The first one is called "rain_year" and the second is called "refugee." rain_year (df1) has a column called "sum_precip". Refugee (df2) has a column called "refugees".

Both dataframes have columns "Year" and "origin" in common and this is what I need to match on. df1 is more than twice the size of df2.

Here is an example of df1:

Year origin sum_precip

1951 Afghanistan 315

1952 Afghanistan 300

1953 Afghanistan 280

and so on and so forth for all countries going back to 1951.

Here is an example of df2:

Year origin refugees

1951 Afghanistan 500

1952 Afghanistan 400

1953 Afghanistan 350

and so on for all countries going back to 1951.

The two dataframes differ in size because not all countries in df2 produced refugees in all years from 1951-present. As such, I expect there to be a considerable amount of 'NA' when merging either df1 to df2 or vice versa.

When I run this code:

refugee_rain <- join(refugee, rain_year)

it creates a new dataframe with a vector of "NA" in sum_precip. In other words, the correct column is added, but all the values are missing.

When I run this code:

refugee_rain <- join(refugee, rain_year)

the same thing happens.

When I reverse the order of the dataframes, the column of "refugees" does the same thing. That is, there is a column labeled "refugees," but it is a vector of "NA" instead of numerical values.

I have searched high and low for a solution for hours. I have tried several variations, including "all=T". This does not work. Please help.

This is the closest I can find that someone else has a similar issue. However, the suggested solution did not work for me.

At the suggestion of others, I have run the dput operation which returned the following:

> dput(head(refugee, 5))
structure(list(Year = 1979:1983, origin = c("Afghanistan", "Afghanistan", 
"Afghanistan", "Afghanistan", "Afghanistan"), ISO = c("", "", 
"", "", ""), asylum = c("-", "-", "-", "-", "-"), asylum_ISO = c("-", 
"-", "-", "-", "-"), refugees = c(2116011L, 1952928L, 1847304L, 
1749628L, 1717966L), asylum_seekers = c(0L, 0L, 0L, 0L, 0L), 
    IDP = c(0L, 0L, 0L, 0L, 0L), Ven_displaced = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), class = c("data.table", 
"data.frame"), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x000001ff790e1ef0>)

> dput(head(rain_year, 5))
structure(list(Year = c(1951L, 1951L, 1951L, 1951L, 1951L), origin = c(" Afghanistan", 
" Albania", " Algeria", " Andorra", " Angola"), sum_precip = c(314.98117, 
1077.5157, 99.31164, 1012.9107, 1181.94411)), row.names = c(NA, 
-5L), groups = structure(list(Year = 1951:2016, .rows = structure(list(
    1:195, 196:390, 391:585, 586:780, 781:975, 976:1170, 1171:1365, 
    1366:1560, 1561:1755, 1756:1950, 1951:2145, 2146:2340, 2341:2535, 
    2536:2730, 2731:2925, 2926:3120, 3121:3315, 3316:3510, 3511:3705, 
    3706:3900, 3901:4095, 4096:4290, 4291:4485, 4486:4680, 4681:4875, 
    4876:5070, 5071:5265, 5266:5460, 5461:5655, 5656:5850, 5851:6045, 
    6046:6240, 6241:6435, 6436:6630, 6631:6825, 6826:7020, 7021:7215, 
    7216:7410, 7411:7605, 7606:7800, 7801:7995, 7996:8190, 8191:8385, 
    8386:8580, 8581:8775, 8776:8970, 8971:9165, 9166:9360, 9361:9555, 
    9556:9750, 9751:9945, 9946:10140, 10141:10335, 10336:10530, 
    10531:10725, 10726:10920, 10921:11115, 11116:11310, 11311:11505, 
    11506:11700, 11701:11895, 11896:12090, 12091:12285, 12286:12480, 
    12481:12675, 12676:12870), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), row.names = c(NA, -66L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x000001ff790e1ef0>)

I am not sure what I am looking at and apologize if that is not the correct format to post this.

does `library(data.table); setDT(refugee); setDT(rain_year); refugee_rain = merge(refugee, rain_year, by = c("Year", "origin"))` work for you? — PavoDive, May 01 '21 at 21:49
Hi @PavoDive ~ Thank you for your response. This code returned a dataframe with all the correct variables, however no data is filled in. "0 observations of 10 variables." — Dylan, May 01 '21 at 21:55
I'll suggest you check the structure of both data frames with `str(refugee)` and `str(rain_year`. The variables used for the merging should be of the same class in both objects. Maybe one is character in one df and numeric in the other one? — PavoDive, May 01 '21 at 21:58
Edit your post and instead of "this is an example of df2" paste the result to `dput(head(df2, 5))` (for both df1 and df2) — PavoDive, May 01 '21 at 21:59
RE: structure ~ the str code reveals they are the same structure in both dataframes (int for 'Year' and chr for 'origin'). — Dylan, May 01 '21 at 22:02
Use `dput` so we can help. With the available data is not possible — PavoDive, May 01 '21 at 22:04
your `rain_year` variable is grouped. It seems your group_by it somewhere and forgot to ungroup — PavoDive, May 01 '21 at 22:16
Yes, the rain variable was initially logged as precipitation by month but I needed it by year. I grouped by year and country so I could sum the yearly total for each country. How do I ungroup? — Dylan, May 01 '21 at 22:19

score 1 · Accepted Answer · answered May 01 '21 at 22:25

1

This one is difficult to spot:

See that in refugee the name of the Country is "Afghanistan", while in rain_year it is " Afghanistan": There's a space before the A.

You can solve that like this:

library(stringi)
rain_year[, origin := stri_trim_both(origin)]

And then performing the merge:

refuggee_rainYear <- merge(refugee, rain_year, by = c("Year", "origin"))

answered May 01 '21 at 22:25

PavoDive

6,322
2
29
55

FWIW if you convert to tibble the leading space will be visible in the output – moodymudskipper Sep 27 '22 at 12:52

Merged differently sized dataframes but column returns 'NA'

1 Answers1