I am trying to merge two relatively large datasets. I am merging by SiteID - which is a unique indicator of location, and date/time, which are comprised of Year, Month=Mo, Day, and Hour=Hr.
The problem is that the merge
is dropping data somewhere. Minimum, Maximum, Mean, and Median values all change, when they should be the same data, simply merged. I have made the data into characters and checked that the character strings match, yet I still lose data. I have tried left_join
as well, but that doesn't seem to help. See below for more details.
EDIT: Merge is dropping data because data do not exist for every ("SiteID", "Year","Mo","Day", "Hr")
. So, I needed to interpolate missing values from dB
before I could merge (see answer below).
END EDIT
see link at the bottom of the page to reproduce this example.
PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)
dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)
# confirm that data are stored as characters
str(PC17)
str(dB)
Now to compare my SiteID
values, I use unique to see what character strings I have, and setdiff
to see if R recognizes any as missing. One siteID is missing from each, but this is okay, because it is truly missing in the data (not a character string issue).
sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))
setdiff(PC17$SiteID, dB$SiteID) ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID) ## FI7D is the only one missing, this is ok
Now when I look at the data (summarize by SiteID), it looks like a nice, full dataframe - meaning I have data for every site that I should have.
library(dplyr)
dB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 35.3 57.3 47.0 47.6
2 CU1M 33.7 66.8 58.6 60.8
3 CU1U 31.4 55.9 43.1 43.3
4 CU2D 40 58.3 45.3 45.2
5 CU2M 32.4 55.8 41.6 41.3
6 CU2U 31.4 58.1 43.9 42.6
7 CU3D 40.6 59.5 48.4 48.5
8 CU3M 35.8 75.5 65.9 69.3
9 CU3U 40.9 59.2 46.6 46.2
10 CU4D 36.6 49.1 43.6 43.4
# ... with 49 more rows
Here, I merge the two data sets PC17 and dB by "SiteID", "Year","Mo","Day", "Hr"
- keeping all PC17 values (even if they don't have dB values to go with it; all.x=TRUE
).
However, when I look at the summary of this data, now all of the SiteID
have different values, and some sites are missing completely such as "CU3D" and "CU4D".
PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))
PCdB %>%
group_by(SiteID) %>%
summarise(
min_dBL50=min(dbAL050, na.rm=TRUE),
max_dBL50=max(dbAL050, na.rm=TRUE),
mean_dBL50=mean(dbAL050, na.rm=TRUE),
med_dBL50=median(dbAL050, na.rm=TRUE)
)
# A tibble: 59 x 5
SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
<chr> <dbl> <dbl> <dbl> <dbl>
1 CU1D 47.2 54 52.3 54
2 CU1M 35.4 63 49.2 49.2
3 CU1U 35.3 35.3 35.3 35.3
4 CU2D 42.3 42.3 42.3 42.3
5 CU2M 43.1 43.2 43.1 43.1
6 CU2U 43.7 43.7 43.7 43.7
7 CU3D Inf -Inf NaN NA
8 CU3M 44.1 71.2 57.6 57.6
9 CU3U 45 45 45 45
10 CU4D Inf -Inf NaN NA
# ... with 49 more rows
I set everything to characters with as.character()
in the first lines. Additionally, I have checked Year
, Day
, Mo
, and Hr
with setdiff
and unique
just as I did above with SiteID
, and there don't appear to be any issues with those character strings not matching.
I have also tried dplyr
function left_join
to merge the datasets, and it hasn't made a difference.