
I am trying to merge two relatively large datasets. I am merging by SiteID, which is a unique indicator of location, and by date/time, which is made up of Year, Month (Mo), Day, and Hour (Hr).

The problem is that the merge is dropping data somewhere. The minimum, maximum, mean, and median values all change, when they should be the same data, simply merged. I have converted the key columns to character and checked that the character strings match, yet I still lose data. I have tried left_join as well, but that doesn't seem to help. See below for more details.

EDIT: The merge is dropping data because data do not exist for every combination of ("SiteID", "Year", "Mo", "Day", "Hr"). So I needed to interpolate the missing values in dB before I could merge (see answer below). END EDIT

See the link at the bottom of the page to reproduce this example.

PC17$Mo<-as.character(PC17$Mo)
PC17$Year<-as.character(PC17$Year)
PC17$Day<-as.character(PC17$Day)
PC17$Hr<-as.character(PC17$Hr)
PC17$SiteID<-as.character(PC17$SiteID)

dB$Mo<-as.character(dB$Mo)
dB$Year<-as.character(dB$Year)
dB$Day<-as.character(dB$Day)
dB$Hr<-as.character(dB$Hr)
dB$SiteID<-as.character(dB$SiteID)

# confirm that data are stored as characters
str(PC17)
str(dB)

Now, to compare my SiteID values, I use unique to see what character strings I have, and setdiff to see whether R recognizes any as missing. One SiteID is missing from each dataset, but this is okay, because it is truly missing in the data (not a character-string issue).

sort(unique(PC17$SiteID))
sort(unique(dB$SiteID))

setdiff(PC17$SiteID, dB$SiteID)  ## TR2U is the only one missing, this is ok
setdiff(dB$SiteID, PC17$SiteID)  ## FI7D is the only one missing, this is ok
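
In hindsight, the individual key columns can each match while their combinations do not. A quick way to check the full key is dplyr's anti_join, which lists the PC17 rows that have no partner in dB (a sketch, using the same five key columns):

library(dplyr)

# rows of PC17 whose (SiteID, Year, Mo, Day, Hr) combination has no
# match in dB - these become NA rows in a left join or merge(all.x=TRUE)
anti_join(PC17, dB, by = c("SiteID", "Year", "Mo", "Day", "Hr")) %>%
  count(SiteID)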

Now, when I look at the data summarized by SiteID, it looks like a nice, full dataframe, meaning I have data for every site that I should have.

library(dplyr)
dB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        35.3      57.3       47.0      47.6
 2 CU1M        33.7      66.8       58.6      60.8
 3 CU1U        31.4      55.9       43.1      43.3
 4 CU2D        40        58.3       45.3      45.2
 5 CU2M        32.4      55.8       41.6      41.3
 6 CU2U        31.4      58.1       43.9      42.6
 7 CU3D        40.6      59.5       48.4      48.5
 8 CU3M        35.8      75.5       65.9      69.3
 9 CU3U        40.9      59.2       46.6      46.2
10 CU4D        36.6      49.1       43.6      43.4
# ... with 49 more rows

Here, I merge the two datasets PC17 and dB by "SiteID", "Year", "Mo", "Day", "Hr", keeping all PC17 rows even if they don't have dB values to go with them (all.x=TRUE).

However, when I look at the summary of this data, all of the SiteIDs now have different values, and some sites, such as "CU3D" and "CU4D", are missing completely.

PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day", "Hr"), all.x=TRUE))

PCdB %>% 
  group_by(SiteID) %>% 
  summarise(
    min_dBL50=min(dbAL050, na.rm=TRUE),
    max_dBL50=max(dbAL050, na.rm=TRUE),
    mean_dBL50=mean(dbAL050, na.rm=TRUE),
    med_dBL50=median(dbAL050, na.rm=TRUE)
  )

# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        47.2      54         52.3      54  
 2 CU1M        35.4      63         49.2      49.2
 3 CU1U        35.3      35.3       35.3      35.3
 4 CU2D        42.3      42.3       42.3      42.3
 5 CU2M        43.1      43.2       43.1      43.1
 6 CU2U        43.7      43.7       43.7      43.7
 7 CU3D       Inf      -Inf        NaN        NA  
 8 CU3M        44.1      71.2       57.6      57.6
 9 CU3U        45        45         45        45  
10 CU4D       Inf      -Inf        NaN        NA  
# ... with 49 more rows

I set everything to characters with as.character() in the first lines. Additionally, I have checked Year, Day, Mo, and Hr with setdiff and unique, just as I did above with SiteID, and there don't appear to be any mismatches in those character strings.

I have also tried dplyr's left_join to merge the datasets, and it hasn't made a difference.
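
For reference, the left_join call I tried is the dplyr equivalent of the merge above (a sketch, using the same key columns):

library(dplyr)

# keeps all PC17 rows, filling unmatched dB columns with NA -
# the same behaviour as merge(..., all.x=TRUE)
PCdB <- left_join(PC17, dB, by = c("SiteID", "Year", "Mo", "Day", "Hr"))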

Dylan_Gomes
    Are you sure you don't just have a couple of `NA` values that are making your `summarise()` calculations return `NA`? You haven't specified `min(dbAL050, na.rm=TRUE)` for instance. Any calculation involving an `NA` returns `NA` - e.g.: `min(c(1,10,NA))` – thelatemail Nov 28 '18 at 04:02
  • You say there are rows in each table unmatched in the other. So PCdB has NAs from the outer join summarizing to NAs. The "minimal" in [mcve] includes if possible using known input, intermediate & result data. Absolute basics of debugging say: Show that your program does what you expect as it goes through (sub)expressions by saying what that is & showing that it actually does it via incremental output. On adding problem code that you can't fix, research (the manual & the web). Repeat, minimizing working & wrong code. Then ask re the (small) difference between working & non-working examples. – philipxy Nov 28 '18 at 08:27
  • Thanks for comments. `na.rm=TRUE` does help get rid of a lot of the NAs, but there is still missing data somewhere. The minimums aren't as low and the maximums not as high after the merge. @thelatemail - I have added your suggestion in the question above. – Dylan_Gomes Nov 28 '18 at 16:27

2 Answers


This is probably solved by using na.rm = TRUE in your summarising functions...

A data.table approach:

library( data.table )

dt.PC17 <- fread( "./PC_SO.csv" )
dt.dB <- fread( "./dB.csv" )

#data.table left join on "SiteID", "Year", "Mo", "Day", "Hr"
dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ) ]
#summarise, and order by SiteID
result <- setorder( dt.PCdB[, list(min_dBL50  = min( dbAL050, na.rm = TRUE ),
                                   max_dBL50  = max( dbAL050, na.rm = TRUE ),
                                   mean_dBL50 = mean( dbAL050, na.rm = TRUE ),
                                   med_dBL50  = median( dbAL050, na.rm = TRUE ) 
                                   ), 
                            by = "SiteID" ], 
                    SiteID)

head( result, 10 )
#     SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
#  1:   CU1D      47.2      54.0     52.300     54.00
#  2:   CU1M      35.4      63.0     49.200     49.20
#  3:   CU1U      35.3      35.3     35.300     35.30
#  4:   CU2D      42.3      42.3     42.300     42.30
#  5:   CU2M      43.1      43.2     43.125     43.10
#  6:   CU2U      43.7      43.7     43.700     43.70
#  7:   CU3D       Inf      -Inf        NaN        NA
#  8:   CU3M      44.1      71.2     57.650     57.65
#  9:   CU3U      45.0      45.0     45.000     45.00
# 10:   CU4D       Inf      -Inf        NaN        NA

If you would like to perform a left join but exclude rows that cannot be matched (so you do not get rows like the "CU3D" one above), use:

dt.PCdB <- dt.dB[ dt.PC17, on = .( SiteID, Year, Mo, Day, Hr ), nomatch = 0L ]

this will result in:

#     SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
#  1:   CU1D      47.2      54.0     52.300     54.00
#  2:   CU1M      35.4      63.0     49.200     49.20
#  3:   CU1U      35.3      35.3     35.300     35.30
#  4:   CU2D      42.3      42.3     42.300     42.30
#  5:   CU2M      43.1      43.2     43.125     43.10
#  6:   CU2U      43.7      43.7     43.700     43.70
#  7:   CU3M      44.1      71.2     57.650     57.65
#  8:   CU3U      45.0      45.0     45.000     45.00
#  9:   CU4M      52.4      55.9     54.150     54.15
# 10:   CU4U      51.3      51.3     51.300     51.30
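
(Side note: in more recent versions of data.table, nomatch = NULL is the preferred spelling for the same drop-the-unmatched behaviour as nomatch = 0L.)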
Wimpel
  • Thank you for your answer. The na.rm at least allows me to see what data does exist once NAs are removed, but the data aren't the same as before the merge, whereas they should be the exact same values, right? This also doesn't really solve the issue because data for CU3D exists (see table in question from before the merge). – Dylan_Gomes Nov 28 '18 at 16:08

In the end, I answered this question with a better understanding of the data. The merge function itself was not dropping any values; it was doing exactly what it was told. However, since the datasets were merged by SiteID, Year, Mo, Day, Hr, the result contained Inf, NaN, and NA values for a few SiteIDs.

The reason is that dB is not a fully continuous dataset to merge with: data did not exist for every combination of the merge variables (SiteID, Year, Mo, Day, Hr), so those combinations returned Inf, NaN, and NA values.

So I solved this problem with interpolation. That is, I filled in the missing values based on the values from dates on either side of them. The package imputeTS was valuable here.

So I first interpolated the missing values in between the dates with data, and then I re-merged the datasets.

library(imputeTS)
library(tidyverse)

### We want to interpolate the dB values by SiteID within the dB dataset, BEFORE merging.
### Why? Because the merge drops all the data that would help with the interpolation!!

dB<-read.csv("dB.csv")

dB_clean <- dB %>%
  mutate_if(is.integer, as.character)

# Create a wide table with a column for each jDay; missing
#   site/day combinations will show up as NA's.
# jDay is an integer 'julian' date, so the NA's in these columns
#   represent missing jDays that we need to add.
dB_NA_find <- dB_clean %>%
  count(SiteID, jDay) %>%
  spread(jDay, n)

dB_NA_find
# A tibble: 59 x 88
#    SiteID `13633` `13634` `13635` `13636` `13637` `13638` `13639` `13640`
#    <fct>    <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
#  1 CU1D        NA      NA      NA      NA      NA      NA      NA      NA
#  2 CU1M        NA      11      24      24      24      24      24      24
#  3 CU1U        NA      11      24      24      24      24      24      24
#  4 CU2D        NA      NA      NA      NA      NA      NA      NA      NA
#  5 CU2M        NA       9      24      24      24      24      24      24
#  6 CU2U        NA       9      24      24      24      24      21      NA
#  7 CU3D        NA      NA      NA      NA      NA      NA      NA      NA
#  8 CU3M        NA      NA      NA      NA      NA      NA      NA      NA
#  9 CU3U        NA      NA      NA      NA      NA      NA      NA      NA
# 10 CU4D        NA      NA      NA      NA      NA      NA      NA      NA
# ... with 49 more rows, and 79 more variables


# Take the NA day entries and build the desired row for each
dB_rows_to_add <- dB_NA_find %>%
  gather(jDay, count, 2:88) %>%
  filter(is.na(count)) %>%
  select(-count)

# Add these rows to the original, remove the NA jDay rows
#   (those days are now covered by the added rows), and sort
dB <- dB_clean %>%
  bind_rows(dB_rows_to_add) %>%
  filter(jDay != "NA") %>%
  arrange(SiteID, jDay)
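
An aside: the count/spread/gather/bind_rows steps above can probably be collapsed into a single tidyr::complete() call, which adds the missing SiteID x jDay combinations directly (a sketch; tidyr is loaded via tidyverse above):

# expand dB_clean so that every SiteID x jDay combination exists,
# filling the added rows with NA (to be interpolated below)
dB <- dB_clean %>%
  complete(SiteID, jDay) %>%
  arrange(SiteID, jDay)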


sum(is.na(dB$DailyL50.x)) ## How many NAs do I have?
# [1] 3030

## Here is where we do the na.interpolation with package imputeTS
# interpolate DailyL50.x within each site, in jDay order, collecting
# the per-site results in a list (rbind-ing onto a dummy character row,
# as I first did, coerces the numeric columns to character)
sites<-unique(dB$SiteID)
D<-vector("list", length(sites))

for(i in seq_along(sites)){
  temp<-dB[dB$SiteID==sites[i], ]
  temp<-temp[order(temp$jDay),]
  temp$DayL50<-na.interpolation(temp$DailyL50.x, option="spline")
  D[[i]]<-temp
}

# combine the per-site pieces back into one dataframe
dBN<-do.call(rbind, D)

sum(is.na(dBN$DayL50)) ## How many NAs do I have?
# [1] 0
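
(Note: in imputeTS version 3.0 and later, na.interpolation() has been renamed na_interpolation(); the dotted name was current when this was written.)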

Because I did the above interpolation of NAs based on jDay, I am missing the Month (Mo), Day, and Year information for those rows.

dBN$Year<-"2017"  #all data are from 2017

##I could not figure out how jDay was formatted, so I created a manual 'key' 
##to get Mo and Day by counting from a known date/jDay pair in original data

#Example:
# 13635 is Mo=5 Day=1
# 13665 is Mo=5 Day=31
# 13666 is Mo=6 Day=1
# 13695 is Mo=6 Day=30

key4<-data.frame("jDay"=c(13633:13634), "Day"=c(29:30), "Mo"=4)
key5<-data.frame("jDay"=c(13635:13665), "Day"=c(1:31), "Mo"=5)
key6<-data.frame("jDay"=c(13666:13695), "Day"=c(1:30), "Mo"=6)
key7<-data.frame("jDay"=c(13696:13719), "Day"=c(1:24), "Mo"=7)

#make master 'key'
key<-rbind(key4,key5,key6,key7)
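
As an aside, jDay appears to be a plain day count from a fixed origin: working backwards from the known pairs, as.Date(13635, origin="1980-01-01") gives "2017-05-01". If that holds, the manual key could be replaced with date arithmetic (a sketch, assuming jDay is days since 1980-01-01):

# recover calendar dates from jDay, then extract Mo and Day
jDate<-as.Date(as.numeric(dBN$jDay), origin="1980-01-01")
dBN$Mo<-as.numeric(format(jDate, "%m"))
dBN$Day<-as.numeric(format(jDate, "%d"))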

# Merge 'key' with dataset so all rows now have 'Mo' and 'Day' values
dBM<-merge(dBN, key, by="jDay", all.x=TRUE)

#clean unnecessary columns and rename 'Mo' and 'Day' so they match the PC17 dataset
dBM<-dBM[ , -c(2,3,6:16)]
colnames(dBM)[5:6]<-c("Day","Mo")

# I noticed an issue with duplication: the merge with PC17 created a massive dataframe.
dBM %>%  ### more than 1 observation per day will multiply the merge out of control
  count(SiteID, jDay, DayL50) %>% 
  summarise(
    min=min(n, na.rm=TRUE),
    mean=mean(n, na.rm=TRUE),
    max=max(n, na.rm=TRUE)
  )

## to fix this I only kept distinct observations so that each day has 1 observation
dB<-distinct(dBM, .keep_all = TRUE)
### Now re-run the count check above; there should be exactly 1 observation per day
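
If any non-key columns still differ within a site/day, plain distinct() will keep those near-duplicate rows; restricting the comparison to the key columns forces one row per day (a sketch):

dB<-distinct(dBM, SiteID, jDay, .keep_all = TRUE)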

Now, when you do the merge of dB and PC17, the interpolated values (which were NAs before) are included. It will look something like this:

> PCdB<-(merge(PC17, dB, by=c("SiteID", "Year","Mo","Day"), all.x=TRUE, all=FALSE,no.dups=TRUE))
> ### all.x=TRUE is important. This keeps all PC17 data, even rows that don't have corresponding dB data.

> library(dplyr)

#Here is the NA interpolated 'dB' dataset 
> dB %>% 
+   group_by(SiteID) %>% 
+   dplyr::summarise(
+     min_dBL50=min(DayL50, na.rm=TRUE),
+     max_dBL50=max(DayL50, na.rm=TRUE),
+     mean_dBL50=mean(DayL50, na.rm=TRUE),
+     med_dBL50=median(DayL50, na.rm=TRUE)
+   )
# A tibble: 59 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        44.7      53.1       49.4      50.2
 2 CU1M        37.6      65.2       59.5      62.6
 3 CU1U        35.5      51         43.7      44.8
 4 CU2D        42        52         47.8      49.3
 5 CU2M        38.2      49         43.1      42.9
 6 CU2U        34.1      53.7       46.5      47  
 7 CU3D        46.1      53.3       49.7      49.4
 8 CU3M        44.5      73.5       61.9      68.2
 9 CU3U        42        52.6       47.0      46.8
10 CU4D        42        45.3       44.0      44.6
# ... with 49 more rows

# Now here is the PCdB merged dataset, and we are no longer missing values!
> PCdB %>% 
+   group_by(SiteID) %>% 
+   dplyr::summarise(
+     min_dBL50=min(DayL50, na.rm=TRUE),
+     max_dBL50=max(DayL50, na.rm=TRUE),
+     mean_dBL50=mean(DayL50, na.rm=TRUE),
+     med_dBL50=median(DayL50, na.rm=TRUE)
+   )
# A tibble: 60 x 5
   SiteID min_dBL50 max_dBL50 mean_dBL50 med_dBL50
   <chr>      <dbl>     <dbl>      <dbl>     <dbl>
 1 CU1D        44.8      50         46.8      47  
 2 CU1M        59        63.9       62.3      62.9
 3 CU1U        37.9      46         43.6      44.4
 4 CU2D        42.1      51.6       45.6      44.3
 5 CU2M        38.4      48.3       44.2      45.5
 6 CU2U        39.8      50.7       45.7      46.4
 7 CU3D        46.5      49.5       47.7      47.7
 8 CU3M        67.7      71.2       69.5      69.4
 9 CU3U        43.3      52.6       48.1      48.2
10 CU4D        43.2      45.3       44.4      44.9
# ... with 50 more rows
Dylan_Gomes