Working with the rnoaa package to add US station IDs to a df of weather events. Below is str() for the rain df.
Google Drive link to a CSV file of a subset.
'data.frame': 4395 obs. of 63 variables:
$ YEAR : int 2009 2009 2012 2013 2013 2015 2007 2007 2007
$ msa_code : int 29180 29180 29180 12260 12260 12260 23540 23540
$ zip : int 22001 22001 22001 45003 45003 45003 12001 12001
$ state : chr "LA" "LA" "LA" "SC" ...
$ gdp : int 23495 23495 27346 20856 20856 22313 10119 10119
$ EVENT_TYPE : chr "Heavy Rain" "Heavy Rain" "Heavy Rain" "Heavy
$ WFO : chr "LCH" "LCH" "LCH" "CAE" ...
$ latitude : num 30.4 30.2 30.2 33.4 33.5 ...
$ longitude : num -92.4 -92.4 -92.2 -81.6 -81.9 ...
$ SUM_DAMAGES : num 0 0 0 0 0 0 0 0 0 0 ...
Omitting a bunch of variables that aren't relevant to this, here is a snippet of the rain df:
X CZ_NAME YEAR full state name msa_code msa_name.x zip
49 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
60 ACADIA 2009 LOUISIANA 29180 Lafayette, LA 22001
91 ACADIA 2012 LOUISIANA 29180 Lafayette, LA 22001
761 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
770 AIKEN 2013 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
809 AIKEN 2015 SOUTH CAROLINA 12260 Augusta-Richmond County, GA-SC 45003
longitude latitude
-92.4200 30.4300
-92.3700 30.2200
-92.2484 30.2354
-81.6400 33.4361
-81.8800 33.5400
-81.7000 33.5300
Here is a snippet of the ghcnd_stations() tibble; the rnoaa documentation recommends assigning the result to an object so the function doesn't have to be called each time.
# A tibble: 6 × 11
id latitude longitude elevation state name
<chr> <dbl> <dbl> <dbl> <chr> <chr>
1 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
2 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
3 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
4 US009052008 43.7333 -96.6333 482 SD SIOUX FALLS (ENVIRON. CANADA)
5 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
6 US10adam001 40.5680 -98.5069 598 NE JUNIATA 1.5 S
# ... with 5 more variables: gsn_flag <chr>, wmo_id <chr>, element <chr>,
# first_year <int>, last_year <int>
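For reference, the assignment is just this (a minimal sketch along the lines of the rnoaa docs):

library(rnoaa)

# slow call -- assign once and reuse rather than calling ghcnd_stations() repeatedly
stations <- ghcnd_stations()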
So far I've used ghcnd_stations() to pull the list of stations, removed the non-CONUS stations, and then used fuzzyjoin::geo_inner_join on the stations' lat/lon coordinates to compare the two lists and merge in the nearby stations.
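Roughly, the station prep looks like the following (a sketch rather than my exact code; the state-code filter is just one way to drop non-CONUS stations, and distinct() collapses the one-row-per-element output down to one row per station):

library(dplyr)

conus <- c(setdiff(state.abb, c("AK", "HI")), "DC")  # lower 48 plus DC

stations <- stations %>%
  filter(substr(id, 1, 2) == "US", state %in% conus) %>%  # drop non-CONUS stations
  distinct(id, latitude, longitude, .keep_all = TRUE)     # one row per station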
library(fuzzyjoin)
subset <- head(rain)  # test on a small subset first
subset_join <- geo_inner_join(subset, stations, by = c("latitude", "longitude"), max_dist = 5)
I took a subset of my data and tried this, and it works, but when I run the same code on the entire dataset I hit memory allocation errors:
Error: cannot allocate vector of size 2.9 Gb
In addition: Warning messages:
1: In fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, :
Reached total allocation of 8017Mb: see help(memory.size)
I've tried using memory.size = 9000 and tried to read up on increasing the memory limit, but I'm still receiving the error. memory.size(max = TRUE) returns this:
> memory.size(max = TRUE)
[1] 7013
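In case I'm simply using the wrong call, my understanding from reading up is that memory.limit() is the Windows function for raising the cap, along these lines (a sketch; size is in MB):

memory.limit(size = 9000)  # request a ~9 GB allocation limit (Windows only)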
Is there a more efficient way to do this, or am I going to have to slice up my df, run the code, and then rbind it back together?
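(By slicing I mean something along the lines of the sketch below, with a hypothetical chunk size; I haven't settled on this, which is part of why I'm asking.)

library(fuzzyjoin)

chunk_size <- 500  # hypothetical; would need tuning
chunks <- split(rain, ceiling(seq_len(nrow(rain)) / chunk_size))

# run the join on each chunk separately, then stack the results back together
results <- lapply(chunks, function(chunk) {
  geo_inner_join(chunk, stations, by = c("latitude", "longitude"), max_dist = 5)
})
rain_joined <- do.call(rbind, results)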
Just for context, here is Sys.info():
Sys.info()
sysname release version nodename
"Windows" ">= 8 x64" "build 9200" "DESKTOP-G88LPOJ"
machine login user effective_user
"x86-64" "franc" "franc" "franc"
First question! Let me know if I haven't included anything relevant. Thanks!