
I'm working with the rnoaa package to add US station IDs to a df of weather events. Below is the str() output for the rain df.

google drive link to csv file of subset

'data.frame':   4395 obs. of  63 variables:
 $ YEAR               : int  2009 2009 2012 2013 2013 2015 2007 2007 2007  
 $ msa_code           : int  29180 29180 29180 12260 12260 12260 23540 23540  
 $ zip                : int  22001 22001 22001 45003 45003 45003 12001 12001 
 $ state              : chr  "LA" "LA" "LA" "SC" ...
 $ gdp                : int  23495 23495 27346 20856 20856 22313 10119 10119 
 $ EVENT_TYPE         : chr  "Heavy Rain" "Heavy Rain" "Heavy Rain" "Heavy 
 $ WFO                : chr  "LCH" "LCH" "LCH" "CAE" ...
 $ latitude           : num  30.4 30.2 30.2 33.4 33.5 ...
 $ longitude          : num  -92.4 -92.4 -92.2 -81.6 -81.9 ...
 $ SUM_DAMAGES        : num  0 0 0 0 0 0 0 0 0 0 ...

Omitting a bunch of variables that aren't relevant to this, here is a snippet of the rain df

X CZ_NAME YEAR full state name msa_code msa_name.x    zip
49  ACADIA 2009 LOUISIANA      29180    Lafayette, LA 22001
60  ACADIA 2009 LOUISIANA      29180    Lafayette, LA 22001
91  ACADIA 2012 LOUISIANA      29180    Lafayette, LA 22001
761 AIKEN  2013 SOUTH CAROLINA 12260    Augusta-Richmond County, GA-SC 45003
770 AIKEN  2013 SOUTH CAROLINA 12260    Augusta-Richmond County, GA-SC 45003
809 AIKEN  2015 SOUTH CAROLINA 12260    Augusta-Richmond County, GA-SC 45003
latitude longitude
-92.4200 30.4300 
-92.3700 30.2200 
-92.2484 30.2354 
-81.6400 33.4361 
-81.8800 33.5400
-81.7000 33.5300

Here is a snippet of the ghcnd_stations() tibble, which the rnoaa documentation recommends assigning to a variable so the function doesn't have to be called each time.

# A tibble: 6 × 11
       id latitude longitude elevation state                          name
    <chr>    <dbl>     <dbl>     <dbl> <chr>                         <chr>
1 US009052008  43.7333  -96.6333       482    SD SIOUX FALLS (ENVIRON. CANADA)
2 US009052008  43.7333  -96.6333       482    SD SIOUX FALLS (ENVIRON. CANADA)
3 US009052008  43.7333  -96.6333       482    SD SIOUX FALLS (ENVIRON. CANADA)
4 US009052008  43.7333  -96.6333       482    SD SIOUX FALLS (ENVIRON. CANADA)
5 US10adam001  40.5680  -98.5069       598    NE                 JUNIATA 1.5 S
6 US10adam001  40.5680  -98.5069       598    NE                 JUNIATA 1.5 S
# ... with 5 more variables: gsn_flag <chr>, wmo_id <chr>, element <chr>,
#   first_year <int>, last_year <int> 
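
One thing the printout suggests: each station id appears on several rows (apparently one per `element`), so joining against the full tibble multiplies the number of candidate pairs. A minimal sketch of collapsing it to one row per station first, using a toy stand-in for the tibble (coordinates copied from the printout; the `element` values are made up for illustration):

```r
# Toy stand-in for the ghcnd_stations() tibble. The element values are
# illustrative assumptions, not taken from the real data.
stations <- data.frame(
  id        = c("US009052008", "US009052008", "US10adam001", "US10adam001"),
  latitude  = c(43.7333, 43.7333, 40.5680, 40.5680),
  longitude = c(-96.6333, -96.6333, -98.5069, -98.5069),
  element   = c("TMAX", "TMIN", "PRCP", "SNOW"),
  stringsAsFactors = FALSE
)

# Keep the first row for each station id, and only the columns a
# coordinate-based join needs.
stations_unique <- stations[!duplicated(stations$id),
                            c("id", "latitude", "longitude")]

nrow(stations)         # 4
nrow(stations_unique)  # 2
```

Whether this alone is enough to fit the full join in memory is untested; it only shrinks the right-hand table.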

So far I've been able to use ghcnd_stations() to pull the list of stations and remove the non-CONUS ones. I then use fuzzyjoin::geo_inner_join to compare the lat/lon coordinates of the two tables and merge in the closest stations.

subset <- head(rain)
subset_join <- geo_inner_join(subset, stations, by = c("latitude", "longitude"), max_dist = 5)

This works on a subset of my data, but when I run the same code on the entire dataset I hit memory allocation errors:

Error: cannot allocate vector of size 2.9 Gb
In addition: Warning messages:
1: In fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun,  :
  Reached total allocation of 8017Mb: see help(memory.size) 

I've tried using memory.size = 9000 and read up on raising the memory limit, but I still get the error. memory.size(max = TRUE) returns this:

> memory.size(max = TRUE)
[1] 7013

Is there a more efficient way to do this, or am I going to have to slice up my df, run the code, and then rbind it back together?
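
For illustration, here is a rough sketch of the one-row-at-a-time idea in base R. It is not what geo_inner_join does: it keeps only the single nearest station per event (or NA if none is within `max_dist` miles) instead of every station in range. The function names, the toy station IDs, and the Earth radius constant are assumptions for the example, and it treats latitude as roughly 30 and longitude as roughly -92, as in the str() output.

```r
# Haversine great-circle distance in miles from one point to a vector of points.
# The radius (3958.8 miles) is an approximate mean Earth radius.
haversine_miles <- function(lat1, lon1, lat2, lon2, radius = 3958.8) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# For each event row, compute distances to all stations and keep the index of
# the nearest one. Only one vector of length nrow(stations) exists at a time.
nearest_station <- function(events, stations, max_dist = 5) {
  best <- vapply(seq_len(nrow(events)), function(i) {
    d <- haversine_miles(events$latitude[i], events$longitude[i],
                         stations$latitude, stations$longitude)
    j <- which.min(d)
    if (d[j] <= max_dist) j else NA_integer_
  }, integer(1))
  events$station_id <- stations$id[best]
  events
}

# Toy data: event coordinates follow the question; station IDs are hypothetical.
events <- data.frame(latitude  = c(30.43, 33.4361),
                     longitude = c(-92.42, -81.64))
stations <- data.frame(id        = c("STN_A", "STN_B"),
                       latitude  = c(30.45, 43.7333),
                       longitude = c(-92.40, -96.6333),
                       stringsAsFactors = FALSE)

result <- nearest_station(events, stations)
result$station_id  # "STN_A" NA
```

This trades memory for time: it loops over all 4395 events, but never builds the full events-by-stations pair table.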

Just for context, here is the output of Sys.info():

Sys.info()
      sysname           release           version          nodename 
    "Windows"        ">= 8 x64"      "build 9200" "DESKTOP-G88LPOJ" 
      machine             login              user    effective_user 
     "x86-64"           "franc"           "franc"           "franc" 

First question! Let me know if I haven't included anything relevant. Thanks!

Francisco
  • Author of `rnoaa`, but it's not an `rnoaa` problem. Can't reproduce since the data.frame is not avail. Not familiar with the `fuzzyjoin` pkg, looks like it internally uses `geosphere`, and you can pass on args to its fxns, maybe toggle something in those fxns – sckott Feb 03 '17 at 16:42
  • @sckott it isn't an rnoaa problem. I'm just looking for ideas on how to implement more efficiently. I will edit to add a gdrive link to a small csv. To be honest I just don't know enough about how R works. Maybe using some form of `apply` and `cbind`-ing the station ids to just do one row at a time? – Francisco Feb 03 '17 at 18:20
  • still not reproducible with the file, could you edit above so that it is – sckott Feb 04 '17 at 00:43
