I'm trying to download NOAA data using the rnoaa package and I'm running into a bit of trouble.

I took a vector from a dataframe and it looks like this:

df <- dataframe$ghcnd  ## Grabbing necessary column

This gives me an output like:

[1] "GHCND:US1AKAB0058" "GHCND:US1AKAB0015" "GHCND:US1AKAB0021" "GHCND:US1AKAB0061"
 [5] "GHCND:US1AKAB0055" "GHCND:US1AKAB0038" "GHCND:US1AKAB0051" "GHCND:US1AKAB0052"
 [9] "GHCND:US1AKAB0060" "GHCND:US1AKAB0065" "GHCND:US1AKAB0062" "GHCND:US1AKFN0016"
[13] "GHCND:US1AKFN0018" "GHCND:US1AKFN0015" "GHCND:US1AKFN0011" "GHCND:US1AKFN0013"
[17] "GHCND:US1AKFN0030" "GHCND:US1AKJB0011" "GHCND:US1AKJB0014" "GHCND:US1AKKP0005"
[21] "GHCND:US1AKMS0011" "GHCND:US1AKMS0019" "GHCND:US1AKMS0012" "GHCND:US1AKMS0020"
[25] "GHCND:US1AKMS0018" "GHCND:US1AKMS0014" "GHCND:US1AKPW0001" "GHCND:US1AKSH0002"
[29] "GHCND:US1AKVC0006" "GHCND:US1AKWH0012" "GHCND:US1AKWP0001" "GHCND:US1AKWP0002"
[33] "GHCND:US1ALAT0014" "GHCND:US1ALAT0013" "GHCND:US1ALBW0095" "GHCND:US1ALBW0087"
[37] "GHCND:US1ALBW0020" "GHCND:US1ALBW0066" "GHCND:US1ALBW0031" "GHCND:US1ALBW0082"
[41] "GHCND:US1ALBW0099" "GHCND:US1ALBW0040" "GHCND:US1ALBW0004" "GHCND:US1ALBW0085"
[45] "GHCND:US1ALBW0009" "GHCND:US1ALBW0001" "GHCND:US1ALBW0094" "GHCND:US1ALBW0013"
[49] "GHCND:US1ALBW0079" "GHCND:US1ALBW0060"

In reality, I have about 22,000 weather stations. This is just showing the first 50.

rnoaa code

library(rnoaa)
options("noaakey" = Sys.getenv("noaakey"))
Sys.getenv("noaakey")

weather <- ncdc(datasetid = 'GHCND', stationid = df, var = 'PRCP', startdate = "2020-05-30",
                enddate = "2020-05-30", add_units = TRUE)

Which produces the following error: Error: Request-URI Too Long (HTTP 414)

When I subset df to just the first 100 entries, the request goes through, but I can't get data for more than the first 25 stations, even though the package details say I should be able to run 10,000 queries a day.

Loop Attempt

df1 <- df[1:125] ## Splitting dataframe. Too big otherwise

for (i in 1:length(df1)){
  weather2<-ncdc(datasetid = 'GHCND', stationid=df1[i],var='PRCP',startdate ='2020-06-30',enddate='2020-06-30',
          add_units = TRUE)
  
}

But this just produces a data frame with a single row, that row being the 125th weather station.

If anyone could give advice on what to try next, that would be great :)

Also, cross linked: https://discuss.ropensci.org/t/rnoaa-getting-county-level-rain-data/2403

2 Answers

In your loop attempt, weather2 is overwritten on each iteration of the loop.
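To see the overwriting in isolation, here is a minimal sketch with no API calls. The vector `ids` and the call to `toupper()` are just stand-ins (my assumptions) for the station vector and for `ncdc()`.

```r
ids <- c("a", "b", "c")

# Assigning to a plain variable: each iteration replaces the previous value
for (i in seq_along(ids)) {
  last <- toupper(ids[i])
}
last  # only the final element survives: "C"

# Assigning into a pre-allocated list: every iteration keeps its own slot
results <- vector("list", length(ids))
for (i in seq_along(ids)) {
  results[[i]] <- toupper(ids[i])
}
unlist(results)  # "A" "B" "C"
```

The same applies to the loop in the question: assigning to `weather2[[i]]` (with `weather2` created as a list first) would keep all 125 responses.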

Since the number of requests and the length of each response are unknown, one way to solve this problem is to wrap the call to ncdc in an lapply call and save each response in a list. Then, once lapply has finished, merge all the data into one large data frame.

library(rnoaa)
library(dplyr)

stationlist <- ghcnd_stations() %>% filter(state == "DE")
df <- paste0("GHCND:", stationlist$id[1:10])

# Request data for each station separately and store the results in a list
output <- lapply(df, function(station) {
  weather <- ncdc(datasetid = 'GHCND', stationid = station, var = 'PRCP',
                  startdate = "2020-05-30", enddate = "2020-05-30",
                  add_units = TRUE)
  # weather$data
  # to include the metadata
  data.frame(t(unlist(weather$meta)), weather$data)
})

# Merge into one data frame
answer <- bind_rows(output)

I would verify this process on a small subset of stations first, as the calls to NOAA can be slow. I would also try to narrow the stations searched down to the area of interest and to the ones still actively collecting data.

Also, concerning the limit on the number of records returned, from the help page: "Note that the default limit (no. records returned) is 25. Look at the metadata in $meta to see how many records were found. If more were found than 25, you could set the parameter limit to something higher than 25."
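As a rough sketch of that check, the snippet below compares the number of records found with the number actually returned. It uses a mock response so it runs without an API key. `is_truncated()` is a hypothetical helper, the numbers are made up, and I am assuming the response has the shape `$meta$totalCount` plus a `$data` data frame, so verify that against your own output.

```r
# Hypothetical helper: TRUE when more records were found than were returned.
# Assumes resp is a list with $meta$totalCount and a $data data frame.
is_truncated <- function(resp) {
  resp$meta$totalCount > nrow(resp$data)
}

# Mock response standing in for the result of ncdc(); all values are made up
mock <- list(
  meta = list(totalCount = 40, pageCount = 25, offset = 1),
  data = data.frame(station = paste0("GHCND:FAKE", 1:25), value = 0)
)

is_truncated(mock)  # 40 records found, only 25 returned
```

If this comes back TRUE for a real response, re-run the request with a larger `limit` (as far as I know the API caps it at 1000 per request) or page through the results with `offset`.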

Dave2e
  • OK I got it working I believe! `z <- split(df, ceiling(seq_along(df)/100))` `out <- list()` `for (i in seq_along(z)) { out[[i]] <- ncdc(datasetid = 'GHCND', stationid = z[[i]], var = 'PRCP', startdate = "2020-05-30", enddate = "2020-05-30", add_units = TRUE, limit = 100) }` My output is a list of 219 elements, each of which has two elements, "meta" and "data". What I'm interested in is combining the rows from the 219 `out[[i]]$data`. Would this require a for loop or can I use bind_rows? – Tobin Brooks Mar 17 '21 at 17:43
  • Reading my last comment, maybe that wasn't totally clear. Essentially, I have a list of 219 elements. And each element is a list with two elements. So for list 1, I have out[[1]]$meta and out[[1]]$data. I want to combine the rows for out[[1]]$data, out[[2]]$data...out[[219]]$data – Tobin Brooks Mar 17 '21 at 18:08

Figured it out, with a lot of help from @Dave2e and a bud on the ropensci link above.

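library(rnoaa)
library(dplyr)  # for bind_rows()

df <- cleaned_emshr$ghcnd  ## Grabbing necessary column

## Split the station IDs into chunks of 100 to keep each request URI short
z <- split(df, ceiling(seq_along(df)/100))
out <- list()
for (i in seq_along(z)) {
  out[[i]] <- ncdc(datasetid = 'GHCND', stationid = z[[i]], var = 'PRCP',
                   startdate = "2020-05-30", enddate = "2020-05-30",
                   add_units = TRUE, limit = 100)
}

## Pull the $data element out of each response and stack the rows
weather <- bind_rows(lapply(out, "[[", "data"))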
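For anyone who wants to check the chunk-and-combine logic without hitting the API, here is a small self-contained sketch. The station IDs are made up and `fake_ncdc()` is a hypothetical stand-in that only mimics the meta/data shape of a response. I also swapped `bind_rows()` for base R's `do.call(rbind, ...)` so it runs without dplyr.

```r
# Made-up station IDs standing in for the real vector
ids <- paste0("GHCND:FAKE", sprintf("%04d", 1:250))

# Split into chunks of at most 100 IDs each
chunks <- split(ids, ceiling(seq_along(ids) / 100))
lengths(chunks)  # chunk sizes: 100, 100, 50

# Hypothetical stand-in for ncdc(): returns a list with $meta and $data
fake_ncdc <- function(stationid) {
  list(meta = list(totalCount = length(stationid)),
       data = data.frame(station = stationid, value = 0))
}

# One response per chunk
out <- lapply(chunks, fake_ncdc)

# Extract each $data element and stack the rows (base R instead of bind_rows)
weather <- do.call(rbind, lapply(out, "[[", "data"))
nrow(weather)  # 250, one row per made-up station
```

With real data, it is worth comparing `nrow(weather)` against the sum of the `totalCount` values in the metadata to confirm nothing was cut off by `limit`.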
  • Tobin, I made an edit to my code above to include the metadata in the data frame. While your approach works, it requires 2 loops (for and lapply). It is better practice to get the out list correct on the first pass. – Dave2e Mar 17 '21 at 22:09