
I am trying to request ERA5 data. Requests are limited by size, and the system automatically rejects any request bigger than the limit. However, one wants to be as close to the request limit as possible, since each request takes a few hours to be processed by the Climate Data Store (CDS).

For example, I have a vector years <- seq(from = 1981, to = 2019, by = 1) and a vector variables <- c("a", "b", "c", "d", "e", ..., "z"). The maximum request size is 11, which means length(years) * length(variables) must be less than or equal to 11.

For each request, I have to provide a list containing character vectors for years and variables. For example, req.list <- list(year = c("1981", "1982", ..., "1991"), variable = c("a")) will work, since 11 years * 1 variable = 11.
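To make this concrete, here is a minimal sketch of the setup (the letters are just stand-ins for the real ERA5 variable names):

years <- as.character(seq(from = 1981, to = 2019, by = 1))  # 39 years
variables <- letters                                        # 26 placeholder variable names
max.size <- 11                                              # maximum request size

# a valid request: 11 years x 1 variable = 11 fields
req.list <- list(year = years[1:11], variable = variables[1])
length(req.list$year) * length(req.list$variable) <= max.size  # TRUE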

I thought about using expand.grid(), then taking rows 1-11, rows 12-22, and so on, and applying unique() to each column to get the years and variables for each request. But this approach sometimes produces a request that is too big: req.list <- list(year = c("2013", "2014", ..., "2018"), variable = c("a", "b")) is rejected, since length(year) * length(variable) = 12 > 11.
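A sketch of that expand.grid() approach, showing how a block of 11 rows can straddle two variables and exceed the limit (the numbers differ from the 12-field example above, but it is the same failure mode):

years <- as.character(1981:2019)
variables <- letters
combos <- expand.grid(year = years, variable = variables, stringsAsFactors = FALSE)

# take consecutive blocks of 11 rows and collapse each column with unique()
chunk <- combos[34:44, ]                       # the first block that straddles "a" and "b"
req.list <- list(year = unique(chunk$year),
                 variable = unique(chunk$variable))

# 11 unique years x 2 unique variables = 22 fields -> rejected
length(req.list$year) * length(req.list$variable)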

I am also using foreach() and doParallel to submit multiple requests in parallel (at most 15 requests at a time).
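For reference, the parallel submission looks roughly like this; req.chunks is assumed to be a list of request lists that each obey the size limit, and submit_request() is a hypothetical wrapper around whatever CDS client you call (for example the ecmwfr package, or a system call to the Python cdsapi), not a real function:

library(foreach)
library(doParallel)

cl <- makeCluster(15)        # at most 15 requests in flight at a time
registerDoParallel(cl)

# req.chunks: a list of request lists, each with length(year) * length(variable) <= 11
# submit_request(): hypothetical wrapper around the CDS client of your choice
results <- foreach(req = req.chunks, .errorhandling = "pass") %dopar% {
  submit_request(req)
}

stopCluster(cl)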

If anyone has a better solution (one that minimizes the number of requests while obeying the request size limit), please share. Thank you very much.

  • It is not a programmatic solution, but perhaps it helps. Taking your example from above with 39 years and 26 variables, you can think of your problem as dividing a 26x39 matrix into as many 1x11 matrices as possible. You could create 26*3 requests (size = 11) and are left with a 26x6 matrix. Then you can do 6*2 requests (size = 11) and are left with a 6x4 matrix. This cannot be divided into 1x11 matrices anymore, so here 3 requests with size = 8 are the best choice. In the end you would have 26*3 + 6*2 = 90 maximum-size requests and only 3 requests under the limit. – Gilean0709 Mar 03 '20 at 08:55
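A minimal R sketch of the splitting heuristic described in the comment above (carve full 1x11 strips along the years first, then along the variables for the leftover years, then pack the remaining block); this is just one way to code it, not a canonical solution:

split_requests <- function(years, variables, max.size = 11) {
  reqs <- list()

  # 1. full strips of max.size years for every variable
  n.strips <- length(years) %/% max.size
  for (v in variables) {
    for (i in seq_len(n.strips)) {
      idx <- ((i - 1) * max.size + 1):(i * max.size)
      reqs <- c(reqs, list(list(year = years[idx], variable = v)))
    }
  }
  rest.years <- years[seq_along(years) > n.strips * max.size]
  if (length(rest.years) == 0) return(reqs)

  # 2. for the leftover years, full strips of max.size variables per single year
  n.vstrips <- length(variables) %/% max.size
  for (y in rest.years) {
    for (i in seq_len(n.vstrips)) {
      idx <- ((i - 1) * max.size + 1):(i * max.size)
      reqs <- c(reqs, list(list(year = y, variable = variables[idx])))
    }
  }
  rest.vars <- variables[seq_along(variables) > n.vstrips * max.size]
  if (length(rest.vars) == 0) return(reqs)

  # 3. remaining block: pack as many leftover years per request as still fit
  years.per.req <- max(1, max.size %/% length(rest.vars))
  for (s in seq(1, length(rest.years), by = years.per.req)) {
    idx <- s:min(s + years.per.req - 1, length(rest.years))
    reqs <- c(reqs, list(list(year = rest.years[idx], variable = rest.vars)))
  }
  reqs
}

reqs <- split_requests(as.character(1981:2019), letters)
length(reqs)  # 93 requests: 90 of size 11 plus 3 of size 8, matching the comment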

1 Answer


The limit is set in terms of the number of fields, which one can think of as the number of "records" in the GRIB sense. The usual suggestion is to keep the list of variables and the shorter timescales (months, days, hours) in the retrieval command and then loop over the years (the longer timescale). For ERA5 this is largely a matter of choice, as the data is all held on cache rather than on tape; with tape-based requests it is important to retrieve data on the same tape in a single request (e.g. if you use the CDS to retrieve seasonal forecasts or other datasets that are not ERA5).

Here is a simple looped example:

import cdsapi

c = cdsapi.Client()

# list of years as strings; the range end is exclusive, so this covers 1979-2018
yearlist = [str(s) for s in range(1979, 2019)]

for year in yearlist:
    # one CDS request (and one output NetCDF file) per year
    c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'format': 'netcdf',
        'variable': [
            '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_dewpoint_temperature',
            '2m_temperature',
        ],
        'year': year,
        'month': [
            '01', '02', '03',
            '04', '05', '06',
            '07', '08', '09',
            '10', '11', '12',
        ],
        'day': [
            '01', '02', '03',
            '04', '05', '06',
            '07', '08', '09',
            '10', '11', '12',
            '13', '14', '15',
            '16', '17', '18',
            '19', '20', '21',
            '22', '23', '24',
            '25', '26', '27',
            '28', '29', '30',
            '31',
        ],
        'time': [
            '00:00', '01:00', '02:00',
            '03:00', '04:00', '05:00',
            '06:00', '07:00', '08:00',
            '09:00', '10:00', '11:00',
            '12:00', '13:00', '14:00',
            '15:00', '16:00', '17:00',
            '18:00', '19:00', '20:00',
            '21:00', '22:00', '23:00',
        ],
    },
    'data'+year+'.nc')

I presume you can parallelize this with foreach, although I've never tried. I suspect it won't help much, as there is a per-user job limit which is set quite low, so you will just end up with a large number of jobs sitting in the CDS queue...
