R: Issues when pulling values from a nested list into a dataframe

Question

so this should be a relatively easy question on pulling items in a list into a dataframe, but I'm stuck on something.

I have the following list (I'm showing just part of the list for you, it's far longer than this):

str(raw_jobs_list)

List of 2
 $ :List of 4
  ..$ id    : chr "3594134"
  ..$ score : int 1
  ..$ fields:List of 16
  .. ..$ date             :List of 3
  .. .. ..$ changed: chr "2020-04-18T00:35:00+00:00"
  .. .. ..$ created: chr "2020-04-07T11:15:37+00:00"
  .. .. ..$ closing: chr "2020-04-17T00:00:00+00:00"
  .. ..$ country          :List of 1
  .. .. ..$ :List of 6
  .. .. .. ..$ href     : chr "https://api.reliefweb.int/v1/countries/149"
  .. .. .. ..$ name     : chr "Mali"
  .. .. .. ..$ location :List of 2
  .. .. .. .. ..$ lon: num -1.25
  .. .. .. .. ..$ lat: num 17.4
  .. .. .. ..$ id       : int 149
  .. .. .. ..$ shortname: chr "Mali"
  .. .. .. ..$ iso3     : chr "mli"
  .. ..$ title            : chr "REGIONAL MANAGER West Africa"

I tried pulling them out using:

jobs_data_df <- list.stack(list.select(raw_jobs_list, 
                                       fields$title, 
                                       fields$country$name,
                                       fields$date$created))

Where raw_jobs_list is the list, but I get these NAs and am not sure how to get past it.

glimpse(jobs_data_df)
Rows: 2
Columns: 3
$ V1 <chr> "REGIONAL MANAGER West Africa", "Support Relief Group Public Health Advisor (Multiple Positions)"
$ V2 <lgl> NA, NA
$ V3 <chr> "2020-04-07T11:15:37+00:00", "2020-05-04T15:20:37+00:00"

It's possible there's something obvious I'm overlooking as I haven't worked much with lists before. Any ideas?

Thanks so much! C

PS. If you're interested, I'm working with this API and this is how I got there so far.

jobs <- GET(url = "https://api.reliefweb.int/v1/jobs?appname=apidoc&preset=analysis&profile=full&limit=2")


raw_jobs_list <- content(jobs)$data

The portion displayed above is a subset of the whole data; here is a portion of the first element of the list:

dput(lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))]))))

list(list(id = "3594134", score = 1L, fields = list(date = list(
    changed = "2020-04-18T00:35:00+00:00", created = "2020-04-07T11:15:37+00:00", 
    closing = "2020-04-17T00:00:00+00:00"), country = list(list(
    href = "https://api.reliefweb.int/v1/countries/149", name = "Mali", 
    location = list(lon = -1.25, lat = 17.35), id = 149L, shortname = "Mali", 
    iso3 = "mli")), title = "REGIONAL MANAGER West Africa")), 
    list(id = "3594129", score = 1L, fields = list(date = list(
        changed = "2020-05-19T00:04:01+00:00", created = "2020-05-04T15:20:37+00:00", 
        closing = "2020-05-18T00:00:00+00:00"), title = "Support Relief Group Public Health Advisor (Multiple Positions)")))

I don't think anybody is going to want to transcribe your `str`-rendering into a usable list. Please [edit] your question and paste the output of `dput(raw_jobs_list)`, thank you. — r2evans, Apr 27 '21 at 15:21
https://tidyr.tidyverse.org/articles/rectangle.html may help. — dash2, Apr 27 '21 at 16:05
Thanks @r2evans. The output is a bit too heavy with dput(raw_jobs_list), but I've added the initial code up there now, so you can easily run it to try! :) ``` — kodikai, Apr 28 '21 at 12:01
I've suggested an edit with (essentially) the portion of the output you had originally provided in the `str` output. I included the second element since it is different-enough from the first that one should know to look for missing fields. — r2evans, Apr 28 '21 at 13:16

score 1 · Accepted Answer · answered Apr 28 '21 at 13:29

If you look at just one element at a time, I think that as.data.frame does a pretty decent job. While I'll demonstrate using the abbreviated data (that I edited into your question), and the first element looks like:

raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))

as.data.frame(raw_jobs_sublist[[1]])
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                 fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli REGIONAL MANAGER West Africa

Shown differently (just for variety here), it's

str(as.data.frame(raw_jobs_sublist[[1]]))
# 'data.frame': 1 obs. of  13 variables:
#  $ id                         : chr "3594134"
#  $ score                      : int 1
#  $ fields.date.changed        : chr "2020-04-18T00:35:00+00:00"
#  $ fields.date.created        : chr "2020-04-07T11:15:37+00:00"
#  $ fields.date.closing        : chr "2020-04-17T00:00:00+00:00"
#  $ fields.country.href        : chr "https://api.reliefweb.int/v1/countries/149"
#  $ fields.country.name        : chr "Mali"
#  $ fields.country.location.lon: num -1.25
#  $ fields.country.location.lat: num 17.4
#  $ fields.country.id          : int 149
#  $ fields.country.shortname   : chr "Mali"
#  $ fields.country.iso3        : chr "mli"
#  $ fields.title               : chr "REGIONAL MANAGER West Africa"

In order to do this on all elements, we need to account for a few things:

not all elements have all fields, so whatever method we use needs to "fill" in the blanks;
we don't want to do it iteratively, let's just combine them all at once.

Here's a first stab:

dplyr::bind_rows(lapply(raw_jobs_sublist, as.data.frame))
#        id score       fields.date.changed       fields.date.created       fields.date.closing                        fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3                                                    fields.title
# 1 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countries/149                Mali                       -1.25                       17.35               149                     Mali                 mli                                    REGIONAL MANAGER West Africa
# 2 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                       <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA> Support Relief Group Public Health Advisor (Multiple Positions)

This also works with data.table::rbindlist. It does not work as well with do.call(rbind.data.frame, ...), since that is less tolerant of missing names. (This it can be done without too much trouble, there are occasionally other advantages to using these two options.)

Note: if you do this on the original data, R's default mechanism of displaying a data.frame will cramp your console with all of the text, which might be annoying. If you are already using dplyr or data.table in any of your work, both of those formats provide string-limiting, so that it is more tolerable on the console. For example, showing the whole thing:

tibble::tibble(dplyr::bind_rows(lapply(raw_jobs_list, as.data.frame)))
# # A tibble: 2 x 42
#   id    score fields.date.cha~ fields.date.cre~ fields.date.clo~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.country.~ fields.career_c~ fields.career_c~ fields.name fields.source.h~ fields.source.n~ fields.source.id fields.source.t~ fields.source.t~ fields.source.s~ fields.source.h~ fields.title fields.body
#   <chr> <int> <chr>            <chr>            <chr>            <chr>            <chr>                       <dbl>            <dbl>            <int> <chr>            <chr>            <chr>                       <int> <chr>       <chr>            <chr>                       <int> <chr>                       <int> <chr>            <chr>            <chr>        <chr>      
# 1 3594~     1 2020-04-18T00:3~ 2020-04-07T11:1~ 2020-04-17T00:0~ https://api.rel~ Mali                        -1.25             17.4              149 Mali             mli              Donor Relations~            20966 Bamako      https://api.rel~ ICCO COOPERATION            45059 Non-governmenta~              274 ICCO COOPERATION https://www.icc~ REGIONAL MA~ "**VACANCY~
# 2 3594~     1 2020-05-19T00:0~ 2020-05-04T15:2~ 2020-05-18T00:0~ <NA>             <NA>                        NA                NA                 NA <NA>             <NA>             Program/Project~             6867 <NA>        https://api.rel~ US Agency for I~             1751 Government                    271 USAID            http://www.usai~ Support Rel~ "### **SOL~
# # ... with 18 more variables: fields.type.name <chr>, fields.type.id <int>, fields.experience.name <chr>, fields.experience.id <int>, fields.url <chr>, fields.url_alias <chr>, fields.how_to_apply <chr>, fields.id <int>, fields.status <chr>, fields.body.html <chr>, fields.how_to_apply.html <chr>, href <chr>, fields.source.longname <chr>, fields.source.spanish_name <chr>,
# #   fields.theme.name <chr>, fields.theme.id <int>, fields.theme.name.1 <chr>, fields.theme.id.1 <int>

data.table::rbindlist(lapply(raw_jobs_list, as.data.frame), fill = TRUE)
#         id score       fields.date.changed       fields.date.created       fields.date.closing                     fields.country.href fields.country.name fields.country.location.lon fields.country.location.lat fields.country.id fields.country.shortname fields.country.iso3     fields.career_categories.name fields.career_categories.id fields.name
#     <char> <int>                    <char>                    <char>                    <char>                                  <char>              <char>                       <num>                       <num>             <int>                   <char>              <char>                            <char>                       <int>      <char>
# 1: 3594134     1 2020-04-18T00:35:00+00:00 2020-04-07T11:15:37+00:00 2020-04-17T00:00:00+00:00 https://api.reliefweb.int/v1/countri...                Mali                       -1.25                       17.35               149                     Mali                 mli Donor Relations/Grants Management                       20966      Bamako
# 2: 3594129     1 2020-05-19T00:04:01+00:00 2020-05-04T15:20:37+00:00 2020-05-18T00:00:00+00:00                                    <NA>                <NA>                          NA                          NA                NA                     <NA>                <NA>        Program/Project Management                        6867        <NA>
# 27 variables not shown: [fields.source.href <char>, fields.source.name <char>, fields.source.id <int>, fields.source.type.name <char>, fields.source.type.id <int>, fields.source.shortname <char>, fields.source.homepage <char>, fields.title <char>, fields.body <char>, fields.type.name <char>, ...]

For data.table, I already have some options set that facilitate this. Notably, I'm currently using:

options(
  datatable.prettyprint.char = 36,
  datatable.print.topn = 10,
  datatable.print.class = TRUE,
  datatable.print.trunc.cols = TRUE
)

At this point, you have a data.frame that should contain all of the data (and NA for elements with missing fields). From here, if you don't like the nested-names convention (e.g., fields.date.changed), then they can be easily renamed using patterns or conventional methods.

Brilliant, @r2evans worked like a charm! Thank you so much, and for the comprehensive explanations too on how this works. I also managed to make it work with thousands of entries now and for additional values to get pulled from the list. I've marked the question as answered, but was wondering, if you had a second, if you could explain the logic behind what this function here does? I struggled to reverse engineer it ------------- `raw_jobs_sublist <- lapply(raw_jobs_list, function(x) c(x[c("id","score")], list(fields=x[[3]][intersect(names(x[[3]]),c("date","country","title"))])))` — kodikai, Apr 28 '21 at 15:13
That is solely to provide a much-smaller sample dataset for the question. You commented that it was *"bit too heavy with dput"*, and I agree (based on the large-text fields), so I just trimmed down to approximately the fields/subfields present in your first `str` output. Nested-lists take a bit more care to subset some fields and keep all of others, ergo the somewhat-complicated approach. It was just to provide small, representative data. Ignore it in your production. Glad it worked well for you! — r2evans, Apr 28 '21 at 15:21

R: Issues when pulling values from a nested list into a dataframe

1 Answers1