0

EDIT 1: the problem in more simple terms (for the whole issue, check the Original Edit)

How can I unlist a list of key, values pairs within a dataframe, knowing that the number of pairs may vary.

For instance:

   _source.types                  _source.label
1 key1, key2, value1, value2      label1
2 NULL                            label1
3 key3, value3                    label2

Note that (key1, key2, value1, value2) is a <data.frame>

Expected result:

  types.k1  types.v1  types.k2  types.v2  label
1 key1      value1    key2      value2    label1
2 NULL      NULL      NULL      NULL      label1
3 key3      value3    NULL      NULL      label2

I've tried unnest, unlist,... without success as I have always an error due to the number of elements or the class of the object.

ORIGINAL EDIT I have a result from a Search request to an elastic search base, using elastic package. As the query is a loop from terms within a pre-existing dataframe, I have a list of responses for each term.

#existing dataframe
df <- data.frame(id=c("1","2"),terms=(c("Guy de Maupassant","Vincent Cassel")))

#loop query to ES
query_es <- '{"_source": ["id", "label", "types", "subTypes"],
"query":{"bool":{"must":[{"term":{"label":"%s"}}]}}}'

out = list()
for (i in seq_along(df$terms)) {
  out[[i]] <- Search(index = "index_1", 
                     body = sprintf(query_es, df$terms[i]), 
                     size = 3, asdf=TRUE)$hits$hits
}

The result is a list of lists like this (I just display the first result for clarity) :

[[1]]
_index   _type      _id   _score                             _source.types
1 index_1 triplet    Q9327 13.18037 Q5, dbPedia.Person, être humain, personne
2 index_1 triplet Q3122270 13.17847          Q11424, dbPedia.Film, film, film
                                                                                      _source.subTypes     _source.label _source.id
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant      Q9327
2                                                                                                      NULL Guy de Maupassant   Q3122270

As you can see, I have 2 possible results for the first term: a writer or a movie, each one having a list of {id,value} for the types and the subTypes. In order to have a more comprehensive view, I re-arrange the output:

out2 <- bind_rows(out, .id = "id")
out2 <- out_i_bind2[,-c(2:5)]
colnames(out2) <- c("id","types","subTypes","entityLabel","entityId")

As a result, I have (for the first term only):

  id                                     types
1  1 Q5, dbPedia.Person, être humain, personne
2  1          Q11424, dbPedia.Film, film, film
                                                                                               subTypes       entityLabel entityId
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant    Q9327
2                                                                                                      NULL Guy de Maupassant Q3122270

Notice that for the second result (movie), I do not have any subType. Moreover, the length of the listed elements within types or subTypes may vary according to the search term.

Now, I would like to unnest the lists in order to have a dataframe like this (sorry the format is not very comprehensive, but basically the idea is to have each {key,value} unnested in 2 columns with an incremental index):

   X_id X_source.types.id X_source.types.value X_source.types.id.1 X_source.types.value.1 X_source.subTypes.id
1 1                Q5          être humain      dbPedia.Person               personne             Q1930187
2 1            Q11424                 film        dbPedia.Film                   film                 <NA>
  X_source.subTypes.value X_source.subTypes.id.1 X_source.subTypes.value.1 X_source.subTypes.id.2 X_source.subTypes.value.2
1             journaliste                 Q36180                  écrivain              Q15949613               nouvelliste
2                    <NA>                   <NA>                      <NA>                   <NA>                      <NA>
  X_source.subTypes.id.3 X_source.subTypes.value.3 X_source.subTypes.id.4 X_source.subTypes.value.4    X_source.label X_source.id
1               Q6625963                 romancier                Q214917                dramaturge Guy de Maupassant       Q9327
2                   <NA>                      <NA>                   <NA>                      <NA> Guy de Maupassant    Q3122270

The conservation of the related ids is very important. I tried many things found here : Convert in R output of package Elastic (nested list?) to data.frame or JSON or here: Extract data from elasticsearch into R with elastic package, load into a data frame, error due to hits not expanding to the same length without any success...

Any idea to deal with it? I was wondering if I should transform the rearranged output (out2) or if it's better to come back to the original output (out)...

Thanks in advance!

PS : here is the dput version of the "out" (from df Search):

> dput(out, control="useSource")
list(list(`_index` = c("alias_fr", "alias_fr"), `_type` = c("triplet", 
"triplet"), `_id` = c("Q9327", "Q3122270"), `_score` = c(13.180366, 
13.178474), `_source.types` = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424", 
"dbPedia.Film"), value = c("film", "film"))), `_source.subTypes` = list(
    list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963", 
    "Q214917"), value = c("journaliste", "écrivain", "nouvelliste", 
    "romancier", "dramaturge")), NULL), `_source.label` = c("Guy de Maupassant", 
"Guy de Maupassant"), `_source.id` = c("Q9327", "Q3122270")), 
    list(`_index` = "alias_fr", `_type` = "triplet", `_id` = "Q193504", 
        `_score` = 13.18018, `_source.types` = list(list(id = c("Q5", 
        "dbPedia.Person"), value = c("être humain", "personne"
        ))), `_source.subTypes` = list(list(id = c("Q33999", 
        "Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur", 
        "acteur de cinéma", "producteur de cinéma", "réalisateur", 
        "scénariste"))), `_source.label` = "Vincent Cassel", 
        `_source.id` = "Q193504"))

And the same for out2 :

> dput(out2, control="useSource")
list(id = c("1", "1", "2"), types = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424", 
"dbPedia.Film"), value = c("film", "film")), list(id = c("Q5", 
"dbPedia.Person"), value = c("être humain", "personne"))), subTypes = list(
    list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963", 
    "Q214917"), value = c("journaliste", "écrivain", "nouvelliste", 
    "romancier", "dramaturge")), NULL, list(id = c("Q33999", 
    "Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur", 
    "acteur de cinéma", "producteur de cinéma", "réalisateur", 
    "scénariste"))), entityLabel = c("Guy de Maupassant", "Guy de Maupassant", 
"Vincent Cassel"), entityId = c("Q9327", "Q3122270", "Q193504"
))
Tau
  • 173
  • 1
  • 8
  • Can you `dput` say the first few rows of your dataset or create a sample dataset that is representative of your data? – A5C1D2H2I1M1N2O1R2T1 Dec 18 '17 at 17:37
  • I have edited the post adding the dput for the 2 first terms of my dataset – Tau Dec 19 '17 at 09:23
  • does `do.call(bind_rows, out)` helps? your data are very hard to understand. – Bastien Dec 22 '17 at 12:32
  • nope. Already tried this but nothing change in the output, the key,values are still nested... It splits well the key,value pairs when I apply it to one nested column: do.call(rbind,out$`_source.types`), but each pair is on 1 line, so I loose the attachment to the key variable (label) – Tau Dec 22 '17 at 15:05
  • I might be not very clear in the decription of my data. I try again : `class(out) [1] "data.table" "data.frame"` In the output, for a given row, I have columns with 1 value (for instance `Guy de Maupassant`and some with value being a dataframe ``. These data.frame values are a list of key,values pairs... Hope it clarifies a bit – Tau Dec 22 '17 at 17:54

1 Answers1

0

I finally manage to solve the problem, thanks to this post and some transformation steps. The solution is not very elegant though, but it works:

out_bind <- bind_rows(out, .id = "id")

#transform to data table in order to apply rbindlist
out <- as.data.table(out_bind) 
#rbindlist for "types" variable
out_nest1 <- rbindlist(out$types, fill = T, id = "row")[, entityId := out$entityId[row]][]
#rbindlist to "subTypes variable (choosing another id name -row1-, if not Rstudio was crashing!)
out_nest2 <- rbindlist(out$subTypes, fill = T, id = "row1")[, entityId := out$entityId[row1]][]

#finally joining the whole data
out <- full_join(out,out_nest1,by="entityId")
out <- full_join(out,out_nest2,by="entityId")

Now I can spend a good christmas ;)

Edit: the crash was not due to the id name, but to a data.table issue, solved here.

Tau
  • 173
  • 1
  • 8