EDIT 1: the problem in more simple terms (for the whole issue, check the Original Edit)
How can I unlist a list of key, values pairs within a dataframe, knowing that the number of pairs may vary.
For instance:
_source.types _source.label
1 key1, key2, value1, value2 label1
2 NULL label1
3 key3, value3 label2
Note that (key1, key2, value1, value2) is a <data.frame>
Expected result:
types.k1 types.v1 types.k2 types.v2 label
1 key1 value1 key2 value2 label1
2 NULL NULL NULL NULL label1
3 key3 value3 NULL NULL label2
I've tried unnest
, unlist
,... without success as I have always an error due to the number of elements or the class of the object.
ORIGINAL EDIT I have a result from a Search request to an elastic search base, using elastic package. As the query is a loop from terms within a pre-existing dataframe, I have a list of responses for each term.
#existing dataframe
df <- data.frame(id=c("1","2"),terms=(c("Guy de Maupassant","Vincent Cassel")))
#loop query to ES
query_es <- '{"_source": ["id", "label", "types", "subTypes"],
"query":{"bool":{"must":[{"term":{"label":"%s"}}]}}}'
out = list()
for (i in seq_along(df$terms)) {
out[[i]] <- Search(index = "index_1",
body = sprintf(query_es, df$terms[i]),
size = 3, asdf=TRUE)$hits$hits
}
The result is a list of lists like this (I just display the first result for clarity) :
[[1]]
_index _type _id _score _source.types
1 index_1 triplet Q9327 13.18037 Q5, dbPedia.Person, être humain, personne
2 index_1 triplet Q3122270 13.17847 Q11424, dbPedia.Film, film, film
_source.subTypes _source.label _source.id
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant Q9327
2 NULL Guy de Maupassant Q3122270
As you can see, I have 2 possible results for the first term: a writer or a movie, each one having a list of {id,value} for the types and the subTypes. In order to have a more comprehensive view, I re-arrange the output:
out2 <- bind_rows(out, .id = "id")
out2 <- out_i_bind2[,-c(2:5)]
colnames(out2) <- c("id","types","subTypes","entityLabel","entityId")
As a result, I have (for the first term only):
id types
1 1 Q5, dbPedia.Person, être humain, personne
2 1 Q11424, dbPedia.Film, film, film
subTypes entityLabel entityId
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant Q9327
2 NULL Guy de Maupassant Q3122270
Notice that for the second result (movie), I do not have any subType. Moreover, the length of the listed elements within types or subTypes may vary according to the search term.
Now, I would like to unnest the lists in order to have a dataframe like this (sorry the format is not very comprehensive, but basically the idea is to have each {key,value} unnested in 2 columns with an incremental index):
X_id X_source.types.id X_source.types.value X_source.types.id.1 X_source.types.value.1 X_source.subTypes.id
1 1 Q5 être humain dbPedia.Person personne Q1930187
2 1 Q11424 film dbPedia.Film film <NA>
X_source.subTypes.value X_source.subTypes.id.1 X_source.subTypes.value.1 X_source.subTypes.id.2 X_source.subTypes.value.2
1 journaliste Q36180 écrivain Q15949613 nouvelliste
2 <NA> <NA> <NA> <NA> <NA>
X_source.subTypes.id.3 X_source.subTypes.value.3 X_source.subTypes.id.4 X_source.subTypes.value.4 X_source.label X_source.id
1 Q6625963 romancier Q214917 dramaturge Guy de Maupassant Q9327
2 <NA> <NA> <NA> <NA> Guy de Maupassant Q3122270
The conservation of the related ids is very important. I tried many things found here : Convert in R output of package Elastic (nested list?) to data.frame or JSON or here: Extract data from elasticsearch into R with elastic package, load into a data frame, error due to hits not expanding to the same length without any success...
Any idea to deal with it? I was wondering if I should transform the rearranged output (out2) or if it's better to come back to the original output (out)...
Thanks in advance!
PS : here is the dput version of the "out" (from df Search):
> dput(out, control="useSource")
list(list(`_index` = c("alias_fr", "alias_fr"), `_type` = c("triplet",
"triplet"), `_id` = c("Q9327", "Q3122270"), `_score` = c(13.180366,
13.178474), `_source.types` = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424",
"dbPedia.Film"), value = c("film", "film"))), `_source.subTypes` = list(
list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963",
"Q214917"), value = c("journaliste", "écrivain", "nouvelliste",
"romancier", "dramaturge")), NULL), `_source.label` = c("Guy de Maupassant",
"Guy de Maupassant"), `_source.id` = c("Q9327", "Q3122270")),
list(`_index` = "alias_fr", `_type` = "triplet", `_id` = "Q193504",
`_score` = 13.18018, `_source.types` = list(list(id = c("Q5",
"dbPedia.Person"), value = c("être humain", "personne"
))), `_source.subTypes` = list(list(id = c("Q33999",
"Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur",
"acteur de cinéma", "producteur de cinéma", "réalisateur",
"scénariste"))), `_source.label` = "Vincent Cassel",
`_source.id` = "Q193504"))
And the same for out2 :
> dput(out2, control="useSource")
list(id = c("1", "1", "2"), types = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424",
"dbPedia.Film"), value = c("film", "film")), list(id = c("Q5",
"dbPedia.Person"), value = c("être humain", "personne"))), subTypes = list(
list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963",
"Q214917"), value = c("journaliste", "écrivain", "nouvelliste",
"romancier", "dramaturge")), NULL, list(id = c("Q33999",
"Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur",
"acteur de cinéma", "producteur de cinéma", "réalisateur",
"scénariste"))), entityLabel = c("Guy de Maupassant", "Guy de Maupassant",
"Vincent Cassel"), entityId = c("Q9327", "Q3122270", "Q193504"
))