1

I know the subject is widely covers but I didn't find the code working for my case... I have a dataframe of this type:

    V1             V2                V3   
 1: label1         alias_fr          alias_fr
 2: label1         triplet           triplet
 3: label1         Q9327             Q3122270
 4: label2         NULL              NULL
 5: label3         alias_fr          NULL
 6: label3         triplet           NULL
 7: label3         Q678              NULL

This dataframe is generated after mapping a json output to a query from a df input:

df <- Map(rbind, originalDF$input,out) #I first used Map(c,..) but it seems to be more difficult to reshape than rbind
df <- rbind.fill(lapply(df,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
class(df)
[1] "data.frame"

The example given is simplified though, as I have more than 3 columns, and some values are lists. Nevertheless when I have non-NULL values for a label I have always the same number of rows within a column (3 in my example: alias_fr, triplet, Qxx).

And I would like to have V2 and V3 values in row for each V1 value:

V1       var1       var2      var3
label1   alias_fr   triplet   Q9327
label1   alias_fr   triplet   Q3122270
label2   NULL       NULL      NULL  
label3   alias_fr   triplet   Q678 

I try to start melt: melt(df,id="V1"), but then I am stuck. I also tried reshape, cast, dcast, without any success, and I am more and more confuse with all reshaping stuff... If a reshape master is around, I would be very gratefull ;)

[Edit]: real objects to clarify my issue

Ok so this is an extract of the real dataset I’m working with:

#original dataset (actually it’s one column of the dataset)
originalDF <- c("Guy de Maupassant", "J.-J. Goldman", "Poitou-Charentes")

#output of the API query from the text in the orginalDF
out <- list(structure(list(`_index` = c("alias_fr", "alias_fr"), `_type` = c("triplet", 
"triplet"), `_id` = c("Q9327", "Q3122270"), `_score` = c(NA, 
NA), sort = list(-4.95263021255079, -6.65910164747673), `_source.types` = list(
    structure(list(id = c("Q5", "dbPedia.Person"), value = c("être humain", 
    "personne")), .Names = c("id", "value"), class = "data.frame", row.names = 1:2), 
    structure(list(id = c("Q11424", "dbPedia.Film"), value = c("film", 
    "film")), .Names = c("id", "value"), class = "data.frame", row.names = 1:2)), 
    `_source.pageRank` = c(-4.95263021255079, -6.65910164747673
    ), `_source.subTypes` = list(structure(list(id = c("Q1930187", 
    "Q36180", "Q15949613", "Q6625963", "Q214917"), value = c("journaliste", 
    "écrivain", "nouvelliste", "romancier", "dramaturge")), .Names = c("id", 
    "value"), class = "data.frame", row.names = c(NA, 5L)), NULL), 
    `_source.label` = c("Guy de Maupassant", "Guy de Maupassant"
    ), `_source.id` = c("Q9327", "Q3122270")), .Names = c("_index", 
"_type", "_id", "_score", "sort", "_source.types", "_source.pageRank", 
"_source.subTypes", "_source.label", "_source.id"), class = "data.frame", row.names = 1:2), 
    list(), structure(list(`_index` = "alias_fr", `_type` = "triplet", 
        `_id` = "Q17009", `_score` = NA, sort = list(-5.0448283638424), 
        `_source.types` = list(structure(list(id = "Q22670030", 
            value = "ancienne région française"), .Names = c("id", 
        "value"), class = "data.frame", row.names = 1L)), `_source.pageRank` = -5.0448283638424, 
        `_source.label` = "Poitou-Charentes", `_source.id` = "Q17009"), .Names = c("_index", 
    "_type", "_id", "_score", "sort", "_source.types", "_source.pageRank", 
    "_source.label", "_source.id"), class = "data.frame", row.names = 1L))

#df object (generated from Map, then rbind.fill)
df <- structure(list(V1 = list("Guy de Maupassant", "Guy de Maupassant", 
    "Guy de Maupassant", "Guy de Maupassant", "Guy de Maupassant", 
    "Guy de Maupassant", "Guy de Maupassant", "Guy de Maupassant", 
    "Guy de Maupassant", "Guy de Maupassant", "J.-J. Goldman", 
    "Poitou-Charentes", "Poitou-Charentes", "Poitou-Charentes", 
    "Poitou-Charentes", "Poitou-Charentes", "Poitou-Charentes", 
    "Poitou-Charentes", "Poitou-Charentes", "Poitou-Charentes"), 
    V2 = list("alias_fr", "triplet", "Q9327", NA_character_, 
        -4.95263021255079, structure(list(id = c("Q5", "dbPedia.Person"
        ), value = c("être humain", "personne")), .Names = c("id", 
        "value"), class = "data.frame", row.names = 1:2), "-4.95263021255079", 
        structure(list(id = c("Q1930187", "Q36180", "Q15949613", 
        "Q6625963", "Q214917"), value = c("journaliste", "écrivain", 
        "nouvelliste", "romancier", "dramaturge")), .Names = c("id", 
        "value"), class = "data.frame", row.names = c(NA, 5L)), 
        "Guy de Maupassant", "Q9327", NULL, "alias_fr", "triplet", 
        "Q17009", NA_character_, -5.0448283638424, structure(list(
            id = "Q22670030", value = "ancienne région française"), .Names = c("id", 
        "value"), class = "data.frame", row.names = 1L), "-5.0448283638424", 
        "Poitou-Charentes", "Q17009"), V3 = list("alias_fr", 
        "triplet", "Q3122270", NA_character_, -6.65910164747673, 
        structure(list(id = c("Q11424", "dbPedia.Film"), value = c("film", 
        "film")), .Names = c("id", "value"), class = "data.frame", row.names = 1:2), 
        "-6.65910164747673", NULL, "Guy de Maupassant", "Q3122270", 
        NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 
        NULL), V4 = list(NULL, NULL, NULL, NULL, NULL, NULL, 
        NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, 
        NULL, NULL, NULL, NULL, NULL)), .Names = c("V1", "V2", 
"V3", "V4"), row.names = c(NA, 20L), class = "data.frame")
Tau
  • 173
  • 1
  • 8
  • it might be helpful to create a sample dataframe – tjebo Jan 08 '18 at 22:56
  • The printed version of this object appears to be something other than a dataframe. ?data.table?. ?tibble? – IRTFM Jan 08 '18 at 22:59
  • How does your real data look like? E.g., do you always have a maximum of 3 rows per label? – tjebo Jan 09 '18 at 00:13
  • I have edited the post with more information: yes, it is a dataframe, and the maximum rows per label is fixed (when non NULL values), see details in the post – Tau Jan 09 '18 at 08:29

1 Answers1

0

For your particular example (each label has a maximum of three rows), one approach would be

require(dplyr)
df <- data.frame(label = c(rep('a',3),'b', rep('c',3)), id1 = c(1,2,3,NA, 1,2,4), id2 = c(1,2,5,NA,NA,NA,NA))

#I used different names than you, because I have the impression that each column belongs to another measurement or so 

df1 <- select(df, label, id1)#subsets your df into two data frames
df2 <- select(df, label, id2)       

df1a <- df1 %>% group_by(label) %>% summarise (var1 = id1[1], var2 = id1[2], var3 = id1[3])
df2b <- df2 %>% group_by(label) %>% summarise (var1 = id2[1], var2 = id2[2], var3 = id2[3])

#this groups the columns after your label and then you can force the creation of NA values for the observations that do not have a row in your df

df_final <- rbind(df1a,df2b) 

> df_final
# A tibble: 6 x 4
   label  var1  var2  var3
  <fctr> <dbl> <dbl> <dbl>
1      a     1     2     3
2      b    NA    NA    NA
3      c     1     2     4
4      a     1     2     5
5      b    NA    NA    NA
6      c    NA    NA    NA

I know that this is not elegant, and not generalisable, because you manually assign a new column for each row that you have/ do not have in your df, but it should work with your example.

tjebo
  • 21,977
  • 7
  • 58
  • 94
  • Thanks Tjebo, your solution should work indeed, the problem is I have more than 3 rows per label and more than 3 columns in myr real dataframe, so it's hard to generalise as you mentioned... – Tau Jan 09 '18 at 08:34
  • I've tried your solution, but as I have some lists in my values, the grouping doesn't work: `Error in grouped_df_impl(data, unname(vars), drop) : Column `V1` can't be used as a grouping variable because it's a list` – Tau Jan 09 '18 at 08:48
  • The list values are for instance: `list(id = c("Q5", "dbPedia.Person"), value = c("être humain", "personne"))`... I may have to unlist first? Your solution is interesting because I can select the rows I want to have in the final output... – Tau Jan 09 '18 at 09:08
  • Sorry I didn't get the list part - you have lists within your data frame? I think it would actually help if you could create a data sample for us on which we could work- maybe it could also help to go back a step and show the original data frame and let us know what you need to extract? – tjebo Jan 09 '18 at 09:35
  • I have edited the post to give you the real structure of the objects I am working with – Tau Jan 09 '18 at 11:14
  • Thanks. However, I am unfortunately not able to help with my little skills. It looks like rather messy data to me though, but let other people decide... I personally would try to create tidy dataframes (one observation per row, one variable per column). Maybe rethink your data query from the original database you seem to use. ? Sorry to be of so little help – tjebo Jan 09 '18 at 14:14
  • Thanks anyway Tjebo. It's messy data indeed, but I do not have choice for the API requesting, so I have to deal with it... Unfortunately I'm still a newbie, thus it's not easy... – Tau Jan 09 '18 at 16:55