0

I am using jsonlite::fromJSON to get a JSON document to a dataframe. The problem is that it is not getting flattened completely - for reasons not known to me (but someone below has mentioned because they are in the form of a dictionary, they won't get flattened).

df <- jsonlite::fromJSON(out, flatten = TRUE, simplifyDataFrame = TRUE)
n <- df$hits$total # 1 entry
dat <- df$hits$hits

# dat is a dataframe
dim(dat)
[1]  2 5

# class of each column in the dataframe
lapply(dat, class)

$`_index`
[1] "character"

$`_type`
[1] "character"

$`_id`
[1] "character"

$`_score`
[1] "numeric"

$`_source.samples`
[1] "list"

There is a column dat$_source.samples that is actually a list of dataframes:

> dim(dat$`_source.samples`[[1]])
[1] 21  2
> dim(dat$`_source.samples`[[2]])
[1] 21  2

How do I expand the column containing the dataframes such that they form two new columns within the existing dataframe - and duplicate the first four rows as a result of expansion. Here is an example:

Rstudio screenshot of how dat looks: Rstudio screenshot of dat

# the first four columns look like this
> head(dat[,1:4])
  _index _type                _id _score
1   pnoc genes ENSG00000131051.20      1
2   pnoc genes ENSG00000000457.13      1

# the fifth column `_source.samples` that has dataframes looks like this 
# just showing the dataframe in the first row of `dat`
> head(dat$`_source.samples`[[1]])
                        sample_id rsem.fpkm
1 C021_0001_20140916_tumor_RNASeq     39.11
2        CPBT_0001_1_tumor_RNASeq    184.56
3        CPBT_0007_1_tumor_RNASeq     41.29
4   C021_0010_001774_tumor_RNASeq     86.31
5   C021_0003_001409_tumor_RNASeq     79.24
6        CPBT_0005_1_tumor_RNASeq     66.20

So I would want something like this:

 _index _type                _id _score                       sample_id
1   pnoc genes ENSG00000000457.13      1 C021_0001_20140916_tumor_RNASeq
2   pnoc genes ENSG00000000457.13      1        CPBT_0001_1_tumor_RNASeq
3   pnoc genes ENSG00000000457.13      1        CPBT_0007_1_tumor_RNASeq
4   pnoc genes ENSG00000000457.13      1   C021_0010_001774_tumor_RNASeq
5   pnoc genes ENSG00000000457.13      1   C021_0003_001409_tumor_RNASeq
6   pnoc genes ENSG00000000457.13      1        CPBT_0005_1_tumor_RNASeq
  rsem.fpkm
1      1.39
2      5.58
3      1.93
4      3.64
5      5.20
6      3.69

Here is a reproducible dataset:

> dput(dat)
structure(list(`_index` = c("pnoc", "pnoc"), `_type` = c("genes", 
"genes"), `_id` = c("ENSG00000131051.20", "ENSG00000000457.13"
), `_score` = c(1, 1), `_source.samples` = list(structure(list(
    sample_id = c("C021_0001_20140916_tumor_RNASeq", "CPBT_0001_1_tumor_RNASeq", 
    "CPBT_0007_1_tumor_RNASeq", "C021_0010_001774_tumor_RNASeq", 
    "C021_0003_001409_tumor_RNASeq", "CPBT_0005_1_tumor_RNASeq", 
    "CPBT_0008_1_tumor_RNASeq", "C021_0002_001113_tumor_RNASeq", 
    "C021_0013_001872_tumor_RNASeq", "C021_0005_001661_tumor_RNASeq", 
    "C021_0007_001669_tumor_RNASeq", "C021_0008_001699_tumor_RNASeq", 
    "CPBT_0006_1_tumor_RNASeq", "C021_0011_001786_tumor_RNASeq", 
    "C021_0009_001766_tumor_RNASeq", "CPBT_0004_1_tumor_RNASeq", 
    "CPBT_0003_1_tumor_RNASeq", "CPBT_0009_1_tumor_RNASeq", "C021_0006_001666_tumor_RNASeq", 
    "C021_0012_001825_tumor_RNASeq", "C021_0004_001418_tumor_RNASeq"
    ), rsem.fpkm = c(39.11, 184.56, 41.29, 86.31, 79.24, 66.2, 
    42.13, 88.78, 78.73, 96.79, 38.5, 105.12, 129.16, 145.13, 
    117.96, 86.53, 75.43, 179.01, 0, 61.61, 98.64)), .Names = c("sample_id", 
"rsem.fpkm"), class = "data.frame", row.names = c(NA, 21L)), 
    structure(list(sample_id = c("C021_0001_20140916_tumor_RNASeq", 
    "CPBT_0001_1_tumor_RNASeq", "CPBT_0007_1_tumor_RNASeq", "C021_0010_001774_tumor_RNASeq", 
    "C021_0003_001409_tumor_RNASeq", "CPBT_0005_1_tumor_RNASeq", 
    "CPBT_0008_1_tumor_RNASeq", "C021_0002_001113_tumor_RNASeq", 
    "C021_0013_001872_tumor_RNASeq", "C021_0005_001661_tumor_RNASeq", 
    "C021_0007_001669_tumor_RNASeq", "C021_0008_001699_tumor_RNASeq", 
    "CPBT_0006_1_tumor_RNASeq", "C021_0011_001786_tumor_RNASeq", 
    "C021_0009_001766_tumor_RNASeq", "CPBT_0004_1_tumor_RNASeq", 
    "CPBT_0003_1_tumor_RNASeq", "CPBT_0009_1_tumor_RNASeq", "C021_0006_001666_tumor_RNASeq", 
    "C021_0012_001825_tumor_RNASeq", "C021_0004_001418_tumor_RNASeq"
    ), rsem.fpkm = c(1.39, 5.58, 1.93, 3.64, 5.2, 3.69, 1.75, 
    5.38, 3.46, 4.14, 0.96, 3.93, 4.47, 3.17, 4.38, 2.8, 2.27, 
    7.4, 0, 2.76, 5.55)), .Names = c("sample_id", "rsem.fpkm"
    ), class = "data.frame", row.names = c(NA, 21L)))), .Names = c("_index", 
"_type", "_id", "_score", "_source.samples"), class = "data.frame", row.names = 1:2)

Thanks!

Komal Rathi
  • 4,164
  • 13
  • 60
  • 98
  • That google drive link probably won't be around forever... could you provide a minimal example using `dput` in the text? –  Mar 28 '17 at 23:51
  • @dash2 I will update the question with a reproducible example. – Komal Rathi Mar 29 '17 at 00:35

1 Answers1

2

These columns do not get flattened, because they're originally embedded dictionaries within your json file.

Here's what you can do, though it's not totally intuitive:

First, you create a new dataframe with the content of your column _source.samples. You need to make sure you keep the _id column to identify which row each data comes from.

samples <- mapply(function(x, y) cbind(x, _id = y), dat$`_source.samples, dat$_id, SIMPLIFY = FALSE)

Then you merge it with your dat.

merge(dat, rbind.pages(samples), by= "_id")
Laurent
  • 1,914
  • 2
  • 11
  • 25
  • How do I specify that during the import? My goal is to expand the column that contains the dictionaries (which basically become dataframes on import) into separate rows. – Komal Rathi Mar 29 '17 at 00:27
  • Umm if you look at my code in the question details - I am already using `jsonlite::fromJSON(out, flatten = TRUE, simplifyDataFrame = TRUE)` – Komal Rathi Mar 29 '17 at 13:15
  • Hope this helps – Laurent Mar 29 '17 at 16:39