
I am trying to transform a rather complicated, nested JSON file into a data frame using R. After trying a lot of different approaches, I do not seem to get any further with this. I feel like I just lack knowledge of the right functions here.

An example of the JSON file can be fetched with the following code:

library(tidyverse)
library(httr)
library(jsonlite)

pdb_ids <- c("6HG1", "1E9I", "6D3Q")
pdb_ids <- paste0(pdb_ids, collapse = '", "')

query <- 'query={
  entries(entry_ids: ["pdb_ids"]) {
    rcsb_id
    exptl {
      method
    }
    struct_keywords {
      pdbx_keywords
    }
    exptl_crystal_grow {
      pH
      temp
      method
    }
    rcsb_binding_affinity {
      comp_id
      value
    }
    rcsb_entry_info {
      experimental_method
      assembly_count
      resolution_combined
      inter_mol_metalic_bond_count
    }
    pdbx_nmr_exptl_sample_conditions {
      ionic_strength
      pH
      temperature
    }
    pdbx_nmr_refine {
      method
    }
    pdbx_nmr_exptl {
      type
    }
    polymer_entities {
      entity_poly {
        pdbx_seq_one_letter_code_can
      }
      rcsb_entity_source_organism {
        ncbi_scientific_name
        ncbi_taxonomy_id
      }
      rcsb_polymer_entity_container_identifiers {
        entry_id
        auth_asym_ids
      }
      rcsb_polymer_entity_align {
        aligned_regions {
          entity_beg_seq_id
          ref_beg_seq_id
          length
        }
        reference_database_accession
        reference_database_name
      }
      uniprots {
        rcsb_uniprot_container_identifiers {
          uniprot_id
        }
        rcsb_uniprot_protein {
          name {
            value
          }
        }
      }
    }
    nonpolymer_entities {
      rcsb_nonpolymer_entity_container_identifiers {
        auth_asym_ids
        entry_id
      }
      nonpolymer_comp {
        chem_comp {
          id
          type
          formula_weight
          name
          formula
        }
      }
    }
  }
}'

full_query <- stringr::str_replace_all(query, pattern = "pdb_ids", replacement = pdb_ids)

url_encode_query <- utils::URLencode(full_query) %>% 
  stringr::str_replace_all(pattern = "\\[", replacement = "%5B") %>% 
  stringr::str_replace_all(pattern = "\\]", replacement = "%5D")

pdb_query_list <- httr::GET(httr::modify_url("https://data.rcsb.org/graphql", query = url_encode_query)) %>% 
  httr::content(as = "text") %>% 
  jsonlite::fromJSON()
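
For reference, inspecting the returned object with a plain str() call (depth limited to keep the output readable) shows where the actual content sits:

# data$entries should be a data frame with one row per PDB ID and
# list columns for the nested GraphQL fields
str(pdb_query_list, max.level = 3)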

I transform the list into a data frame.

pdb_query_df <- pdb_query_list %>% 
 as.data.frame(stringsAsFactors = FALSE) 

This data frame contains nested columns. I would like to unnest all of these columns to have a data frame without any list columns in the end. Some of these columns contain data that belong together.

The problem is that not every field is returned for every PDB ID. Therefore, some columns contain NULL (example: data.entries.rcsb_binding_affinity). If I try to use unnest() on those columns, the entries that contain NULL disappear. I tried to replace the NULLs with NA, but that does not work.
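
For illustration, unnesting just that column drops the affected entries instead of keeping them as NA (a minimal sketch, assuming the column names produced by as.data.frame() above):

pdb_query_df %>% 
  select(data.entries.rcsb_id, data.entries.rcsb_binding_affinity) %>% 
  # entries where rcsb_binding_affinity is NULL are silently dropped here
  unnest(data.entries.rcsb_binding_affinity)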

To make it easier, it would help to split the data frame into three parts, do the unnesting on the individual data frames, and then join everything back together in the end.

info <- pdb_query_df %>%
  select(-c(data.entries.polymer_entities, data.entries.nonpolymer_entities)) 

polymer <- pdb_query_df %>%
  select(data.entries.polymer_entities) 

nonpolymer <- pdb_query_df %>%
  select(data.entries.nonpolymer_entities) 

However, I still do not know how to go on from here since everything I tried failed.

non_polymer_final <- nonpolymer %>% 
  unnest(data.entries.nonpolymer_entities)

For example, unnesting the relatively simple nonpolymer causes another weird phenomenon: the resulting data frame has 2 variables; however, when I look at it in RStudio with view(nonpolymer), it seems to have more columns, with the right combinations in each row. I cannot figure out a way to transform this into a "normal" data frame that is not nested.
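
For what it is worth, I suspect the two variables are themselves nested data frames (which would explain why more columns show up in the viewer), but I do not know how to take it from there. A rough sketch of what I mean (jsonlite::flatten() is an assumption on my part; as far as I know it only expands data-frame columns, not list columns such as auth_asym_ids):

# show the structure of the unnested result; the two variables appear to be
# data frames packed inside the outer data frame
str(non_polymer_final, max.level = 2)

# flatten nested data-frame columns into regular columns;
# list columns like auth_asym_ids remain as lists
nonpolymer_flat <- jsonlite::flatten(non_polymer_final)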

It would be great if someone had an idea of how to create a sensible, non-nested data frame out of this data!

jpquast
  • More of a comment really, but the `ghql` package might be easier for running GraphQL queries from R. There are a lot of irregularly shaped nested objects in the response, but there may be some `data.table` options. This SO answer seemed to get at what you are looking for: https://stackoverflow.com/a/28287905/5963303. There are also a few possibilities in this SO question: https://stackoverflow.com/q/48542874/5963303. Perhaps there's something there? – dcruvolo Nov 25 '20 at 23:00

0 Answers