Data parsing with R: flattened JSON data

Question

I have a txt file with the following sample data:

id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value

The original data is in xml/json format, but it has been flattened, so the order of the values is important.

The idea is to get the structured data as below:

I have R code that works for the following data:

txt <- "
id,001
v1,some_value
id,002
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")

df <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE)

id_list <- split(df, cumsum(df$V1 == "id"))

do.call(rbind, lapply(id_list, function(x) {
  vec <- setNames(x$V2, x$V1)
  existing_list[match(names(vec), names(existing_list))] <- vec
  as.data.frame(as.list(existing_list))
  }))

The problem is that it does not work for the following data, where a key (v2) is repeated within one id:

txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

So my question is how to modify the R code to make it work for the second dataset.
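A note on why the repeated key is lost: `match()` maps both `v2` entries to the same slot of `existing_list`, so the second assignment overwrites the first. The base-R sketch below reproduces that overwrite and then shows one possible workaround, a long table with a counter for repeated keys. The names `template`, `long`, `key`, `value` and `occurrence` are illustrative and not part of the original code.

```r
# 1) Why the duplicate is lost: both "v2" entries map to the same slot.
template <- c(id = "", v1 = "", v2 = "")
vec <- c(id = "002", v2 = "first", v2 = "second")
template[match(names(vec), names(template))] <- vec
template[["v2"]]   # only the last assigned value remains in this slot

# 2) One possible workaround: keep the data long and number the repeats.
txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

# Read everything as character so that ids such as "001" keep their zeros.
df <- read.csv(text = txt, header = FALSE, colClasses = "character")

# Carry the id of each record down to its attribute rows.
grp <- cumsum(df$V1 == "id")
df$id <- df$V2[df$V1 == "id"][grp]

# Drop the id rows and count how often each key occurs within an id.
long <- df[df$V1 != "id", c("id", "V1", "V2")]
names(long) <- c("id", "key", "value")
long$occurrence <- ave(seq_len(nrow(long)), long$id, long$key, FUN = seq_along)
long
```

In this layout every original row survives, and `occurrence` records the order of repeated keys within an id, so nothing depends on a fixed set of columns.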

Another approach would be to convert the flattened txt data to json; then, with a package like rjson, it would be easy to parse the data. But I have no idea how to do the conversion. The target json would look like this:

{
  "items": [
    {
      "id": "001",
      "attributes": [
        {
          "v1": "some_value"
        }
      ]
    },
    {
      "id": "002",
      "attributes": [
        {
          "v2": "some_value"
        },
        {
          "v2": "some_value"
        }
      ]
    },
    {
      "id": "003",
      "attributes": [
        {
          "v2": "some_value"
        }
      ]
    }
  ]
}

[update] akrun provided a very useful answer, but then I realized that the structure can be nested.

txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"

This is to be transformed into

the red part to be completed.

And with the answer akrun provided, I think that we would not be able to distinguish the previous data from this one:

txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"

Because when examining the columns of the tibble, we get the same result for both:

So the ideal solution would be to convert the csv to json, with the hierarchical structure of the keys provided, of course. But maybe I am wrong.
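A minimal base-R sketch of that idea, under stated assumptions: the hierarchy is supplied as `level_keys`, a vector of record-opening keys ordered from the outermost to the innermost level, and every other key is treated as an attribute of the most recently opened record. The names `level_keys`, `build_node`, `flat_to_nested`, `attributes` and `children` are hypothetical, chosen only for illustration.

```r
# Assumption: record-opening keys, ordered from outermost to innermost level.
level_keys <- c("id", "subid")

# Build one node from a block of rows whose first row opens the node.
build_node <- function(keys, values, depth) {
  node <- list()
  node[[keys[1]]] <- values[1]
  rest_k <- keys[-1]
  rest_v <- values[-1]

  # Key that opens a child record, if a deeper level exists.
  child_key <- if (depth < length(level_keys)) level_keys[depth + 1] else NA_character_
  is_child <- !is.na(child_key) & rest_k == child_key
  first_child <- if (any(is_child)) which(is_child)[1] else length(rest_k) + 1

  # Rows before the first child record belong to this node itself.
  own <- seq_len(first_child - 1)
  node$attributes <- lapply(own, function(i) setNames(list(rest_v[i]), rest_k[i]))

  # Each child block runs from its opening row up to the next opening row.
  if (any(is_child)) {
    idx <- first_child:length(rest_k)
    blocks <- split(idx, cumsum(is_child[idx]))
    node$children <- lapply(unname(blocks), function(b) {
      build_node(rest_k[b], rest_v[b], depth + 1)
    })
  }
  node
}

flat_to_nested <- function(txt) {
  # Read as character so that values such as "002001" are kept verbatim.
  df <- read.csv(text = txt, header = FALSE, colClasses = "character")
  blocks <- split(seq_len(nrow(df)), cumsum(df$V1 == level_keys[1]))
  list(items = lapply(unname(blocks), function(b) build_node(df$V1[b], df$V2[b], 1)))
}

txt1 <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"

txt2 <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"

nested1 <- flat_to_nested(txt1)  # v2 is attached to subid 002001
nested2 <- flat_to_nested(txt2)  # v2 is attached to subid 002002
```

Because attributes are attached by position, the two inputs produce different nested lists. Assuming the jsonlite package is installed, either list could then be serialized with `jsonlite::toJSON(nested1, auto_unbox = TRUE, pretty = TRUE)`.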

One remaining step is to transform the tibble with list-cols into a tibble with normal columns.
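For a single list-column, `tidyr::unnest()` is one way to do this. The sketch below uses a small hypothetical tibble (`nested_tbl`, not taken from the question) in which one id has no value and another has two.

```r
library(tibble)
library(tidyr)

# Hypothetical tibble with one list-column, similar in shape to a reshaped result.
nested_tbl <- tibble(
  id = c("001", "002"),
  v2 = list(NULL, c("some_value", "some_value"))
)

# keep_empty = TRUE keeps ids whose list element is empty, filling them with NA.
flat_tbl <- unnest(nested_tbl, cols = v2, keep_empty = TRUE)
flat_tbl
```

With several list-columns whose elements have different lengths within a row, unnesting them together is likely to need extra handling, which is one reason a long or nested representation may be easier for this data.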

akrun · Answer 1 · 2022-03-26T20:06:28.790


We may reshape to 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
out <- df %>%
    group_by(grp = cumsum(V1 == 'id')) %>%
    mutate(id = first(V2)) %>%
    ungroup %>%
    filter(V1 != 'id') %>% 
    pivot_wider(names_from = V1, values_from = V2)

For the second example

library(purrr)
split(df, cumsum(df$V1 == "id")) %>%
   map_dfr(~ {
       x1 <- split(.x$V2, .x$V1)
       mx <- max(lengths(x1))
     map_dfr(x1, `length<-`, mx)}) %>% 
  fill(id, v1, .direction = "downup")

-output

# A tibble: 3 × 4
  id    v1         subid  v2       
  <chr> <chr>      <chr>  <chr>    
1 001   some_value <NA>   <NA>     
2 002   some_value 002001 valuev2_1
3 002   some_value 002002 <NA>

edited Mar 26 '22 at 20:06

answered Mar 26 '22 at 17:34



Master I need your help. If you have time (and only if it is suitable for you) please take a look here: – TarJae Mar 26 '22 at 17:56
Thank you akrun, I updated my question. The dataset can be nested, and the ideal structure has no list-cols. – John Smith Mar 26 '22 at 19:42
@JohnSmith in the update you have `002001`. That looks weird. How do you separate ids like `123421`, which can be either 123 and 421 or 12 and 3421? – akrun Mar 26 '22 at 19:44
The values of the subid can be anything, and I don't have to separate subids. I added subid as a key that is different from the previous ids. – John Smith Mar 26 '22 at 19:48
@JohnSmith Based on the example shown, perhaps the update helps? – akrun Mar 26 '22 at 19:52
Thank you akrun, I ran the code but the second subid is not extracted. – John Smith Mar 26 '22 at 19:56
@akrun thank you very much, this is actually very useful. However, there can be millions of text files with hundreds of keys organized in a nested structure with 5 levels, so the data frame would be huge. For efficiency, it may be better to convert the csv to json (I created [another question](https://stackoverflow.com/questions/71631941/parsing-flatten-json-with-r-or-python) for that), or we would have to create relational tables... – John Smith Mar 26 '22 at 21:36
@JohnSmith with your example I get the second subid though – akrun Mar 27 '22 at 14:58

Parfait · Answer 2 · 2022-03-26T22:52:05.917

Taking the JSON-building approach, consider reading the text data into a data frame and walking down the rows:

Input

library(jsonlite)

txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"

Process

# BUILD DATA FRAME FROM TEXT
lines_df <- read.csv(text=txt, header=FALSE)

# BUILD NESTED LIST
lines_lst <- list(items = list())
for(row in seq_len(nrow(lines_df))) {
   if(lines_df$V1[row] == "id"){
     lines_lst$items[[row]] <- list(id = lines_df$V2[row])
     lines_lst$items[[row]]$attributes <- list()
     curr <- row
     i <- 1
   }  else {
     lines_lst$items[[curr]]$attributes[[i]] <- setNames(
       list(lines_df$V2[row]), lines_df$V1[row]
     )
     i <- i + 1
   }
}

# REMOVE NULLs
lines_lst$items <- Filter(length, lines_lst$items)

# OUTPUT TO JSON
json_output <- toJSON(lines_lst, pretty=TRUE)

Output

json_output
{
  "items": [
    {
      "id": ["001"],
      "attributes": [
        {
          "v1": ["some_value"]
        }
      ]
    },
    {
      "id": ["002"],
      "attributes": [
        {
          "v2": ["some_value"]
        },
        {
          "v2": ["some_value"]
        }
      ]
    },
    {
      "id": ["003"],
      "attributes": [
        {
          "v2": ["some_value"]
        }
      ]
    },
    {
      "id": ["004"],
      "attributes": [
        {
          "v4": ["some_value"]
        }
      ]
    }
  ]
}
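The output above wraps every scalar in an array (`["001"]`), whereas the JSON in the question uses plain strings. This is jsonlite's default for length-one vectors, and the `auto_unbox` argument of `toJSON()` changes it. A small sketch on a single hypothetical item (`item` is not part of the answer's code):

```r
library(jsonlite)

# One item shaped like the elements of lines_lst$items.
item <- list(id = "001", attributes = list(list(v1 = "some_value")))

# Default: length-one vectors are written as one-element arrays.
boxed <- toJSON(item)

# auto_unbox = TRUE writes length-one vectors as scalars; lists stay arrays.
unboxed <- toJSON(item, auto_unbox = TRUE)

boxed
unboxed
```

Applied to the answer's code, this would mean calling `toJSON(lines_lst, pretty=TRUE, auto_unbox=TRUE)`, assuming scalar values are what the downstream consumer expects.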

Very nice Parfait. What if the structure is more nested, as in the example in the update? More generally, I have a file that indicates the parent category of each id. Can we generalize with this information? For example, the parent category of subid is id. – John Smith, Mar 26 '22 at 20:46
I created another question [here](https://stackoverflow.com/questions/71631941/parsing-flatten-json-with-r-or-python). This approach is preferred because the dataset is complex and there are millions of text files, so it would be nice to convert them all to json. – John Smith, Mar 26 '22 at 21:28
