I have a txt file with the following sample data:
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value
In fact, the original data is in xml/json format. But the data has been flatten. So the order of the values is important.
The idea is to get the structured data as below:
I have to R code that works as follows:
txt <- "
id,001
v1,some_value
id,002
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"
existing_list <- c(id = "", v1 = "", v2 = "", v3 = "", v4 = "")
df=read.csv(textConnection(txt),header = F,stringsAsFactors = F)
id_list <- split(df, cumsum(df$V1 == "id"))
do.call(rbind, lapply(id_list, function(x) {
vec <- setNames(x$V2, x$V1)
existing_list[match(names(vec), names(existing_list))] <- vec
as.data.frame(as.list(existing_list))
}))
The problem is that it does not work for the following data
txt <- "
id,001
v1,some_value
id,002
v2,some_value
v2,some_value
id,003
v2,some_value
id,004
v4,some_value"
So my question is how to modify the R code to make it work for the second dataset.
Another aproach would be to convert the flatten txt data to json, then with a package like rjson
is would be easy to parse the data. But I have no idea how to do it.
{
"items": [
{
"id": "001",
"attributes": [
{
"v1": "some_value"
}
]
},
{
"id": "002",
"attributes": [
{
"v2": "some_value"
},
{
"v2": "some_value"
}
]
},
{
"id": "003",
"attributes": [
{
"v2": "some_value"
}
]
}
]
}
[update] akrun provided a very useful answer, but then I realized that the structure can be nested.
txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
v2,valuev2_1
subid,002002"
This is to be transform into
the red part to be completed.
And with the answer akrun provided, I think that we would not be able to distinguish the previous data from this one:
txt <- "id,001
v1,some_value
id,002
v1,some_value
subid,002001
subid,002002
v2,valuev2_1"
Because when examing the columns of the tibble, we have the same:
So the ideal solution would be to convert the csv to json. With the hierachical structure of the keys provided of course. But maybe I am wrong.
One step to be accomplished is to transform the tibble with list-cols into a tibble with normal columns.