How to parse a JSON like structure for every row in a dataframe?

Question

I am trying to parse a a column within my dataframe that resembles a JSON-like structure into a new dataframe. Although, when I run my code, it is only outputting the first row of the dataframe.

How do I alter what I currently have so that it applies to every row within the dataframe? The string isn't always a consistent length... Sometimes there are many more pairs than others.

Sample dataframe:

df <- data.frame(
  col = c(
    "[{guid abc123-def456-bf1239435ahfs} {id <nil>} {start_timestamp 1688573993888} {end_timestamp <nil>} {active true}]",
    "[{guid xyz987-pqr654-lmno0987zxywu} {id <nil>} {start_timestamp 1688574000000} {active false}]"
  ),
  stringsAsFactors = FALSE
)

What I have tried so far:

# Extract key-value pairs from the string
pairs <- str_match_all(df$col, "\\{(.*?)\\s(.*?)\\}")[[1]]

# Convert the key-value pairs to a data frame
new_df <- as.data.frame(pairs[, -1], stringsAsFactors = FALSE)

wide_df <- pivot_wider(new_df, names_from = V1, values_from = V2)

Output:

Allan Cameron · Accepted Answer · 2023-07-18T13:43:19.663

You can use a bit of string parsing using strsplit. This keeps the output from each row in its own list element which you can lapply into a single-row data frame. Once you have all your single-row data frames you can bind them into a single data frame with the same number of rows as the original.

All this can be done in base R:

do.call(rbind,
  strsplit(df$col, '\\[\\{|\\} \\{|\\}\\]') |>
  lapply(function(x) strsplit(x[-1], ' ')) |>
  lapply(function(x) t(setNames(sapply(x, `[`, 2), sapply(x, `[`, 1)))) |>
  lapply(as.data.frame)
)
#>                          guid    id start_timestamp end_timestamp active
#> 1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
#> 2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000         <nil>   true

the second `end_timestamp` should be `NA` – Onyambu Jul 18 '23 at 14:59 — Onyambu, Jul 18 '23 at 14:59

Onyambu · Answer 2 · 2023-07-18T15:14:55.623

Using base R:

a <- gsub("(\\w+) ([^{} ]+)", "\\1:\\2", df$col)
b <- gsub("[}{ ]+","\n", a)
read.dcf(textConnection(gsub("[][]+", "", b)), all = TRUE)
                         guid    id start_timestamp end_timestamp active
1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000          <NA>  false

Explanation:

Capture the first and second word, put a collon between them. Notice that the second part is to capture everything which does not include curly braces.
Replace one or more of the curly braces and the space between them with a line feed.
Remove the brackets and read the data into a dataframe:

using other packages:

You could structure your string to be a valid JSON object:

d <- gsub("(\\w+) ([^{} ]+)", '"\\1":"\\2"', df$col)
e <- sprintf("[%s]", toString(gsub("\\} \\{", ",", d)))
dplyr::bind_rows(jsonlite::fromJSON(e))

                       guid    id start_timestamp end_timestamp active
1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000          <NA>  false

Explanation:

Capture the first and second parts and encapsulate them with quotes separating them with a comma. ie "key":"value" pair.
Replace the } { with a comma.
Paste the string together and read it into R using jsonlite

How to parse a JSON like structure for every row in a dataframe?

2 Answers2