0

I am trying to parse a a column within my dataframe that resembles a JSON-like structure into a new dataframe. Although, when I run my code, it is only outputting the first row of the dataframe.

How do I alter what I currently have so that it applies to every row within the dataframe? The string isn't always a consistent length... Sometimes there are many more pairs than others.

Sample dataframe:

df <- data.frame(
  col = c(
    "[{guid abc123-def456-bf1239435ahfs} {id <nil>} {start_timestamp 1688573993888} {end_timestamp <nil>} {active true}]",
    "[{guid xyz987-pqr654-lmno0987zxywu} {id <nil>} {start_timestamp 1688574000000} {active false}]"
  ),
  stringsAsFactors = FALSE
)

What I have tried so far:

# Extract key-value pairs from the string
pairs <- str_match_all(df$col, "\\{(.*?)\\s(.*?)\\}")[[1]]

# Convert the key-value pairs to a data frame
new_df <- as.data.frame(pairs[, -1], stringsAsFactors = FALSE)

wide_df <- pivot_wider(new_df, names_from = V1, values_from = V2)

Output: enter image description here

matt
  • 13
  • 3

2 Answers2

1

You can use a bit of string parsing using strsplit. This keeps the output from each row in its own list element which you can lapply into a single-row data frame. Once you have all your single-row data frames you can bind them into a single data frame with the same number of rows as the original.

All this can be done in base R:

do.call(rbind,
  strsplit(df$col, '\\[\\{|\\} \\{|\\}\\]') |>
  lapply(function(x) strsplit(x[-1], ' ')) |>
  lapply(function(x) t(setNames(sapply(x, `[`, 2), sapply(x, `[`, 1)))) |>
  lapply(as.data.frame)
)
#>                          guid    id start_timestamp end_timestamp active
#> 1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
#> 2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000         <nil>   true
Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
1

Using base R:

a <- gsub("(\\w+) ([^{} ]+)", "\\1:\\2", df$col)
b <- gsub("[}{ ]+","\n", a)
read.dcf(textConnection(gsub("[][]+", "", b)), all = TRUE)
                         guid    id start_timestamp end_timestamp active
1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000          <NA>  false

Explanation:

  • Capture the first and second word, put a collon between them. Notice that the second part is to capture everything which does not include curly braces.
  • Replace one or more of the curly braces and the space between them with a line feed.
  • Remove the brackets and read the data into a dataframe:

using other packages:

You could structure your string to be a valid JSON object:

d <- gsub("(\\w+) ([^{} ]+)", '"\\1":"\\2"', df$col)
e <- sprintf("[%s]", toString(gsub("\\} \\{", ",", d)))
dplyr::bind_rows(jsonlite::fromJSON(e))

                       guid    id start_timestamp end_timestamp active
1 abc123-def456-bf1239435ahfs <nil>   1688573993888         <nil>   true
2 xyz987-pqr654-lmno0987zxywu <nil>   1688574000000          <NA>  false

Explanation:

  • Capture the first and second parts and encapsulate them with quotes separating them with a comma. ie "key":"value" pair.
  • Replace the } { with a comma.
  • Paste the string together and read it into R using jsonlite
Onyambu
  • 67,392
  • 3
  • 24
  • 53