I am working with a dataframe that contains a column called stage that is in a custom string format. It seems to resemble JSON but isn't quite in that format. My initial thought was to try and convert it to JSON so then I could use some existing functions to convert it into the format I want.
I am trying to take that stage column, parse each of the key value pairs within, and make those into their own columns (key) and the values being in one row.
What I've tried so far:
library(stringr)
library(dplyr)
library(tidyr)
# Sample dataframe
df <- data.frame(Stage = c("[{stage_type drive} {guid 1234a-f3f0-123b-c123d-e123456789f} {event_start_timestamp_ms 1684469450000} {event_end_timestamp_ms <nil>} {stage_number <nil>} {active true}]"))
# Extract key-value pairs using regular expressions
pattern <- "\\{(.*?)\\s(.*?)\\}"
df$pairs <- str_extract_all(df$Stage, pattern)
# Remove empty strings from pairs column
df$pairs <- lapply(df$pairs, function(x) x[x != ""])
# Split the pairs into separate columns
df_split <- df %>%
unnest(pairs) %>%
separate(pairs, into = c("key", "value"), sep = " ", remove = FALSE) %>%
mutate(value = ifelse(value == "", NA, value)) %>%
mutate(key = gsub('[^A-Za-z0-9_]','', key),
value = gsub('[^A-Za-z0-9_-]','', value)
) %>%
select(-Stage)
df_pivot <- df_split %>%
pivot_wider(names_from = key, values_from = value) %>%
select(-c(key, pairs))
I've made it this far with my code, however when I get to the final step df_pivot, it outputs the correct columns, but it is 6 rows of data which looks like this: Current Results
I am trying to get it in the following format, but I am missing some step that will get me there.
stage_type | guid | event_start_timestamp_ms | event_end_timestamp_ms | stage_number |
---|---|---|---|---|
drive | 1234a-f3f0-123b-c123d-e123456789f | 1684469450000 | nil | nil |
Am I approaching this the wrong way?