0

I am working with a dataframe that contains a column called stage that is in a custom string format. It seems to resemble JSON but isn't quite in that format. My initial thought was to try and convert it to JSON so then I could use some existing functions to convert it into the format I want.

I am trying to take that stage column, parse each of the key value pairs within, and make those into their own columns (key) and the values being in one row.

What I've tried so far:

library(stringr)
library(dplyr)
library(tidyr)

# Sample dataframe
df <- data.frame(Stage = c("[{stage_type drive} {guid 1234a-f3f0-123b-c123d-e123456789f} {event_start_timestamp_ms 1684469450000} {event_end_timestamp_ms <nil>} {stage_number <nil>} {active true}]"))

# Extract key-value pairs using regular expressions
pattern <- "\\{(.*?)\\s(.*?)\\}"
df$pairs <- str_extract_all(df$Stage, pattern)

# Remove empty strings from pairs column
df$pairs <- lapply(df$pairs, function(x) x[x != ""])

# Split the pairs into separate columns
df_split <- df %>%
  unnest(pairs) %>%
  separate(pairs, into = c("key", "value"), sep = " ", remove = FALSE) %>%
  mutate(value = ifelse(value == "", NA, value)) %>%
  mutate(key = gsub('[^A-Za-z0-9_]','', key), 
         value = gsub('[^A-Za-z0-9_-]','', value)
) %>%
  select(-Stage)

df_pivot <- df_split %>%
  pivot_wider(names_from = key, values_from = value) %>%
  select(-c(key, pairs))

I've made it this far with my code, however when I get to the final step df_pivot, it outputs the correct columns, but it is 6 rows of data which looks like this: Current Results

I am trying to get it in the following format, but I am missing some step that will get me there.

stage_type guid event_start_timestamp_ms event_end_timestamp_ms stage_number
drive 1234a-f3f0-123b-c123d-e123456789f 1684469450000 nil nil

Am I approaching this the wrong way?

matt
  • 13
  • 3
  • 1
    You've almost got it! I simply removed the `pairs` column before using the `pivot` like this: `df_pivot <- df_split %>% select(-pairs) %>% pivot_wider(names_from = key, values_from = value)` to get the output you have listed – mfg3z0 May 30 '23 at 16:01
  • @mfg3z0 Thank you! I knew I was close but have been messing with it for a while now and over looked this idea. – matt May 30 '23 at 16:11

2 Answers2

1

@mfg3z0 solved it.

By removing pairs, prior to pivot_wider(), this allows the output to be one row rather than multiple.

df_pivot <- df_split %>% select(-pairs) %>% pivot_wider(names_from = key, values_from = value)
matt
  • 13
  • 3
0

in base R you could do:

a <- gsub("(\\w\\S+) ([^}]+)(\\} \\{)?", "\\1:\\2\n", df$Stage)
b <- gsub("\\[\\{|\\}\\]", "\n\n", a)
read.dcf(textConnection(b), all = TRUE)

  stage_type                              guid event_start_timestamp_ms event_end_timestamp_ms stage_number active
1      drive 1234a-f3f0-123b-c123d-e123456789f            1684469450000                  <nil>        <nil>   true

The first regex is to replace key pair} { to key:pair\n then the second regex is to [{ or }] to \n\n

Onyambu
  • 67,392
  • 3
  • 24
  • 53