
I have a JSON-like string that represents a nested structure. It is not real JSON, because the names and values are not quoted. I want to parse it into a nested structure, e.g. a list of lists.

#example:
x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"

The result should look like this:

x_list = list(a=1,b=2,c=c(1,2,3),d=list(e="something"))

Is there a convenient function I don't know about that does this kind of parsing?

Thanks.

amit
  • The no-quotes thing is nasty. What are the restrictions on what kinds of non-numeric values are possible? It might be possible to change this to valid R code by using regex and then `eval(parse())` the result. – Roland Dec 21 '17 at 13:37
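
The idea in the comment above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `to_r_list` is hypothetical, and it assumes that every non-numeric value is a single bare word made of letters, digits, underscores, or dots. Because it ends in `eval(parse())`, it should only be used on input you trust.

```r
# Hypothetical helper: rewrite the JSON-like string as an R expression, then evaluate it.
# Assumption: non-numeric values are single bare words (no spaces, no quotes).
to_r_list <- function(x) {
  # Quote bare-word values such as `=something`; numbers, `[` and `{` are left alone
  x <- gsub("= *([A-Za-z_][A-Za-z0-9_.]*)", "=\"\\1\"", x)
  # Map the delimiters onto R constructors
  x <- gsub("{", "list(", x, fixed = TRUE)
  x <- gsub("[", "c(", x, fixed = TRUE)
  x <- gsub("}", ")", x, fixed = TRUE)
  x <- gsub("]", ")", x, fixed = TRUE)
  # Evaluate the resulting R code (unsafe for untrusted input)
  eval(parse(text = x))
}

x_string <- "{a=1, b=2, c=[1,2,3], d={e=something}}"
x_list <- to_r_list(x_string)
str(x_list)
```

A value containing spaces (for example `e=some thing`) would break this sketch, since only the first word would be quoted.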

2 Answers


If all of your data is consistent with this format, there is a simple solution using regex and the jsonlite package. The code is:

if(!require(jsonlite, quietly=TRUE)){ 
    # If the package is not installed: install it, then load it into the R session.
    install.packages("jsonlite",repos="https://ftp.heanet.ie/mirrors/cran.r-project.org")
    library(jsonlite)
}

x_string = "{a=1, b=2, c=[1,2,3], d={e=something}}"

json_x_string = "{\"a\":1, \"b\":2, \"c\":[1,2,3], \"d\":{\"e\":\"something\"}}"
fromJSON(json_x_string)

s <- gsub( "([A-Za-z]+)", "\"\\1\"",  gsub( "([A-Za-z]*)=", "\\1:", x_string ) )

fromJSON( s )

The first section checks whether the package is installed. If it is, it loads it; otherwise it installs it and then loads it. I usually include this in any R code I write to make it simpler to transfer between machines and people.

Your string is x_string; we want it to look like json_x_string, which gives the desired output when we call fromJSON().

The regex is split into two steps. I'm pretty sure it could be made more elegant, but since it depends on your data being consistent I'll leave it like this for now. The first step changes "=" to ":", and the second adds quotation marks around all groups of letters. Calling fromJSON(s) gives the output:

fromJSON(s)
$a
[1] 1

$b
[1] 2

$c
[1] 1 2 3

$d
$d$e
[1] "something"
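
To make the two steps easier to follow, here is a small sketch that prints the intermediate string and then shows one possible variant. The variant is an assumption on my part, not part of the original answer: the helper name `to_json_string` is hypothetical, and it quotes whole tokens that start with a letter or underscore, because the letters-only pattern would split a token such as `x2` into `"x"2`, which is not valid JSON.

```r
x_string <- "{a=1, b=2, c=[1,2,3], d={e=something}}"

# Step 1 of the original approach: turn name=value into name:value
step1 <- gsub("([A-Za-z]*)=", "\\1:", x_string)
print(step1)

# Step 2 of the original approach: wrap every run of letters in double quotes
step2 <- gsub("([A-Za-z]+)", "\"\\1\"", step1)
print(step2)

# Hypothetical variant: quote whole identifier-like tokens (letters, digits, underscores)
to_json_string <- function(x) {
  x <- gsub("=", ":", x, fixed = TRUE)
  gsub("\\b([A-Za-z_][A-Za-z0-9_]*)\\b", "\"\\1\"", x, perl = TRUE)
}
mixed <- to_json_string("{k=x2, n=10}")
print(mixed)

# Parse only if jsonlite happens to be installed
if (requireNamespace("jsonlite", quietly = TRUE)) {
  str(jsonlite::fromJSON(to_json_string(x_string)))
}
```

Both versions still treat every bare word as a string, so tokens such as `true` or `null` would come back as character values rather than as JSON literals.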

AodhanOL

I would rather avoid JSON parsing because of its lack of extensibility and flexibility, and stick to a solution based on regex + recursion.

Here is an extensible base implementation that parses your input string as desired.

The main recursion function:

# Parse string
parse.string = function(.string){
  regex = "^((.*)=)??\\{(.*)\\}"

  # Recursion termination: element parsing
  if(iselement(.string)){
    return(parse.element(.string))
  }

  # Extract components 
  elements.str = gsub(regex, "\\3", .string)
  elements.vector = get.subelements(elements.str)

  # Recursively parse each element
  parsed.elements = list(sapply(elements.vector, parse.string, USE.NAMES = FALSE))

  # Extract list's name and return 
  name = gsub(regex, "\\2", .string)
  names(parsed.elements) = name
  return(parsed.elements)
}


Helper functions:

library(stringr)

# Test if the string is a base element
iselement = function(.string){
  grepl("^[^[:punct:]]+=[^\\{\\}]+$", .string)
}

# Parse element
parse.element = function(element.string){
  splits = strsplit(element.string, "=")[[1]]
  element = splits[2]

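  # Parse numeric elements (suppressWarnings silences the NA-coercion warning for non-numeric values)
  numeric.element = suppressWarnings(as.numeric(element))
  if(!is.na(numeric.element)){
    element = numeric.element
  }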

  # TODO: Extend here to include vectors

  # Reformat and return 
  element = list(element)
  names(element) = splits[1]
  return(element)
}

# Get subelements from a string
get.subelements = function(.string){
  # Regex of allowed elements - Extend here to include more types 
  elements.regex = c("[^, ]+?=\\{.+?\\}", #Sublist
                     "[^, ]+?=\\[.+?\\]", #Vector
                     "[^, ]+?=[^=,]+")    #Base element
  str_extract_all(.string, pattern = paste(elements.regex, collapse = "|"))[[1]]
}


Parsing results:

string = "{a=1, b=2, c=[1,2,3], d={e=something}}"
string_2 = "{a=1, b=2, c=[1,2,3], d=somthing}"
named_string = "xyz={a=1, b=2, c=[1,2,3], d={e=something, f=22}}"
named_string_2 = "xyz={d={e=something, f=22}}"

parse.string(string)
# [[1]]
# [[1]]$a
# [1] 1
# 
# [[1]]$b
# [1] 2
# 
# [[1]]$c
# [1] "[1,2,3]"
# 
# [[1]]$d
# [[1]]$d$e
# [1] "something"
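
In the output above, `c` is still the raw string "[1,2,3]", because vector parsing is left as a TODO in parse.element. One way that extension might look is sketched below. The helper name `parse.vector` is hypothetical and is not part of the code above. It assumes flat vectors whose items contain no commas, and it falls back to a character vector when any item is not numeric.

```r
# Hypothetical helper: convert a string such as "[1,2,3]" into an R vector.
# Assumption: the vector is flat and items do not themselves contain commas.
parse.vector = function(vector.string){
  # Strip the surrounding square brackets
  inner = gsub("^\\[|\\]$", "", trimws(vector.string))
  # Split on commas and trim whitespace around each item
  items = trimws(strsplit(inner, ",")[[1]])
  # Use a numeric vector only if every item converts cleanly
  numbers = suppressWarnings(as.numeric(items))
  if(length(items) > 0 && !anyNA(numbers)){
    return(numbers)
  }
  items
}

numeric.result = parse.vector("[1,2,3]")
character.result = parse.vector("[a, b]")
print(numeric.result)
print(character.result)
```

It could be called from parse.element at the TODO whenever the value starts with "[" and ends with "]".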
Deena