
I am trying to take a collection of strings, split each string into individual characters, and restructure the result as hierarchical JSON in order to build a cluster dendrogram visualization (something like this word tree, but for strings instead of sentences). As a result, a sequence of characters is often shared across several strings in the data.

So, for example, let's say I have a text file that looks like:

xin_qn2
x_qing4n3
x_qing4nian_

This is all I anticipate for my input; there are no CSV headings or anything else associated with the data. The JSON object would then look something like:

{
    "name": "x",
    "children": [
        {
            "name": "i"
        },
        {
            "name": "_",
            "children": [
                {
                    "name": "q"
                }
            ]
        }
    ]
}

And so on. I've been trying to structure the data ahead of time, before sending it to D3.js, using Ruby to split the lines into individual characters, but I'm stuck trying to figure out how to structure this in hierarchical JSON.

file_contents = File.open("single.txt", "r")

file_contents.readlines.each do |line|
  parse = line.scan(/[A-Za-z][^A-Za-z]*/)
  puts parse
end

I could do this in the browser with D3.js instead; I just haven't tried that yet.

Just wondering if there are any suggestions, pointers, or existing tools/scripts that might help me out. Thanks!

Update 2014-10-02

So I've spent a little time trying this in Python, but I keep getting stuck. I now see that I'm not handling the "children" elements correctly either. Any suggestions?

Attempt One

#!/usr/bin/python

from collections import defaultdict
import json

def tree():
    return defaultdict(tree)

file_out = open('out.txt', 'wb')

nested = defaultdict(tree)

with open("single.txt") as f:
    for line in f:
        o = list(line)
        char_lst = []
        for chars in o:
            d = {}
            d['name']=chars
            char_lst.append(d)
        for word in d:
            node = nested
            for char in word:
                node = node[char.lower()]
                print node

print(json.dumps(nested))

Attempt Two

#!/usr/bin/python

from collections import defaultdict
import json

def tree():
    return defaultdict(tree)

nested = defaultdict(tree)

words = list(open("single.txt"))
words_output = open("out.json", "wb")

for word in words:
    node = nested
    for char in word:
        node = node[char.lower()]

def print_nested(d, indent=0):
  for k, v in d.iteritems():
    print '{}{!r}:'.format(indent * ' ', k)
    print_nested(v, indent + 1)

print_nested(nested)
Jason Heppler
  • You need to make a bunch of dictionaries, then store them in lists. I can't say about Ruby but Python makes this very easy. – Union find Sep 28 '14 at 22:57

1 Answer


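You're almost there with attempt #2. Once the trailing newline is stripped from each line (otherwise it becomes a node of its own), adding print json.dumps(nested) to the end would print the following JSON (formatted for clarity):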

{
  "x":{
    "i":{
      "n":{
        "_":{
          "q":{
            "n":{
              "2":{

              }
            }
          }
        }
      }
    },
    "_":{
      "q":{
        "i":{
          "n":{
            "g":{
              "4":{
                "n":{
                  "i":{
                    "a":{
                      "n":{
                        "_":{

                        }
                      }
                    }
                  },
                  "3":{

                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Close, but not quite what you want. By the way, you can also convert the nested defaultdict into a regular dict using the following function:

def convert(d):
    return dict((key, convert(value)) for (key,value) in d.iteritems()) if isinstance(d, defaultdict) else d
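
For readers on Python 3, where dict.iteritems() no longer exists, the build-and-convert steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the three sample strings from the question are held in a list instead of being read from single.txt, and build_trie and to_plain_dict are hypothetical names, not part of the code in this answer.

```python
from collections import defaultdict


def tree():
    # Each missing key creates another nested tree on first access.
    return defaultdict(tree)


def build_trie(words):
    """Insert every word, character by character, into a nested defaultdict."""
    root = tree()
    for word in words:
        node = root
        for char in word.lower():
            node = node[char]  # creates the child if it does not exist yet
    return root


def to_plain_dict(d):
    """Recursively turn nested defaultdicts into regular dicts."""
    return {key: to_plain_dict(value) for key, value in d.items()}


# Assumption: the sample input from the question, kept in memory.
sample = ["xin_qn2", "x_qing4n3", "x_qing4nian_"]
plain = to_plain_dict(build_trie(sample))
print(sorted(plain["x"]))  # the two branches under "x": ['_', 'i']
```

Shared prefixes collapse automatically because each character lookup reuses the existing child when one is already there.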

But we still only have a dict of dicts (of dicts...). Using recursion, we can convert it to your required format as follows:

def format(d):
    children = []
    for (key, value) in d.iteritems():
        children += [{"name":key, "children":format(value)}]
    return children
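
As a rough Python 3 counterpart to the recursive step above, the sketch below converts a small hand-written dict of dicts into the name/children layout. The name to_children is hypothetical (chosen so the built-in format is not shadowed), and the keys are sorted only to make the output order stable; neither is part of the original answer.

```python
import json


def to_children(d):
    """Convert {"a": {"b": {}}} into [{"name": "a", "children": [...]}]."""
    return [
        {"name": key, "children": to_children(value)}
        for key, value in sorted(d.items())  # sorted only for stable output
    ]


# Assumption: a tiny dict of dicts standing in for the converted trie.
nested_sample = {"x": {"i": {}, "_": {"q": {}}}}
result = to_children(nested_sample)
print(json.dumps(result, indent=2))
```

The result is a list, whereas the target in the question is a single root object. When every string starts with the same character, as in the sample data, result[0] is that root; otherwise the list could be wrapped in an extra top-level node.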

Finally, let's print out the JSON:

print json.dumps(format(convert(nested)))

This prints the following JSON (formatted for clarity):

[
  {
    "name":"x",
    "children":[
      {
        "name":"i",
        "children":[
          {
            "name":"n",
            "children":[
              {
                "name":"_",
                "children":[
                  {
                    "name":"q",
                    "children":[
                      {
                        "name":"n",
                        "children":[
                          {
                            "name":"2",
                            "children":[

                            ]
                          }
                        ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "name":"_",
        "children":[
          {
            "name":"q",
            "children":[
              {
                "name":"i",
                "children":[
                  {
                    "name":"n",
                    "children":[
                      {
                        "name":"g",
                        "children":[
                          {
                            "name":"4",
                            "children":[
                              {
                                "name":"n",
                                "children":[
                                  {
                                    "name":"i",
                                    "children":[
                                      {
                                        "name":"a",
                                        "children":[
                                          {
                                            "name":"n",
                                            "children":[
                                              {
                                                "name":"_",
                                                "children":[

                                                ]
                                              }
                                            ]
                                          }
                                        ]
                                      }
                                    ]
                                  },
                                  {
                                    "name":"3",
                                    "children":[

                                    ]
                                  }
                                ]
                              }
                            ]
                          }
                        ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]

Here's the complete code:

#!/usr/bin/python

from collections import defaultdict
import json

def tree():
    return defaultdict(tree)

nested = defaultdict(tree)

# splitlines() drops the trailing newline so it does not become a node
with open("single.txt") as f:
    words = f.read().splitlines()

for word in words:
    node = nested
    for char in word:
        node = node[char.lower()]

def convert(d):
    return dict((key, convert(value)) for (key,value) in d.iteritems()) if isinstance(d, defaultdict) else d

def format(d):
    children = []
    for (key, value) in d.iteritems():
        children += [{"name":key, "children":format(value)}]
    return children

print json.dumps(format(convert(nested)))
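
The code above prints the JSON rather than saving it. If the goal is a file that D3.js can load, one possible sketch in Python 3 is shown below. The helper name write_tree and the one-node sample list are assumptions made for illustration; the out.json filename is taken from the question.

```python
import json


def write_tree(children, path="out.json"):
    """Serialize the list of node dicts to a JSON file."""
    # Text mode: json.dump writes str, so "wb" would fail on Python 3.
    with open(path, "w") as fh:
        json.dump(children, fh, indent=2)


# Assumption: a one-node tree standing in for the real converted output.
write_tree([{"name": "x", "children": []}])
```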
OrionMelt