I am attempting to take a collection of strings, tokenize the strings into individual characters, and restructure them into JSON for the purpose of building a cluster dendrogram visualization (sort of like this word tree, except for strings instead of sentences). As such, there are times when the sequence of characters is shared (or reshared) across the data.
So, for example, let's say I have a text file that looks like:
xin_qn2
x_qing4n3
x_qing4nian_
This is all I anticipate for my input; there are no CSV headings or anything else associated with the data. The JSON object would then look something like:
{
  "name": "x",
  "children": [
    {
      "name": "i"
    },
    {
      "name": "_",
      "children": [
        {
          "name": "q"
        }
      ]
    }
  ]
}
And so on. I've been trying to structure the data ahead of time, before sending it to D3.js, using Ruby to split the lines into individual characters, but I'm stuck trying to figure out how to structure this in hierarchical JSON.
file_contents = File.open("single.txt", "r")
file_contents.readlines.each do |line|
  # Each token is one letter plus any non-letter characters that follow it
  parse = line.scan(/[A-Za-z][^A-Za-z]*/)
  puts parse
end
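For reference, the structure described above could be sketched roughly as follows. This is in Python rather than Ruby, and it is only a sketch under some assumptions: `build_tree` is a hypothetical helper name, each line is split into single characters, and a synthetic top-level node called "root" is added so that inputs which do not all start with the same character still produce a single hierarchy.

```python
import json


def build_tree(lines, root_name="root"):
    """Build a name/children hierarchy in which shared prefixes share nodes."""
    root = {"name": root_name, "children": []}
    for line in lines:
        node = root
        for char in line.strip():  # strip() drops the trailing newline
            children = node.setdefault("children", [])
            # Reuse the child for this character if it already exists.
            for child in children:
                if child["name"] == char:
                    node = child
                    break
            else:
                # No existing child matched, so start a new branch.
                child = {"name": char}
                children.append(child)
                node = child
    return root


sample = ["xin_qn2", "x_qing4n3", "x_qing4nian_"]
result = build_tree(sample)
print(json.dumps(result, indent=2))
```

With the three sample lines, the "x" node ends up with two children, "i" and "_", which matches the shape of the JSON above. Leaf nodes get no "children" key at all.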
I could do this in the browser with d3.js instead; I just haven't tried that yet.
Just wondering if there are any suggestions, pointers, or existing tools/scripts that might help me out. Thanks!
Update 2014-10-02
So I've spent a little time trying this in Python, but I keep getting stuck. I also see now that I'm not handling the "children" elements correctly. Any suggestions?
Attempt One
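#!/usr/bin/python
from collections import defaultdict
import json


def tree():
    # Each missing key creates another nested defaultdict.
    return defaultdict(tree)


file_out = open('out.txt', 'wb')  # opened for output, but never written to
nested = defaultdict(tree)

with open("single.txt") as f:
    for line in f:
        o = list(line)  # individual characters, including the trailing newline
        char_lst = []
        for chars in o:
            d = {}
            d['name'] = chars
            char_lst.append(d)
            # Iterating over d yields its keys, so word is 'name' here,
            # not the character stored in d['name'].
            for word in d:
                node = nested
                for char in word:
                    node = node[char.lower()]
                    print(node)

print(json.dumps(nested))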
Attempt Two
#!/usr/bin/python
from collections import defaultdict
import json


def tree():
    return defaultdict(tree)


nested = defaultdict(tree)
words = list(open("single.txt"))  # each entry keeps its trailing newline
words_output = open("out.json", "wb")  # opened for output, but never written to

# Walk one level deeper per character, so shared prefixes share a branch.
for word in words:
    node = nested
    for char in word:
        node = node[char.lower()]


def print_nested(d, indent=0):
    # Print each key indented by its depth in the tree.
    for k, v in d.items():
        print('{}{!r}:'.format(indent * ' ', k))
        print_nested(v, indent + 1)


print_nested(nested)
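One possible way to get from the nested defaultdict in Attempt Two to the "name"/"children" layout is a small recursive conversion. The sketch below is written for Python 3 and makes a few assumptions: `to_children` is a hypothetical helper name, the sample lines are inlined instead of being read from single.txt, the trailing newline is stripped so it does not become a node, and a synthetic "root" node wraps the result.

```python
import json
from collections import defaultdict


def tree():
    return defaultdict(tree)


def to_children(mapping):
    """Turn {char: subtree} into a list of name/children nodes."""
    nodes = []
    for char, subtree in mapping.items():
        node = {"name": char}
        if subtree:  # leaves get no "children" key
            node["children"] = to_children(subtree)
        nodes.append(node)
    return nodes


nested = tree()
for word in ["xin_qn2\n", "x_qing4n3\n", "x_qing4nian_\n"]:
    node = nested
    for char in word.strip():  # strip() drops the trailing newline
        node = node[char.lower()]

hierarchy = {"name": "root", "children": to_children(nested)}
print(json.dumps(hierarchy, indent=2))
```

The child order here follows dict insertion order, which is guaranteed from Python 3.7 onward; on older interpreters the siblings may come out in a different order.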