How to convert txt file to json using leading spaces?

Question

I have a formatted txt file that looks like this:

Hostinfo Start
  DATE 190819 1522
  HOST midas
  DOMAIN test.de
  HW_PLATFORM x86_64
  SERVER_TYPE virtual
  CPU_INFO
    CPU_TYPE  Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz
    CPU_COUNT 2
    CORE_COUNT 2
  THREAD_COUNT 8
  MEMORY       32951312 kB
  OS Start
    OS Linux
    OS_VERSION 4.9.0-6-amd64
    OS_UPTIME 536 days 21:08
    OS End
  RELEASE Debian GNU/Linux 9 (stretch)
  RELEASE_VERSION 9
  RELEASE_PATCHLEVEL
Hostinfo End

Using the count of leading spaces, I need to convert it to JSON that looks similar to this:

"Hostinfo": [
{
  "DATE": "190819 1522"
  "HOST": "midas"
  "DOMAIN": "test.de"
  "HW_PLATFORM": "x86_64"
  "SERVER_TYPE": "virtual"
  "CPU_INFO": {
    "CPU_TYPE": "Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz"
    "CPU_COUNT": "2"
    "CORE_COUNT": "2"
    }
  "THREAD_COUNT": "8"
  "MEMORY": "32951312 kB"
  "OS": [
      {
      "OS": "Linux"
      "OS_VERSION": "4.9.0-6-amd64"
      "OS_UPTIME": "536 days 21:08"
      }
    ]
  "RELEASE": "Debian GNU/Linux 9 (stretch)"
  "RELEASE_VERSION": "9"
  "RELEASE_PATCHLEVEL" : ""
}
]

I have made some attempts at a script, but I can't work out how to make the lines between curly brackets an object of the enclosing dictionary (level):

#!/usr/bin/python
import json
import itertools
import string
import re

filename = 'commands.txt'

commands = {}
with open(filename) as fh:
    previous_line = 0
    mark_line = ""

    for line in fh:
        current_line = ((len(line) - len(line.lstrip()))/2) 

        diff = current_line - previous_line
        if re.search(' Start$', line.strip()):
            line = line.strip().replace(' Start', ':{')
            print(line)
            mark_line = "start_line"
        elif re.search(' End$', line.strip()):
            line = line.strip().replace(' End', '')
            print("}")
            mark_line = "end_line"
        elif diff == 0:
            print(line.strip())
        elif diff > 0:
            if mark_line == "start_line" or mark_line == "end_line":
                mark_line = "0"
            else:
                print("{")
                print(line.strip())
        elif diff < 0:
            if mark_line == "start_line" or mark_line == "end_line":
                mark_line = "0"
            else:
                print("}")
                print(line.strip())
        previous_line = ((len(line) - len(line.lstrip()))/2)


        #line = (str((len(line) - len(line.lstrip()))/2) + ";" + line.strip())

        try:
            command, description = line.strip().split(' ', 1)
            commands[command] = description.strip()
        except Exception:
            command = line.strip()
            description = ""
            commands[command] = description.strip()


print(json.dumps(commands, indent=2, sort_keys=True))

Maybe you can give me an idea of how to work around this, or some advice? Maybe there is a module that would simplify this script?

UPD: I added some JSON markup to my messy script. Can you please advise whether I am moving in the right direction?
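For reference, the leading-space count used in the script above can be isolated into a small helper. This is only a sketch: the name `indent_depth` and the `width=2` default are assumptions, chosen because the sample file indents by two spaces per level.

```python
def indent_depth(line, width=2):
    """Return the nesting depth of a line, assuming `width` spaces per level."""
    stripped = line.lstrip(" ")
    # Integer division keeps the depth an int on both Python 2 and Python 3.
    return (len(line) - len(stripped)) // width


sample = [
    "Hostinfo Start",
    "  CPU_INFO",
    "    CPU_COUNT 2",
]
for line in sample:
    print(indent_depth(line), line.strip())
```

Comparing the depth of each line with the depth of the previous one tells you whether a block opens, continues, or closes.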

Your input file _almost_ looks like valid YAML (q.v. [Converting a YAML file to Python JSON object](https://stackoverflow.com/questions/50846431/converting-a-yaml-file-to-python-json-object)), but it is slightly off. You should go back to your data source and try to obtain a valid format directly, e.g. YAML, JSON, XML, etc. — Tim Biegeleisen, Oct 18 '19 at 03:39
It looks like `OS` has a `Start` tag, but `CPU INFO` doesn't. Is that right? — Chris W., Oct 18 '19 at 03:51
Syntactically, this format is a mess. There seem to be explicit keywords `Start` and `End` for block boundaries used in `Hostinfo` and `OS`, but then again not with `CPU INFO`, where the block is indicated by the indentation. Should this just be treated as equal? How would you convert from json back to this format, having 2 format options for exactly the same thing? ;) — Jeronimo, Oct 18 '19 at 05:50
why `CPU INFO` has no underscore and does not become `CPU` while `OS Start` becomes `OS`? — lenik, Oct 18 '19 at 07:05
OK, thanks all for your attention. Let's say the Start/End blocks would be JSON arrays, and indentation without Start and End blocks would be JSON dictionaries. PS: I updated the output, sorry for the mess. Also, CPU_INFO should have an underscore, my mistake. As for the other issues, I have no influence over the input. — Kein, Oct 18 '19 at 08:09

Ajax1234 · Accepted Answer · 2019-10-21T22:35:26.617

You can use itertools.groupby with recursion:

# Requires Python 3.8+ (uses an assignment expression, `:=`)
import itertools as it, re
data = [[*re.findall(r'^\s+', b), *re.split(r'(?<=[A-Z])\s+', i)] for b in open('os_stuff.txt') if not (i:=re.sub(r'^\s+|\sStart\n$', '', b)).endswith('End\n')]
def to_tree(d):
   _d = [(a, list(b)) for a, b in it.groupby(d, key=lambda x:bool(re.findall(r'^\s+$', x[0])))]
   new_dict, _last = {}, None
   for i, [a, b] in enumerate(_d):
      if not a:
         for j, *k in b:
            if not k or (not k[0] and i < len(_d) - 2):
               _last = j
            else:
               new_dict[j] = ' '.join(k).strip('\n')
      else:
         new_dict[_last] = [to_tree([[k[2:], *j] if k[2:] else j for k, *j in b])]
   return new_dict

import json
print(json.dumps(to_tree(data), indent=4))

Output:

{
  "Hostinfo": [
    {
        "DATE": "190819 1522",
        "HOST": "midas",
        "DOMAIN": "test.de",
        "HW_PLATFORM": "x86_64",
        "SERVER_TYPE": "virtual",
        "CPU_INFO": [
            {
                "CPU_TYPE": "Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz",
                "CPU_COUNT": "2",
                "CORE_COUNT": "2"
            }
        ],
        "THREAD_COUNT": "8",
        "MEMORY": "32951312 kB ",
        "OS": [
            {
                "OS": "Linux",
                "OS_VERSION": "4.9.0-6-amd64",
                "OS_UPTIME": "536 days 21:08"
            }
        ],
        "RELEASE": "Debian GNU/Linux 9 (stretch)",
        "RELEASE_VERSION": "9",
        "RELEASE_PATCHLEVEL": ""
     }
  ]
}
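The core of `to_tree` is the `itertools.groupby` call, which splits the lines into alternating runs of top-level and indented lines; each indented run then belongs to the last key seen before it. The following is a simplified illustration of that grouping step only, not the answer's full code; the `lines` list and the `is_indented` helper are made up for the example.

```python
import itertools

lines = [
    "DATE 190819 1522",
    "CPU_INFO",
    "  CPU_COUNT 2",
    "  CORE_COUNT 2",
    "THREAD_COUNT 8",
]


def is_indented(line):
    # True for lines that start with a space, i.e. lines nested under a key.
    return line.startswith(" ")


# groupby only merges *consecutive* items with the same key, which is what
# keeps each nested block attached to the key line directly above it.
groups = [(key, list(run)) for key, run in itertools.groupby(lines, key=is_indented)]
for key, run in groups:
    print(key, run)
```

Here the indented run (`CPU_COUNT`, `CORE_COUNT`) directly follows the run ending in `CPU_INFO`, so a recursive call on that run, with one level of indentation removed, produces the nested value for `CPU_INFO`.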

Edit: Python 2.7 solution:

import itertools as it, re
new_data = [[i, re.sub(r'^\s+|\sStart\n$', '', i)] for i in open('os_stuff.txt')]
data = [re.findall(r'^\s+', a)+re.split(r'(?<=[A-Z])\s+', b) for a, b in new_data if not b.endswith('End\n')]
def to_tree(d):
  _d = [(a, list(b)) for a, b in it.groupby(d, key=lambda x:bool(re.findall(r'^\s+$', x[0])))]
  new_dict, _last = {}, None
  for i, [a, b] in enumerate(_d):
     if not a:
       for j_k in b:
         if not j_k[1:] or (not j_k[1:][0] and i < len(_d) - 2):
            _last = j_k[0]
         else:
            new_dict[j_k[0]] = ' '.join(j_k[1:]).strip('\n')
     else:
       new_dict[_last] = [to_tree([[k_j[0][2:]]+k_j[1:] if k_j[0][2:] else k_j[1:] for k_j in b])]
  return new_dict


import json
print(json.dumps(to_tree(data), indent=4))

Sorry, I'm trying to run your script, but Python reports a syntax error :( on the line that assigns the data variable — Kein, Oct 20 '19 at 18:17
If I understand correctly, this script works only with Python 3.8 — Kein, Oct 21 '19 at 22:02
@Kein That is correct, it uses an assignment expression. I will update the solution to be < Python3.8 compatible..... — Ajax1234, Oct 21 '19 at 22:25
@Kein No problem, please see my recent edit. Also, if this answer assisted you, please [accept it](https://stackoverflow.com/help/someone-answers). — Ajax1234, Oct 21 '19 at 22:36
Thanks a lot :) The last line should be `print(json.dumps(to_tree(data), indent=4))` — Kein, Oct 21 '19 at 23:01
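Note that the answer's output wraps `CPU_INFO` in a list, whereas the question's comments ask for `Start`/`End` blocks to become arrays and purely indented blocks to become dictionaries. A stack-based parser is one possible way to keep that distinction. The sketch below is an alternative to the accepted answer, not part of it. The function name `parse_hostinfo`, the two-space indent width, and the assumption that no value ends in ` Start` or ` End` are all assumptions of this example.

```python
import json

SAMPLE = """\
Hostinfo Start
  HOST midas
  CPU_INFO
    CPU_COUNT 2
    CORE_COUNT 2
  MEMORY       32951312 kB
  OS Start
    OS Linux
    OS_UPTIME 536 days 21:08
    OS End
  RELEASE_PATCHLEVEL
Hostinfo End
"""


def parse_hostinfo(text, width=2):
    root = {}
    stack = [(-1, root)]   # (depth of the owning key, dict collecting its children)
    pending = None         # (depth, parent, key) of the last key that had no value
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.endswith(" End"):
            continue       # blocks are closed by indentation, so End lines are skipped
        depth = (len(raw) - len(raw.lstrip(" "))) // width
        if pending is not None and depth > pending[0]:
            # A valueless key followed by deeper lines becomes a dictionary.
            p_depth, p_parent, p_key = pending
            child = {}
            p_parent[p_key] = child
            stack.append((p_depth, child))
        pending = None
        while stack[-1][0] >= depth:
            stack.pop()    # leave every block that is not shallower than this line
        parent = stack[-1][1]
        if line.endswith(" Start"):
            # An explicit Start block becomes a one-element array.
            child = {}
            parent[line[:-len(" Start")]] = [child]
            stack.append((depth, child))
            continue
        key, _, value = line.partition(" ")
        parent[key] = value.strip()
        if not parent[key]:
            pending = (depth, parent, key)
    return root


result = parse_hostinfo(SAMPLE)
print(json.dumps(result, indent=2))
```

A key with no value and no deeper lines after it, such as `RELEASE_PATCHLEVEL`, stays an empty string, which matches the output requested in the question.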
