
I have several very large, not-quite-CSV log files.

Given the following conditions:

  1. value fields can contain unescaped newlines and commas; almost anything can appear in a value field, including '='
  2. each valid line has an unknown number of valid value fields
  3. valid value looks like key=value such that a valid line looks like key1=value1, key2=value2, key3=value3 etc.
  4. each valid line should begin with eventId=<some number>,

What is the best way to read a file, split the file into correct lines and then parse each line into correct key value pairs?

I have tried

file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')

This correctly parses the first entry, but every other entry then starts with =# instead of eventId=#. Is there a way to keep the delimiter and split on the valid newline?
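For reference, a split that keeps the delimiter is possible with a zero-width lookahead, so no if-check or string concatenation is needed afterwards (a sketch, assuming records really do start a line with eventId=<digits>,):

```python
import re

# Sketch: split on newlines that are immediately followed by the next
# record's "eventId=<digits>," header. The lookahead keeps the delimiter,
# so every entry still starts with "eventId=".
text = "eventId=1, a=b,\nstray, line,\neventId=2, c=d,"
entries = re.split(r'\n(?=eventId=\d+,)', text)
print(entries)  # ['eventId=1, a=b,\nstray, line,', 'eventId=2, c=d,']
```

On a real file this would be applied to the whole read() string, e.g. re.split(r'\n(?=eventId=\d+,)', open(file_name).read()).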

Also, speed is very important.

Example Data:

eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,

Yes, the file really is this messy (sometimes). Each event here has 3 key=value pairs, although in reality each event has an unknown number of them.

deltap
  • I'd start by splitting it up into entries utilizing property 4, e.g. split at every instance of `eventId=\d+`. From there it's a simple matter of splitting into a dictionary, perhaps utilizing a regex that matches `=,`. – CollinD Sep 23 '15 at 22:02
  • I am trying to use property 4 to split the lines on the read but the way I am currently doing it removes the delimiter. – deltap Sep 23 '15 at 22:06
  • Since the delimiter is static, you could always just add it back in. I'm not terribly familiar with Python so I can't provide a ton of help there, sorry. – CollinD Sep 23 '15 at 22:06
  • With speed being an issue I was hoping there was a cleaner way. I know how to append the string, but the entire correction will involve an if statement (to check whether the 'eventId' is there or not) and then a string concatenation. Both are slow. – deltap Sep 23 '15 at 22:17
  • add a snippet of the actual input showing all possibilities – Padraic Cunningham Sep 23 '15 at 22:30
  • If the values can contain unescaped and unquoted equals signs, there's probably no unambiguous way to parse a given line. If the values can contain `"\neventId=#"`, you can't even unambiguously match the lines. – Blckknght Sep 23 '15 at 23:13
  • I've thought of this; while it is possible, I am willing to assume that it will not happen in the file. Otherwise, as you said, all hope is lost. – deltap Sep 23 '15 at 23:19
  • what is the output supposed to be using your sample? – Padraic Cunningham Sep 23 '15 at 23:44
  • Do the keys in the `key=value` structure have any well-defined nature? That is, do they only contain letters, or something? E.g, can you tell if `"maybe?"` is a key (with an empty value) in your example? – Blckknght Sep 23 '15 at 23:58
  • I don't know this for sure but I think keys are only letters and numbers. I am trying to get a list of valid keys but I don't have that yet. – deltap Sep 24 '15 at 00:00
  • Can you add the expected output? Your description does not match what you have added as input. – Padraic Cunningham Sep 24 '15 at 00:13

4 Answers

0

If "the start of each valid line should begin with eventId=" is correct, you can use groupby on those lines and find valid pairs with a regex:

from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = dict(l.split("=") for k, v in grps if k
             for l in r.findall(next(v))[1:])
    print(d)

Output:

{'key3': 'value3', 'key2': 'value2', 'key1': 'value1', 'goodkey': 'goodvalue'}

If you want to keep the eventIds:

from itertools import groupby
import re

with open("test.txt") as f:
    r = re.compile(r"\w+=\w+")
    grps = groupby(f, key=lambda x: x.startswith("eventId="))
    d = [r.findall(next(v)) for k, v in grps if k]
    print(d)
[['eventId=123', 'goodkey=goodvalue', 'key2=somestuff'], ['eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3']]

It's not clear from your description exactly what the output should be. If you want all the valid key=value pairs, and if "the start of each valid line should begin with eventId=" is not accurate:

from itertools import groupby
import re

def parse(fle):
    with open(fle) as f:
        r = re.compile(r"\w+=\w+")
        grps = groupby(f, key=lambda x: x.startswith("eventId="))
        for k, v in grps:
            if k:
                # pull in the following non-eventId lines, if any
                sub = "".join(list(v) + list(next(grps, (False, []))[1]))
                yield from r.findall(sub)

print(list(parse("test.txt")))

Output:

['eventId=123', 'key=value', 'key2=value2', 'anotherkey=anothervalue',   
'eventId=1234', 'key1=value1', 'key2=value2', 'key3=value3', 
'eventId=12345', 'key=value', 'key21=value']
Padraic Cunningham
0

This problem is pretty insane, but here's a solution that seems to work. Always use an existing library to output formatted data, kids.

import re

in_string = """eventId=123, goodkey=goodvalue, key2=somestuff:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue, gotit=see,
the problem===s,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, validkey=validvalue,"""

line_matches = list(re.finditer(r'(,\n)?eventId=\d', in_string))

lines = []
for i in range(len(line_matches)):
    match_start = line_matches[i].start()
    next_match_start = line_matches[i+1].start() if i < len(line_matches)-1 else len(in_string)-1
    line = in_string[match_start:next_match_start].lstrip(',\n')
    lines.append(line)

lineDicts = []
for line in lines:
    d = {}
    pad_line = ', '+line
    matches = list(re.finditer(r', [\w\d]+=', pad_line))
    for i in range(len(matches)):
        match = matches[i]
        key = match.group().lstrip(', ').rstrip('=')
        next_match_start = matches[i+1].start() if i < len(matches)-1 else len(pad_line)
        value = pad_line[match.end():next_match_start]
        d[key] = value
    lineDicts.append(d)

print(lineDicts)

Outputs [{'eventId': '123', 'key2': 'somestuff:\nthis, will, be, a problem,\nmaybe?=,\nanotherkey=anothervalue', 'goodkey': 'goodvalue', 'gotit': 'see,\nthe problem===s'}, {'eventId': '1234', 'key2': 'value2', 'key1': 'value1', 'key3': 'value3'}, {'eventId': '12345', 'key1': '\nmsg= {this is not a valid key value pair}', 'validkey': 'validvalue'}]

Neal Ehardt
  • Thank you, I'll give this a go as soon as I get home. I am in 100% agreement with you about using libraries for I/O. This nightmare of a file was exported from 3rd party software written by a fortune 500 company who will remain nameless. As someone fairly new to the workforce I am amazed at the ineptitude of commercial software. – deltap Sep 24 '15 at 00:06
  • Are you sure it's incompetence? There's lots of extra profit to be made if they can convince a customer that something trivially easy is in fact hard. Sorry, back to tech stuff now. – nigel222 Sep 24 '15 at 08:29
0

If your values really can contain anything, there's no unambiguous way of parsing. Any key=value pair could be part of the preceding value. Even an eventId=# pair on a new line could be part of a value from the previous line.

Now, perhaps you can do a "good enough" parse on the data despite the ambiguity, if you assume that values will never contain valid-looking key= substrings. If you know the possible keys (or at least what constraints they have, like being alphanumeric), it will be a lot easier to guess at what is a new key and what is just part of the previous value.
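As an illustration of that idea (hypothetical key list, not part of the original answer), the pattern can be anchored to the known keys, so stray '=' and ',' characters inside values never start a new pair:

```python
import re

# Hypothetical: suppose only these keys can legally appear.
KNOWN_KEYS = ("eventId", "key", "key1", "key2", "key3", "anotherkey")
key_alt = "|".join(KNOWN_KEYS)
# A value runs until the next known key (or the end of the string),
# so '=' and ',' inside values don't start a new pair.
pair_re = re.compile(r"(?s)\b({0})=(.*?)(?=,\s*(?:{0})=|,?\s*$)".format(key_alt))

sample = "eventId=123, key=va=lue, anotherkey=another, value,"
print(pair_re.findall(sample))
# [('eventId', '123'), ('key', 'va=lue'), ('anotherkey', 'another, value')]
```

The word boundary plus the alternation means "key" inside "anotherkey" is never mistaken for a new pair, and backtracking sorts out overlapping names like key vs. key1.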

Anyway, if we assume that all alphanumeric strings followed by equals signs are indeed keys, we can do a parse with regular expressions. Unfortunately, there's no easy way to do this line by line, nor is there a good way to capture all the key-value pairs in a single scan. However, it's not too hard to scan once to get the log lines (which may have embedded newlines) and then separately get the key=value pairs for each one.

import re

with open("my_log_file") as infile:
    text = infile.read()

line_pattern = r'(?s)eventId=\d+,.*?(?:$|(?=\neventId=\d+))'
kv_pattern = r'(?s)(\w+)=(.*?),\s*(?:$|(?=\w+=))'
results = [re.findall(kv_pattern, line) for line in re.findall(line_pattern, text)]

I'm assuming that the file is small enough to fit into memory as a string. It would be quite a bit more obnoxious to solve the problem if the file can't all be handled at once.

If we run this regex matching on your example text, we get:

[[('eventId', '123'), ('key', 'value'), ('key2', 'value2:\nthis, will, be, a problem,\nmaybe?='), ('anotherkey', 'anothervalue')],
 [('eventId', '1234'), ('key1', 'value1'), ('key2', 'value2'), ('key3', 'value3')],
 [('eventId', '12345'), ('key1', '\nmsg= {this is not a valid key value pair}'), ('key', 'value'), ('key21', 'value=')]]

maybe? is not considered a key because of the question mark. msg and the final value are not considered keys because there were no commas separating them from a previous value.
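If the file can't be read into memory at once, a streaming variant of the same idea is possible (a sketch, not part of the original answer, assuming every record really does start a new physical line with eventId=):

```python
import re

def iter_entries(lines):
    # Accumulate physical lines into logical records: a new record starts
    # whenever a physical line begins with "eventId=<digits>,".
    start = re.compile(r"eventId=\d+,")
    buf = []
    for line in lines:
        if start.match(line) and buf:
            yield "".join(buf)
            buf = []
        buf.append(line)
    if buf:
        yield "".join(buf)

# Works on any iterable of lines, e.g. an open file object.
sample = ["eventId=1, a=b,\n", "stray, line,\n", "eventId=2, c=d,\n"]
print(list(iter_entries(sample)))
# ['eventId=1, a=b,\nstray, line,\n', 'eventId=2, c=d,\n']
```

The kv_pattern shown earlier could then be applied to each yielded record, keeping only one record in memory at a time.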

Blckknght
-1

Oh! This is an interesting problem. You'll want to process each line, and each part of a line, separately, without iterating through the file more than once.

data_dict = {}
file_lines = open('file.txt','r').readlines()
for line in file_lines:
    line_list = line.split(',')
    if len(line_list)>=1:
        if 'eventId' in line_list[0]:
            for item in line_list:
                pair = item.split('=')
                data_dict.update({pair[0]:pair[1]})

That should do it. Enjoy!

If there are spaces in the 'pseudo csv', please change the last line to:

data_dict.update({pair[0].split():pair[1].split()})

In order to remove spaces from the strings for your key and value.

p.s. If this answers your question, please click the check mark on the left to record this as an accepted answer. Thanks!

p.p.s. A set of lines from your actual data would be very helpful in writing something to avoid error cases.

Alea Kootz
  • IndexError: list index out of range. I don't think you are accounting for newlines in the values. I also don't think you are taking into account that there can be '=', '\n', and ',' in the value fields. – deltap Sep 23 '15 at 23:07
  • Would you show us a few lines of your data? I'm operating under the 'sort of csv' assumption here. – Alea Kootz Sep 23 '15 at 23:09
  • Tried your update: TypeError: unhashable type: 'list'. The file is huge and I am not at liberty to disclose actual lines (any sample may not contain all cases). The rules I provided seem to be robust. I'd have to generate some 'fake' data. – deltap Sep 23 '15 at 23:10
  • Well, I'll look at this after I get home. My answer works for the data you provided in section 3 of your question. I'm willing to try and write some way to handle weird stuff in value fields, but without delimiters, you're going to need some complex regexes. – Alea Kootz Sep 23 '15 at 23:13
  • added examples of what can happen – deltap Sep 23 '15 at 23:19