I have several very large, not-quite-CSV log files.
Given the following conditions:
- value fields can contain unescaped newlines and commas; almost anything can appear in a value, including '='
- each valid line has an unknown number of valid value fields
- a valid value looks like
key=value
so that a valid line looks like
key1=value1, key2=value2, key3=value3
etc.
- each valid line begins with
eventId=<some number>,
What is the best way to read the file, split it into correct lines, and then parse each line into correct key=value pairs?
I have tried
file_name = 'file.txt'
read_file = open(file_name, 'r').read().split(',\neventId')
This correctly captures the first entry, but every other entry starts with =#
instead of eventId=#, because split() consumes the delimiter. Is there a way to keep the delimiter and split only on the valid newlines?
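For example, is a zero-width lookahead the right tool here? A minimal sketch of what I mean (it still assumes that a comma, a newline, and 'eventId=' never occur in that sequence inside a value):

import re

with open('file.txt', 'r') as f:
    text = f.read()

# The lookahead (?=eventId=) matches without consuming anything,
# so 'eventId=' stays attached to the start of every chunk.
events = re.split(r',\n(?=eventId=)', text)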
Also, speed is very important.
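Given the file sizes, I am also wondering whether a streaming pass along these lines would be faster than reading the whole file into memory (again just a sketch, assuming 'eventId=' at the start of a physical line always marks a real event and never appears there inside a value):

def read_events(path):
    # Accumulate physical lines until the next line that starts a
    # new event, then yield the buffered event as one string.
    buf = []
    with open(path, 'r') as f:
        for line in f:
            if line.startswith('eventId=') and buf:
                yield ''.join(buf)
                buf = []
            buf.append(line)
    if buf:
        yield ''.join(buf)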
Example Data:
eventId=123, key=value, key2=value2:
this, will, be, a problem,
maybe?=,
anotherkey=anothervalue,
eventId=1234, key1=value1, key2=value2, key3=value3,
eventId=12345, key1=
msg= {this is not a valid key value pair}, key=value, key21=value=,
Yes, the file really is this messy (sometimes). Each event here has three key=value pairs, although in reality each event contains an unknown number of key=value pairs.
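For the parsing half of the question, the best I can come up with is a best-effort pass like the sketch below: a new pair starts only where a comma-separated field looks like word=..., and every other field is treated as an unescaped comma inside the previous value and glued back on. I do not know how robust this can be made against fields like msg= {this is not a valid key value pair}:

import re

KEY_RE = re.compile(r'^\s*(\w+)=(.*)$', re.DOTALL)

def parse_event(raw):
    # Best-effort: only a field matching 'word=...' opens a new pair;
    # anything else is assumed to be part of the previous value.
    pairs = []
    for field in raw.split(','):
        m = KEY_RE.match(field)
        if m:
            pairs.append([m.group(1), m.group(2)])
        elif pairs:
            # re-attach the comma that split() consumed
            pairs[-1][1] += ',' + field
    return [(key, value.strip()) for key, value in pairs]

The open question is whether treating every word= as the start of a new pair is safe, since the values can apparently contain '=' as well.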