Background: I'm converting a PDF of user messages into a text file and trying to rebuild the message threads in a structured data format.
The problem: I have built a function that scans each line of text, detects a thread_id, marks that line as belonging to the appropriate thread_id, and then creates a list of lists structured like this:
thread_lines = [['1234567890', 'Dear James,'],
                ['1234567890', 'See you soon.'],
                ['5558881112', 'Foobar']]
Item 0 of each inner list is the thread_id. Ideally I'd like to create a dictionary where each thread_id is a key and all lines with the same thread_id are concatenated together as the corresponding value.
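To make the target concrete, here is a minimal sketch of that grouping step. The helper name group_threads and the choice to join lines with a single space are assumptions for illustration, not part of the original code; a newline or an empty string would work just as well as the separator.

```python
from collections import defaultdict

def group_threads(thread_lines, sep=" "):
    """Collapse [thread_id, line] pairs into {thread_id: joined text}.

    `group_threads` and `sep` are hypothetical names chosen for this sketch.
    """
    grouped = defaultdict(list)
    for thread_id, line in thread_lines:
        grouped[thread_id].append(line)
    # Join once per thread rather than concatenating strings inside the loop
    return {thread_id: sep.join(lines) for thread_id, lines in grouped.items()}

thread_lines = [['1234567890', 'Dear James,'],
                ['1234567890', 'See you soon.'],
                ['5558881112', 'Foobar']]

thread_dict = group_threads(thread_lines)
# {'1234567890': 'Dear James, See you soon.', '5558881112': 'Foobar'}
```

Collecting the lines in a list per key and joining at the end keeps the original line order within each thread, and since Python 3.7 the dictionary also keeps threads in the order they first appear.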
The code: I have a function called check_thread, which I've omitted here, that uses a regex to identify the thread_id. Below is the small function that scans and categorizes each line.
def thread_create(text):
    thread_lines = []
    thread_id = None
    thread_dict = {}  # not populated yet
    for i, line in enumerate(text):
        # Is this line the beginning of a new thread?
        if 'Thread' in line:
            found = check_thread(line)
            # The id may sit on the following line; guard against running off the end
            if found is None and i + 1 < len(text):
                found = check_thread(text[i + 1])
            if found is not None:
                thread_id = found
        # Line belongs to the current thread, so record it with its id
        if thread_id is not None:
            thread_lines.append([thread_id, line])
    return thread_lines
Can anyone offer advice, or suggest a method for wrangling this data into the form I need?