
Background: I'm converting a PDF of user messages into a text file and trying to rebuild the message threads in a structured data format.

The problem: I have built a function that scans each line of text, detects a thread_id, tags the line with that thread_id, and builds a list of lists structured like this:

thread_lines = [['1234567890', 'Dear James,'],
                ['1234567890', 'See you soon.'],
                ['5558881112', 'Foobar']]

Item 0 of each inner list is the thread_id. Ideally I'd like to create a dictionary where each thread_id is a key and all lines of the same thread_id are concatenated together as the corresponding value.

The code: I have a function called check_thread, omitted here, that uses a regex to identify the thread_id. Below is the small function that scans and categorizes each line.

def thread_create(text):
    thread_lines = []
    thread_id = None

    for i, line in enumerate(text):
        # Does this line start a new thread?
        if 'Thread' in line:
            # The id may be on this line or on the next one.
            found = check_thread(line)
            if found is None and i + 1 < len(text):
                found = check_thread(text[i + 1])
            if found is not None:
                thread_id = found

        # The line belongs to the current thread, so record it.
        if thread_id is not None:
            thread_lines.append([thread_id, line])

    return thread_lines

Can anyone offer any advice, or perhaps a method for wrangling this data into the form I need?

martineau
Jon Behnken

1 Answer


If I understood correctly, this should do it:

thread_lines = [['1234567890', 'Dear James,'],
                ['1234567890', 'See you soon.'],
                ['5558881112', 'Foobar']]


result = {}
for tid, sentence in thread_lines:
    result.setdefault(tid, []).append(sentence)

print(result)

Output

{'1234567890': ['Dear James,', 'See you soon.'], '5558881112': ['Foobar']}
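If you want each value to be a single concatenated string rather than a list of lines, as the question describes, one possible variation is to group first and join afterwards. This is only a sketch: the helper name join_threads and the choice of a single space as the separator are my assumptions, so swap in '\n' or whatever fits your text. It uses collections.defaultdict, which behaves like the setdefault version above.

```python
from collections import defaultdict

thread_lines = [['1234567890', 'Dear James,'],
                ['1234567890', 'See you soon.'],
                ['5558881112', 'Foobar']]


def join_threads(pairs, sep=' '):
    """Group (thread_id, line) pairs and join each group's lines with sep.

    join_threads and sep are illustrative names, not from the question.
    """
    grouped = defaultdict(list)
    for tid, sentence in pairs:
        grouped[tid].append(sentence)
    # Joining once per thread avoids repeated string concatenation in the loop.
    return {tid: sep.join(sentences) for tid, sentences in grouped.items()}


threads = join_threads(thread_lines)
print(threads)
# {'1234567890': 'Dear James, See you soon.', '5558881112': 'Foobar'}
```

Lines stay in the order they appear in thread_lines, and since Python 3.7 the keys keep the order in which each thread_id was first seen.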
Dani Mesejo