
Some mail clients don't set the References header, but set Thread-Index instead.

Is there a way to parse this header in Python?

Related: How does the email header field 'thread-index' work?

Mail 1

Date: Tue, 2 Dec 2014 08:21:00 +0000
Thread-Index: AdAOBz5QJ/JuQSJMQTmSQ8+dVs2IDg==

Mail 2 (which is related to Mail 1)

Date: Mon, 8 Dec 2014 13:12:13 +0000
Thread-Index: AdAOBz5QJ/JuQSJMQTmSQ8+dVs2IDgE4StZw

Update

I want to be able to link these two mails in my application. It already works perfectly for the well-known References and In-Reply-To headers.

guettli
  • What exactly are you trying to do with Thread-Index? What kind of info are you trying to retrieve? It seems that there is no python package to parse this header and you will have to implement something that suits your needs. [This post](http://www.solutionary.com/resource-center/blog/2014/04/thread-index-value-analysis/) may be useful as a first guide on parsing this header using python. If you specify what your needs are, maybe I could help. Good luck! – Lucas Infante Dec 11 '14 at 11:00
  • @maccinza I updated the question: I want to be able to link these two mails in my application. It already works perfectly for the well known References and In-Reply-To headers. – guettli Dec 11 '14 at 13:44

2 Answers


Using the info here, I was able to put the following together:

import struct, datetime

def parse_thread_index(index):

    s = index.decode('base64')

    guid = struct.unpack('>IHHQ', s[6:22])
    # %012X zero-pads the last group; plain %12X would pad with spaces
    guid = '{%08X-%04X-%04X-%04X-%012X}' % (guid[0], guid[1], guid[2], (guid[3] >> 48) & 0xFFFF, guid[3] & 0xFFFFFFFFFFFF)

    f = struct.unpack('>Q', s[:6] + '\0\0')[0]
    ts = [datetime.datetime(1601, 1, 1) + datetime.timedelta(microseconds=f//10)]

    for n in range(22, len(s), 5):
        f = struct.unpack('>I', s[n:n+4])[0]
        ts.append(ts[-1] + datetime.timedelta(microseconds=(f<<18)//10))

    return guid, ts

Given a thread index, it returns a tuple (guid, [list of dates]). For your test data, the result is:

>>> parse_thread_index('AdAOBz5QJ/JuQSJMQTmSQ8+dVs2IDgE4StZw')
('{27F26E41-224C-4139-9243-CF9D56CD880E}', [datetime.datetime(2014, 12, 2, 8, 9, 6, 673459), datetime.datetime(2014, 12, 8, 13, 11, 0, 807475)])

I don't have enough test data at hand, so this code might be buggy. Feel free to let me know.
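
For simply linking two mails, full parsing may not be needed. The following is a minimal sketch, assuming that all mails of one thread share the same first 22 bytes (the 6 timestamp bytes plus the 16 GUID bytes that the function above slices out) and that each reply appends a 5-byte block. The helper names `thread_key`, `same_thread` and `is_ancestor` are made up for illustration, and the sketch is written for Python 3.

```python
import base64

# 6 timestamp bytes + 16 GUID bytes, matching the slices s[:6] and s[6:22] above
HEADER_LEN = 22


def thread_key(index):
    """Return the 22-byte header block, assumed to be shared by a whole thread."""
    raw = base64.b64decode(index)
    if len(raw) < HEADER_LEN:
        raise ValueError('Thread-Index too short: %d bytes' % len(raw))
    return raw[:HEADER_LEN]


def same_thread(index_a, index_b):
    """True if both Thread-Index values start with the same header block."""
    return thread_key(index_a) == thread_key(index_b)


def is_ancestor(parent_index, child_index):
    """True if child_index is parent_index plus one or more appended bytes."""
    parent = base64.b64decode(parent_index)
    child = base64.b64decode(child_index)
    return len(child) > len(parent) and child.startswith(parent)
```

With the two headers from the question, `same_thread` and `is_ancestor(mail1, mail2)` should both be true, since the value of Mail 2 is the value of Mail 1 followed by one extra 5-byte block.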

georg
  • Fails for 'kp4o6SAzO6Xc19R5OPjnmqbg6v2utA==': OverflowError: date value out of range – guettli Jan 16 '15 at 11:48
  • @guettli: this doesn't look like a valid header (first byte must be `1`). – georg Jan 18 '15 at 12:13
  • I found another approach to this problem which, in my opinion, is a little more readable than yours. That said, your solution has been running in production for several years without any problems, so this is just for information in case others or I need a backup solution in the future. [https://technical.nttsecurity.com/post/102enx6/outlook-thread-index-value-analysis] You can find the code at the bottom of the linked page. – tzanke Nov 06 '19 at 08:02

Here is @georg's answer updated to Python 3:

import base64, struct, datetime

def parse_thread_index(index):

    s = base64.b64decode(index)

    guid = struct.unpack('>IHHQ', s[6:22])
    # %012X zero-pads the last group; plain %12X would pad with spaces
    guid = '{%08X-%04X-%04X-%04X-%012X}' % (guid[0], guid[1], guid[2], (guid[3] >> 48) & 0xFFFF, guid[3] & 0xFFFFFFFFFFFF)

    f = struct.unpack('>Q', s[:6] + b'\0\0')[0]
    ts = [datetime.datetime(1601, 1, 1) + datetime.timedelta(microseconds=f//10)]

    for n in range(22, len(s), 5):
        f = struct.unpack('>I', s[n:n+4])[0]
        ts.append(ts[-1] + datetime.timedelta(microseconds=(f<<18)//10))

    return guid, ts
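
As noted in the comments above, a value such as 'kp4o6SAzO6Xc19R5OPjnmqbg6v2utA==' makes the parser raise OverflowError because it does not look like a valid header. A small guard in front of the parser can catch this. This is only a sketch: the name `looks_like_thread_index` is made up, the "first byte must be 1" rule is taken from the comment above, and the length rule (22 bytes plus a multiple of 5) is an assumption derived from how the function slices the data.

```python
import base64
import binascii


def looks_like_thread_index(index):
    """Cheap plausibility check to run before parse_thread_index()."""
    try:
        raw = base64.b64decode(index, validate=True)
    except (binascii.Error, ValueError):
        return False  # not valid base64 at all
    # 22-byte header block, then zero or more 5-byte child blocks (assumed layout)
    if len(raw) < 22 or (len(raw) - 22) % 5 != 0:
        return False
    # first byte must be 1, per the comment on the original answer
    return raw[0] == 1
```

Mails that fail this check can then be skipped or matched by References and In-Reply-To alone, instead of letting the exception propagate.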
rigo