I have two email test files:

  1. A file that has been created by using "save as" in Mac Mail (this creates a .txt file)
  2. A file that has been created by dragging an email from Mac Mail to the Desktop (this creates an .eml file)

If I send either file to the Tika server with

curl -T filename http://localhost:9998/detect/stream

I get the response "message/rfc822" for both files.

If I run

curl -T filename http://localhost:9998/meta

I get the metadata, but in case (1) the date is not extracted, while in case (2) it is.

I understand, of course, that the .eml file includes the full raw header, while the .txt file only includes a very abbreviated header. However, even the abbreviated header does include a "Date" field, so I think Tika should extract it. Is this a bug or intentional? If it is intentional, is there anything I could do to get Tika to extract the date in case (1)?

I am running Tika-server 1.14.

Philipp

1 Answer

Thank you for opening TIKA-1970. The underlying Apache James mime4j library isn't able to parse a date in the format "16 May 2016 at 09:30:32 GMT+1". We'll add extra date-parsing code at the Tika level to catch the date formats that mime4j doesn't recognize.
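
Until such a fix is available, the date could be recovered outside Tika. The following is only a minimal Python sketch, not Tika's or mime4j's actual code: the function name `parse_mac_mail_date` is hypothetical, and it assumes the header always looks like the example above (English month name, the word "at", and a "GMT" suffix with an optional offset).

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout: "16 May 2016 at 09:30:32 GMT+1" (offset part is optional).
_PATTERN = re.compile(
    r"(\d{1,2}) ([A-Za-z]+) (\d{4}) at (\d{1,2}):(\d{2}):(\d{2}) "
    r"GMT(?:([+-])(\d{1,2})(?::?(\d{2}))?)?$"
)

# Month lookup by the first three letters, so "May" and "September" both work
# without depending on the process locale.
_MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def parse_mac_mail_date(value):
    """Return a timezone-aware datetime, or None if the text does not match."""
    match = _PATTERN.match(value.strip())
    if match is None:
        return None
    day, month_name, year, hour, minute, second, sign, off_h, off_m = match.groups()
    month = _MONTHS.get(month_name[:3].lower())
    if month is None:
        return None
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    try:
        return datetime(int(year), month, int(day), int(hour), int(minute),
                        int(second), tzinfo=timezone(offset))
    except ValueError:
        # Out-of-range fields such as day 32 or hour 25.
        return None


if __name__ == "__main__":
    parsed = parse_mac_mail_date("16 May 2016 at 09:30:32 GMT+1")
    print(parsed.isoformat() if parsed else "unrecognized date")
```

The returned value can be formatted with `isoformat()` and stored alongside the metadata that `/meta` returns for the .txt file.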

Again, thank you for noticing and for opening an issue on our JIRA.

Tim Allison