I have an email client written in Django. It currently supports Gmail accounts using imaplib.

My problem is: I want to obtain the attachment names without having to download the full email. Currently, in order to obtain the attachment names, or the email body, I need to download the whole email using the fetch function with the parameter (RFC822).

I know I can obtain specific fields only using HEADER.FIELDS, for the subject, from, cc for example. But is there a way to obtain the attachment names or the email body without downloading the whole email?

What I mean specifically is: let's say I have a 30 MB email that has one line of text in the body and two 15 MB attachments. I want to obtain the attachment names and that line of text without downloading the full 30 MB message.

Thank you

  • "...way to obtain [] email body without downloading the whole email" -> without the attachment you mean? – RickyA Dec 12 '12 at 20:17
  • Do you know about `BODYSTRUCTURE` and `(BODY ENVELOPE)`? – abarnert Dec 12 '12 at 20:17
  • @JonClements: No, an IMAP server can parse the body for you and return the parts you want. You have to be able to parse the `BODYSTRUCTURE` to know what to ask for, but then you can do it. – abarnert Dec 12 '12 at 20:19
  • @abarnert Yes, I posted and then realised there was something that could be done, about the same time you posted - so thought I'd remove it :) – Jon Clements Dec 12 '12 at 20:21
  • Edited the question to detail the problem better. Will investigate BODYSTRUCTURE and (BODY ENVELOPE). –  Dec 12 '12 at 20:25
  • Please see my edited answer. – evading Dec 12 '12 at 20:47

2 Answers

Assuming you're asking what I think you're asking, here's what to do:

First, fetch the BODYSTRUCTURE. Assuming gmail's IMAP server supports this, you'll get back something like this:

(("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL "QUOTED-PRINTABLE" 56 1 NIL NIL NIL NIL)
 ("TEXT" "HTML" ("CHARSET" "UTF-8") ("NAME" "") NIL NIL "BASE64" 12345 NIL
  ("attachment" ("FILENAME" "")) NIL NIL)
 ("IMAGE" "JPEG" ("NAME" "funny picture") NIL NIL "BASE64" 56789 NIL
  ("attachment" ("FILENAME" "image.jpg")) NIL NIL)
 "MIXED" ("BOUNDARY" "----_=_NextPart_001_1234ABCD.56789EF0") NIL NIL NIL)

And then fetch the (BODY ENVELOPE) if the structure has one.

RFC 3501, section 7.4.2, explains how to deal with these.

Once you've determined that the (BODY[1]) and (BODY[2]) are the plain-text and HTML versions of the main content, and (BODY[3]) is the first real attachment, you download the plain-text body by fetching (BODY[1]), and you've got the name of the attachment from the structure.

Sorry there's no code here. I don't think either imaplib or any of the stdlib MIME- and mail-related modules will do the hard part for you (interpreting the structure), but I haven't actually checked, so I'd look there first, and, if not, go to PyPI to see if anyone else has already written the code.

Well, actually, first I'd just fetch BODYSTRUCTURE, (BODY ENVELOPE) and (BODY[3]) for a specific message to make sure gmail has complete support before writing a whole mess of code…
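For that quick check, a minimal Python 3 sketch might look like the following. It assumes `conn` is an `imaplib.IMAP4_SSL` object that is already logged in with a mailbox selected. The function name `fetch_structure_and_part` is made up for illustration, and `BODY.PEEK` is used in place of `BODY` so that the message is not marked as read.

```python
def fetch_structure_and_part(conn, uid, section="1"):
    """Return (raw BODYSTRUCTURE response, raw bytes of one body section).

    `conn` is assumed to be a logged-in imaplib connection with a mailbox
    selected; `section` is a part number read off the structure.
    """
    # Ask only for the structure: the server parses the MIME tree for us.
    status, data = conn.uid("fetch", uid, "(BODYSTRUCTURE)")
    if status != "OK":
        raise RuntimeError("BODYSTRUCTURE fetch failed: %r" % (data,))
    structure = data[0]

    # BODY.PEEK[n] downloads just that part and leaves the Seen flag alone.
    status, data = conn.uid("fetch", uid, "(BODY.PEEK[%s])" % section)
    if status != "OK":
        raise RuntimeError("part fetch failed: %r" % (data,))

    # imaplib hands back literals as (header, payload) tuples.
    part = data[0][1]
    return structure, part
```

Calling it with `section="3"` would exercise the attachment part from the example structure. If the server sends part of the structure as an IMAP literal, `data[0]` for the first fetch may be a tuple rather than a single bytes line, which this sketch does not handle.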

PS, if worst comes to worst, if your use case is as simple and rigid as you described, you can just always fetch BODYSTRUCTURE and (BODY[1]), fall back to RFC822 if that fails, and get the attachment names by running a hacky regexp on the structure instead of a real parse. I wouldn't write this for anything but a one-shot script or a quick-and-dirty prototype to learn about Gmail, but for those cases, I probably would.
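As a rough illustration of that hacky regexp, here is a Python 3 sketch. The name `guess_attachment_names` is hypothetical, and the pattern only looks for plain quoted `"FILENAME" "..."` pairs, so it will miss names sent as IMAP literals, RFC 2231 encoded names, and names containing escaped quotes.

```python
import re

# Matches the quoted value that follows a "FILENAME" disposition parameter.
FILENAME_RE = re.compile(rb'"FILENAME"\s+"([^"]*)"', re.IGNORECASE)


def guess_attachment_names(raw_structure):
    """Return the non-empty FILENAME values found in a raw BODYSTRUCTURE."""
    if isinstance(raw_structure, str):
        raw_structure = raw_structure.encode("utf-8")
    names = []
    for match in FILENAME_RE.finditer(raw_structure):
        name = match.group(1).decode("utf-8", "replace")
        # Skip empty names and repeats.
        if name and name not in names:
            names.append(name)
    return names
```

This deliberately ignores which part each name belongs to, so it is only good enough when all you need is a list of names to display.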

abarnert
  • Thank you, this seems exactly what I want. Will test in the next 5 minutes and mark as right answer if it works. –  Dec 12 '12 at 20:40
  • Accepted. Thank you for the effort –  Dec 12 '12 at 21:49
[Edit]

Ok here we go =)

>>> import imaplib, email
>>> mail = imaplib.IMAP4_SSL('imap.gmail.com')
>>> mail.login('emailaddr@gmail.com', 'password')
('OK', ['emailaddr@gmail.com Inget Namn authenticated (Success)'])
>>> mail.select('inbox')
('OK', ['14'])
>>> result, data = mail.uid('search', None, 'ALL')
>>> uids=data[0].split()
>>> result, data = mail.uid('fetch', uids[-1], 'BODYSTRUCTURE')
>>> print data
['14 (UID 340 BODYSTRUCTURE ((("TEXT" "PLAIN" ("CHARSET" "ISO-8859-1") NIL NIL "7BIT" 17 1 NIL NIL NIL)("TEXT" "HTML" ("CHARSET" "ISO-8859-1") NIL NIL "7BIT" 17 1 NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "20cf3071d16a5a877b04d0adcc43") NIL NIL)("APPLICATION" "PDF" ("NAME" "attiny40.pdf") NIL NIL "BASE64" 8429956 NIL ("ATTACHMENT" ("FILENAME" "attiny40.pdf")) NIL) "MIXED" ("BOUNDARY" "20cf3071d16a5a878104d0adcc45") NIL NIL))']
>>>

The attachment for this message is called "attiny40.pdf" and you can clearly see that name in the BODYSTRUCTURE. All that is left is parsing that BODYSTRUCTURE.

The code is pretty much taken straight from the last link below.
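The parsing step itself is not covered by the session above or by the linked posts, so what follows is only a minimal Python 3 sketch of one way to do it. The helper names (`parse_list`, `body_structure`, `iter_parts`) are invented for this example. It assumes the whole response arrives as one line with quoted strings (no IMAP literals), and it does not descend into embedded message/rfc822 parts.

```python
import re
from itertools import takewhile

# One token: "(", ")", a quoted string, or a bare atom such as NIL or 17.
TOKEN_RE = re.compile(r'\(|\)|"((?:[^"\\]|\\.)*)"|([^\s()"]+)')


def parse_list(text):
    """Parse IMAP parenthesized data into nested Python lists."""
    stack = [[]]
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        if token == "(":
            stack.append([])
        elif token == ")":
            done = stack.pop()
            stack[-1].append(done)
        elif match.group(1) is not None:
            # Quoted string: drop the quotes and undo backslash escapes.
            stack[-1].append(re.sub(r"\\(.)", r"\1", match.group(1)))
        elif token.upper() == "NIL":
            stack[-1].append(None)
        else:
            stack[-1].append(int(token) if token.isdigit() else token)
    return stack[0]


def body_structure(response):
    """Pull the BODYSTRUCTURE value out of one FETCH response line."""
    if isinstance(response, bytes):
        response = response.decode("utf-8", "replace")
    items = parse_list(response)[1]  # [seq, [key, value, key, value, ...]]
    return items[items.index("BODYSTRUCTURE") + 1]


def params_to_dict(params):
    """Turn ["KEY", "value", ...] into {"key": "value"}; NIL gives {}."""
    params = params or []
    return {k.lower(): v for k, v in zip(params[::2], params[1::2])}


def iter_parts(body, prefix=""):
    """Yield (section, content_type, filename) for every leaf part."""
    if isinstance(body[0], list):
        # Multipart: the leading lists are the children, numbered from 1.
        children = takewhile(lambda item: isinstance(item, list), body)
        for number, child in enumerate(children, 1):
            yield from iter_parts(child, "%s%d." % (prefix, number))
        return
    section = prefix.rstrip(".") or "1"
    content_type = "%s/%s" % (body[0].lower(), body[1].lower())
    filename = params_to_dict(body[2]).get("name")
    for item in body[7:]:
        # The disposition looks like ("ATTACHMENT" ("FILENAME" "x.pdf")).
        if (isinstance(item, list) and len(item) == 2
                and isinstance(item[0], str)
                and isinstance(item[1], (list, type(None)))):
            filename = params_to_dict(item[1]).get("filename", filename)
            break
    yield section, content_type, filename
```

Feeding it the response printed in the session, `list(iter_parts(body_structure(data[0])))` should give `('1.1', 'text/plain', None)`, `('1.2', 'text/html', None)` and `('2', 'application/pdf', 'attiny40.pdf')`. The first element of each tuple is the section number to use in a later `BODY[...]` fetch.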

[/Edit]

You will need to change the parameter for fetch from RFC822 to BODYSTRUCTURE.

Then parse the response, as described here for example.

For example, a two part message consisting of a text and a BASE64-encoded text attachment can have a body structure of: (("TEXT" "PLAIN" ("CHARSET" "US-ASCII") NIL NIL "7BIT" 1152 23)("TEXT" "PLAIN" ("CHARSET" "US-ASCII" "NAME" "cc.diff") "960723163407.20117h@cac.washington.edu" "Compiler diff" "BASE64" 4554 73) "MIXED")

See also this post and this one. The last link looks like pretty much what you are trying to do.

evading
  • Very good and complete answer, but I have to accept the other poster's answer because it was around 10 minutes earlier and is equally complete. But thank you for the extensive effort. This is equally 100% what I wanted as well. –  Dec 12 '12 at 21:49