
I'm attempting to learn XML in order to parse GChats downloaded from Gmail via IMAP. To do so I am using lxml. Each chat message is formatted like so:

<cli:message to="email@gmail.com" iconset="square" from="email@gmail.com" int:cid="insertid" int:sequence-no="1" int:time-stamp="1236608405935" xmlns:int="google:internal" xmlns:cli="jabber:client">

<cli:body>Nikko</cli:body>

<met:google-mail-signature xmlns:met="google:metadata">0c7ef6e618e9876b</met:google-mail-signature>

<x stamp="20090309T14:20:05" xmlns="jabber:x:delay"/>

<time ms="1236608405975" xmlns="google:timestamp"/>

</cli:message>

When I try to build the XML tree like so:

root = etree.Element("cli:message")

I get this error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2568, in lxml.etree.Element (src/lxml/lxml.etree.c:52878)
File "apihelpers.pxi", line 126, in lxml.etree._makeElement (src/lxml/lxml.etree.c:11497)
File "apihelpers.pxi", line 1542, in lxml.etree._tagValidOrRaise (src/lxml/lxml.etree.c:23956)
ValueError: Invalid tag name u'cli:message'

When I try to escape it like so:

root = etree.Element("cli\:message")

I get the exact same error.

The header of the chats also gives this information, which seems relevant:

Content-Type: text/xml; charset=utf-8
Content-Transfer-Encoding: 7bit

Does anyone know what's going on here?

spikem

2 Answers

So this didn't get any response, but in case anyone was wondering, BeautifulSoup worked fantastically for this. All I had to do was this:

from bs4 import BeautifulSoup  # msg_data is the raw chat message fetched earlier
soup = BeautifulSoup(repr(msg_data))
print(soup.get_text())

And I got (fairly) clear text.

spikem
I'm trying to do essentially what you did (I want to throw the chat logs into DayOne as journal entries), except I have the chat logs stored locally. Any chance of seeing your final script? – JudeOChop May 07 '13 at 22:45

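The reason you got an invalid tag name is that lxml does not accept a prefixed name such as "cli:message" as a tag. The "cli" prefix is only shorthand for a namespace URI, and lxml identifies a namespaced tag by that URI in curly braces followed by the local name, so it would look instead like:

{url_where_cli_is_defined}message

In your sample the prefix is declared as xmlns:cli="jabber:client", so the tag is {jabber:client}message.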
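To make the braces form concrete, here is a minimal sketch that parses a trimmed copy of the message from the question and prints the names as the parser reports them. The SAMPLE string is shortened for illustration, and the fallback import is my own addition so the snippet also runs without lxml installed; the standard library's ElementTree uses the same {uri}local form.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Fallback (an assumption for portability): same {uri}local tag form.
    import xml.etree.ElementTree as etree

# Trimmed copy of the message from the question (two attributes, one child).
SAMPLE = (
    '<cli:message to="email@gmail.com" int:sequence-no="1" '
    'xmlns:int="google:internal" xmlns:cli="jabber:client">'
    '<cli:body>Nikko</cli:body>'
    '</cli:message>'
)

root = etree.fromstring(SAMPLE)
print(root.tag)  # {jabber:client}message

# Look up the child by its fully qualified name.
body = root.find("{jabber:client}body")
print(body.text)  # Nikko

# Namespaced attributes are keyed the same way.
print(root.get("{google:internal}sequence-no"))  # 1

# Creating an element works once the tag is written in the braces form.
new_root = etree.Element("{jabber:client}message")
print(new_root.tag == root.tag)  # True
```

Note that escaping the colon does not help, because the colon is not the problem; the prefix simply is not part of the name that the library stores.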

If you refer to Automatic XSD validation you will see what I did to simplify managing large numbers of schemas.

Similarly, what I did to avoid this very problem was to replace the prefix using str.replace(), changing "cli:" to "{url}". Having placed all the namespaces in one dictionary made this process quick.
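A rough sketch of that dictionary approach might look like the following. The names NAMESPACES and to_clark are hypothetical, made up for this illustration; the URIs are copied from the xmlns declarations in the question.

```python
# Prefix-to-URI map; URIs copied from the xmlns declarations in the question.
NAMESPACES = {
    "cli": "jabber:client",
    "int": "google:internal",
    "met": "google:metadata",
}


def to_clark(name, namespaces=NAMESPACES):
    """Turn 'prefix:local' into '{uri}local'; leave other names unchanged."""
    for prefix, uri in namespaces.items():
        if name.startswith(prefix + ":"):
            # Replace only the leading prefix, once.
            return name.replace(prefix + ":", "{%s}" % uri, 1)
    return name


print(to_clark("cli:message"))     # {jabber:client}message
print(to_clark("int:time-stamp"))  # {google:internal}time-stamp
print(to_clark("body"))            # body
```

The result can then be passed wherever a tag or attribute name is expected, for example to etree.Element or to find.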

I imagine soup does this process for you automatically.

Jtello