
The problem is this: I have an XML fragment like so:

<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>

In the result, I want to remove all <a> and <c> tags but retain their text content and child nodes exactly as they are. The <b> element should be left untouched. The result should then look like this:

<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>

For the time being, I'll resort to a very dirty trick: I'll serialize the fragment with etree.tostring, remove the offending tags via string replacement (or a regex), and replace the original fragment with the etree.fromstring result of that. This is not the real code, but it should go something like this:

from lxml import etree

fragment = etree.fromstring("<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>")
# Serialize to str (not bytes) so that str.replace works on Python 3
fstring = etree.tostring(fragment, encoding="unicode")
# Remove the opening and closing <a> and <c> tags textually
for tag in ("<a>", "</a>", "<c>", "</c>"):
    fstring = fstring.replace(tag, "")
fragment = etree.fromstring(fstring)

I know that I can probably use XSLT to achieve this, and I know that lxml supports XSLT, but surely there is a more lxml-native approach?

For reference: I've tried getting there with lxml's element.replace, but since I want to insert text where there was an element node before, I don't think I can do that.
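To illustrate why element.replace does not fit here: lxml has no separate text nodes. Text lives in an element's .text and .tail attributes, so removing a tag while keeping its text means moving strings between neighbouring elements. The sketch below shows that by hand. The helper name `unwrap` is hypothetical, not part of lxml, and it assumes the element has a parent, so it will not work on the root.

```python
from lxml import etree


def unwrap(element):
    """Hypothetical helper: remove `element` but keep its text, children and tail."""
    parent = element.getparent()  # assumption: element is not the root
    previous = element.getprevious()

    def append_before(text):
        # Text in front of a node lives in the previous sibling's tail,
        # or in the parent's text when there is no previous sibling.
        if not text:
            return
        if previous is not None:
            previous.tail = (previous.tail or "") + text
        else:
            parent.text = (parent.text or "") + text

    append_before(element.text)
    children = list(element)
    index = parent.index(element)
    for offset, child in enumerate(children):
        # Moving a child to the parent carries the child's tail along.
        parent.insert(index + offset, child)
    if children:
        children[-1].tail = (children[-1].tail or "") + (element.tail or "")
    else:
        append_before(element.tail)
    # lxml removes the element's own tail together with the element,
    # which is fine because the tail text was copied above.
    parent.remove(element)


fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)
# Collect the matches first, because the tree is modified while unwrapping.
for element in list(fragment.iter("a", "c")):
    unwrap(element)
result = etree.tostring(fragment, encoding="unicode")
print(result)
```

This is only meant to show the text/tail bookkeeping involved. A built-in function that does the same job is preferable if one exists.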

Thor

2 Answers


Try this: http://lxml.de/api/lxml.etree-module.html#strip_tags

>>> etree.strip_tags(fragment,'a','c')
>>> etree.tostring(fragment)
'<fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>'
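The session above can be written as a self-contained script. This is a minimal sketch assuming Python 3, where etree.tostring returns bytes unless encoding="unicode" is passed. The variable name `result` is only for illustration.

```python
from lxml import etree

fragment = etree.fromstring(
    "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"
)

# strip_tags modifies the tree in place and returns None.
# The <a> and <c> tags disappear; their text and tail are merged into the parent.
etree.strip_tags(fragment, "a", "c")

# encoding="unicode" yields a str instead of bytes on Python 3.
result = etree.tostring(fragment, encoding="unicode")
print(result)
# <fragment>text1 inner1 text2 <b>inner2</b> text3</fragment>
```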
Percival Ulysses
Kabie
  • Thanks, this works perfectly. The term "strip" didn't occur to me, or I might've found the answer myself :) – Thor Jan 13 '11 at 15:27
  • Also awesome: ``etree.strip_elements(fragment, *['tag1', 'tag2'])`` – mkelley33 Mar 01 '11 at 04:03
  • Exactly what I sought. Even better, strip_tags() accepts wildcards so that passing `"*"` as a tag removes all tags from the tree. Completely. – Jens Apr 06 '13 at 02:29
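The two functions mentioned in these comments behave differently, which the following sketch tries to make visible. strip_elements removes the matched elements together with their content, while strip_tags only unwraps them. As far as the lxml documentation states, the element passed in is never removed, even when it matches the wildcard. The variable names are illustrative.

```python
from lxml import etree

XML = "<fragment>text1 <a>inner1 </a>text2 <b>inner2</b> <c>t</c>ext3</fragment>"

# strip_elements drops the matched elements together with their content
# and, by default (with_tail=True), the tail text that follows them.
dropped = etree.fromstring(XML)
etree.strip_elements(dropped, "a", "c")
dropped_xml = etree.tostring(dropped, encoding="unicode")

# with_tail=False keeps the text that follows each removed element.
kept_tail = etree.fromstring(XML)
etree.strip_elements(kept_tail, "a", "c", with_tail=False)
kept_tail_xml = etree.tostring(kept_tail, encoding="unicode")

# strip_tags with the "*" wildcard unwraps every descendant element;
# the <fragment> element that was passed in stays in place.
flattened = etree.fromstring(XML)
etree.strip_tags(flattened, "*")
flattened_xml = etree.tostring(flattened, encoding="unicode")

for label, xml in [
    ("strip_elements", dropped_xml),
    ("strip_elements, with_tail=False", kept_tail_xml),
    ("strip_tags '*'", flattened_xml),
]:
    print(label, "->", xml)
```

For the question as asked, strip_tags is the one that keeps "inner1" in the output; strip_elements would discard it.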

Use lxml's Cleaner class to remove tags from HTML content. Below is an example of how to do what you want. For an HTML document, Cleaner is a better general solution than strip_elements, because in cases like this you usually want to strip out more than just the tag; you also want to get rid of things like onclick attributes on other tags.

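from lxml.html.clean import Cleaner  # on lxml 5.2+, this needs the separate lxml_html_clean package

cleaner = Cleaner()
# Remove <p> tags but keep their content in the parent element
cleaner.remove_tags = ['p']
# Illustrative input; clean_html also accepts a parsed document
cleaned = cleaner.clean_html('<div><p>some text</p></div>')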
From the Cleaner documentation for remove_tags:

A list of tags to remove. Only the tags will be removed; their content will get pulled up into the parent tag.

RredCat
pjoshi