0

this is not a duplicate question(How to delete “\r\n”, but only in certain sections of my text?) I'm looking for a more flexible solution;

this is not matter of a XML file; just for this sample it happened to be in xml form; it's a general question about how to implement the idea;

this is a sample of a very large text:

<nx1>home</nx1>
<nx2>living</nx2>
<out>text one
 text continues
 and at last!</out>
<m2>dog</m2>

How can I search for the <out>...</out> block and then within that block replace all \r\n characters with empty string(remove line breaks) and after that put the newly created text in place of the old <out>...</out> block?

so the result for this sample would be:

<nx1>home</nx1>
<nx2>living</nx2>
<out>text one text continues and at last!</out>
<m2>dog</m2>

How can I implement that in python using regex?

Community
  • 1
  • 1
wiki
  • 1,877
  • 2
  • 31
  • 47

3 Answers3

1

If you prefer using a regular expression, I would use a function as a replacement.

>>> import re
>>> def as_tag(match):
...     return match.group(1).replace('\r\n', '')
...
>>> text = '''
<nx1>home</nx1>
<nx2>living</nx2>
<out>text one 
 text continues
 and at last!</out>
<m2>dog</m2>
'''
>>> re.sub(r'(<out>[^<]*</out>)', as_tag, text)

Output

<nx1>home</nx1>
<nx2>living</nx2>
<out>text one text continues and at last!</out>
<m2>dog</m2>
hwnd
  • 69,796
  • 4
  • 95
  • 132
  • Why do you need the first half? Doesn't the second half catch the first? – bj0 May 27 '14 at 20:16
  • Interesting, I didn't realize you could use a callable as an argument for processing the match. – bj0 May 27 '14 at 20:20
1

Since this particular input is an XML, here's a solution involving lxml parser and normalize-space():

>>> from lxml import etree
>>> text = """
... <root>
...     <nx1>home</nx1>
...     <nx2>living</nx2>
...     <out>text one
...      text continues
...      and at last!</out>
...     <m2>dog</m2>
... </root>
... """
>>> tree = etree.fromstring(text)
>>> out = tree.find('out')
>>> out.text = out.xpath('normalize-space(text())')
>>> print etree.tostring(tree)
<root>
    <nx1>home</nx1>
    <nx2>living</nx2>
    <out>text one text continues and at last!</out>
    <m2>dog</m2>
</root>
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • Yeah, dude I proposed Python-xml too. but the question OP saids the question isn't about xml. The text he posted is just an example of text nothing more. Also I got a doen vote I don't know why? :( – Raydel Miranda May 27 '14 at 19:50
  • @RaydelMiranda yeah, probably the OP should have provided a different example. regex solution for modifying `XML/HTML` structure is not a good direction. Downvote - probably because it looks like a link-only answer without a specific code sample that solves the OP's issue. Thanks. – alecxe May 27 '14 at 19:53
  • I see no reason this can't be solved with an XML library. You can still use an XML library to find the out tag, and then operate on the string inside. – limasxgoesto0 May 27 '14 at 20:35
  • @limasxgoesto0 you mean that instead of `normalize-space()` you can get rid of newlines differently? yup, `normalize-space()` is just one option. Thanks. – alecxe May 27 '14 at 20:50
  • Not quite what I mean. I just mean that though OP says it's not an xml problem, it's still best solved using an xml library. Once you get to the actual text, what how to get rid of the new line is up to you. – limasxgoesto0 May 27 '14 at 20:53
0

I recommend you to use Python The ElementTree XML API. Take a look at the section: Modifying an XML File

Raydel Miranda
  • 13,825
  • 3
  • 38
  • 60
  • it's not just matter of XML file; just for this sample literal patters are in xml form; it's a general question – wiki May 27 '14 at 19:43
  • 1
    Ok, but you must to specify that in your question. – Raydel Miranda May 27 '14 at 19:45
  • 1
    @wiki Generally, regular expressions aren't the best approach for handling irregular data, like XML and HTML. So you're going to find folks recommend using an XML/HTML parser to handle parsing/editing them. – dano May 27 '14 at 19:46