1

My application has a string variable that contains XML data.
I am trying to remove all <product_desc></product_desc> tags using Regex.
Here is the value of the string variable:

<orderlines>
    <orderline>
        <id>1000001</id>
        <product_id>2004</product_id>
        <product_desc>ITEM2004
        Color: red
        Size: 150x10x10
        Material: iron
        </product_desc>
        <qnt>2</qnt>
    </orderline>
    <orderline>
        <id>1000002</id>
        <product_id>2012</product_id>
        <product_desc>ITEM2012</product_desc>
        <qnt>4</qnt>
    </orderline>
    <orderline>
        <id>1000003</id>
        <product_id>3000</product_id>
        <product_desc>DELIVERY</product_desc>
        <qnt>1</qnt>
    </orderline>
</orderlines>

When I use the following pattern:

Dim pattern As String = "(<product_desc>[\s\S]*</product_desc>)"
Dim newvalue As String = Regex.Replace(originvalue, pattern, "")

I get a result like this:

<orderlines>
    <orderline>
        <id>1000001</id>
        <product_id>2004</product_id>

        <qnt>1</qnt>
    </orderline>
</orderlines>

So the problem is that Regex matches everything between the first <product_desc> and the last </product_desc> and replaces it with an empty string. This removes all the <orderline> tags between them (check the value of the <qnt> tag).

Can anybody give me a tip on how to limit the removal to only the specific tag? The content of the tag can contain any characters, newlines and even HTML code.

Fabio
  • Why do you want to remove the tags? Using an XmlDocument has got to be easier and less prone to error. – Matt Wilko Jun 25 '14 at 08:04
  • The problem is that XmlDocument.Load raises an error because the content of the `<product_desc>` tag contains invalid `Unicode` characters. The application receives the XML file from outside. – Fabio Jun 25 '14 at 08:06
  • Are the invalid characters the CR and LF? In that case you can just do a `String.Replace` before loading the document – Matt Wilko Jun 25 '14 at 08:09
  • No, the invalid characters are `html-entities` which Unicode cannot read properly. The error is: `Character , which hexadecimal value is 0x03, is invalid...` – Fabio Jun 25 '14 at 08:21

3 Answers

2

Not an answer to your question, but a response to your comments. You can use a method like this, based on XmlConvert.IsXmlChar, to remove invalid XML characters from a string, and then load the result into an XmlDocument:

Public Shared Function RemoveInvalidXmlChars(xml As String) As String
    Dim validXmlChars = xml.Where(Function(x) XmlConvert.IsXmlChar(x)).ToArray()
    Return New String(validXmlChars)
End Function

Converted from this answer which has some other suggestions as well: How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?
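
A minimal usage sketch of the function above, assuming the XML arrives as a string (so `LoadXml` is used rather than `Load`). The module name and the sample input are made up for illustration; the input embeds the 0x03 control character mentioned in the error message:

```vbnet
Imports System
Imports System.Linq
Imports System.Xml

Module CleanThenLoadDemo
    ' Same function as above, repeated so the sketch is self-contained.
    Function RemoveInvalidXmlChars(xml As String) As String
        Dim validXmlChars = xml.Where(Function(x) XmlConvert.IsXmlChar(x)).ToArray()
        Return New String(validXmlChars)
    End Function

    Sub Main()
        ' Hypothetical input: Convert.ToChar(3) is the 0x03 character the parser rejects.
        Dim raw As String = "<orderline><product_desc>ITEM" & Convert.ToChar(3) & "2004</product_desc></orderline>"

        ' Strip the invalid characters first, then parse the cleaned string.
        Dim doc As New XmlDocument()
        doc.LoadXml(RemoveInvalidXmlChars(raw))

        ' The element survives; only the invalid character is gone.
        Console.WriteLine(doc.SelectSingleNode("//product_desc").InnerText)
    End Sub
End Module
```

With this input the element text comes out as ITEM2004, so the tag itself no longer has to be removed just to get the document loaded.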

Matt Wilko
  • Thanks, this solution works. Sorry, but since the question was about a `Regex` approach, I will accept @zx81's answer – Fabio Jun 25 '14 at 08:49
  • Ironically, I am going to use your solution :). Even though it is slower than `Regex`, it is the logically right approach to `xml parsing` – Fabio Jun 25 '14 at 09:25
  • Good decision... Your code is more readable, and more easily adaptable. As an added bonus, once you've stripped illegal characters, do you really need to kill the element at all? Also, how are these invalid characters getting into your data in the first place - maybe there is something to look at there... – Martin Milan Jun 25 '14 at 09:40
  • @MartinMilan, after the illegal characters were stripped, I do not remove the element - you are right, there is no need anymore. The invalid data comes from outside the application, so I cannot do anything about that. – Fabio Jun 25 '14 at 12:18
1

The problem: [\s\S]* is greedy

It matches every single char to the end of the string, then the engine backtracks to allow </product_desc> to match. Therefore, there is one single match from the first opening tag to the last closing tag.

The solution (if we're doing regex): a lazy quantifier

With all the warnings and disclaimers about using regex to parse xml... You can do this:

  • Adding a ? to a quantifier makes it "lazy", so that it matches only as many chars as necessary.
  • You can use .*? in DOTALL mode (as in the sample code below) or [\s\S]*? (but there is no point).

Sample code

Dim ResultString As String
Try
    ResultString = Regex.Replace(SubjectString, "(?s)<product_desc>.*?</product_desc>", "")
Catch ex As ArgumentException
    'Syntax error in the regular expression
End Try
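
To see the difference between the greedy and the lazy quantifier, here is a small sketch that counts the matches each pattern finds. The module name and the shortened two-element sample string are made up for illustration; the patterns are the one from the question and its lazy variant:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazyDemo
    Sub Main()
        ' Hypothetical sample, shortened from the question's data: two product_desc elements,
        ' the first one spanning a line break.
        Dim first As String = "<product_desc>ITEM2004" & Environment.NewLine & "Color: red</product_desc><qnt>2</qnt>"
        Dim second As String = "<product_desc>ITEM2012</product_desc><qnt>4</qnt>"
        Dim xml As String = "<a>" & first & second & "</a>"

        ' Greedy: runs from the first opening tag to the last closing tag.
        Dim greedyCount As Integer = Regex.Matches(xml, "<product_desc>[\s\S]*</product_desc>").Count

        ' Lazy: stops at the nearest closing tag, so each element is matched separately.
        Dim lazyCount As Integer = Regex.Matches(xml, "<product_desc>[\s\S]*?</product_desc>").Count

        Console.WriteLine("Greedy matches: " & greedyCount)
        Console.WriteLine("Lazy matches: " & lazyCount)
    End Sub
End Module
```

The greedy pattern reports a single match that swallows the `<qnt>2</qnt>` in between, while the lazy pattern reports two matches, one per element.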

zx81
0

I would use an XML API like Linq2Xml (XDocument and friends) to do this sort of thing. Why reinvent the wheel?
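
As a rough sketch of what that could look like, assuming the input is already well-formed XML that `XDocument.Parse` accepts (the module name and the shortened sample data are made up for illustration):

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module RemoveElementsDemo
    Sub Main()
        ' Hypothetical sample, shortened from the question's data.
        Dim line1 As String = "<orderline><id>1000001</id><product_desc>ITEM2004</product_desc><qnt>2</qnt></orderline>"
        Dim line2 As String = "<orderline><id>1000002</id><product_desc>ITEM2012</product_desc><qnt>4</qnt></orderline>"
        Dim xml As String = "<orderlines>" & line1 & line2 & "</orderlines>"

        Dim doc As XDocument = XDocument.Parse(xml)

        ' Remove every product_desc element, wherever it sits in the tree.
        doc.Descendants("product_desc").Remove()

        ' Both orderline elements and their qnt values are still there.
        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting))
    End Sub
End Module
```

Because the removal works on parsed elements rather than on text, there is no risk of a match running across several `<orderline>` elements.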

Martin Milan
  • But creating the `XDocument` raises an error (check my comments under the question), so it is not a solution in this case – Fabio Jun 25 '14 at 08:46