
I have an XML file whose attribute values are not enclosed in double quotes. A sample is shown below; it covers the kinds of values that can occur. I tried searching with the regex *=\s*([^" >]+) and replacing with ="\1", which works for the most part, but it has two issues.
Any help with these would be appreciated.

  1. It doesn't wrap empty values (e.g. status) in double quotes ("").
  2. When a value is a sentence (e.g. description), it quotes only the first word.

Sample input:

<tool id=2 code=abc description=my description end here my_levels=$15,000/$30,000 individual_level= amount=0 status= my_code=P my_date=2017-02-21T00:00:00 points= />

Expected result:

<tool id="2" code="123abc" description="my description end here" my_levels="$15,000/$30,000" individual_level="" amount="0" status="" my_code="P" my_date="2017-02-21T00:00:00" points="" />
zx485
KKR
  • You are probably not going to be able to solve this with regex. The generation of the invalid XML has discarded some information. There are unresolvable ambiguities since attribute values could conceivably contain equals-sign characters (you cannot be sure they don't). The only rational solution is to fix the generation of the XML at the source, which is where the attribute values are known unambiguously. – Jim Garrison Feb 23 '17 at 20:13
  • I am sure that we don't have = in the attribute values in our XML. – KKR Feb 23 '17 at 22:57

1 Answer


This may be beyond regex, but as long as you definitely don't have any equals symbols in your values the following should work:

Search: \b(\w+)=((?:\s*[^=>]+\b(?!=))+)?(\s+|\/?>)

Replace: $1="$2"$3

  • \b matches a word boundary http://www.regular-expressions.info/wordboundaries.html
  • (\w+) matches one or more word characters and captures as 'group 1' - referenced in the replace as $1
  • ( start 'group 2' - referenced in the replace as $2
    • (?: start a group, but do not capture - we do this so we can use the + char to repeat at the end
      • \s* matches zero or more whitespace characters
      • [^=>]+ matches one or more characters that are not = or >
      • \b matches another word boundary - without this it will continue matching part of the next property
      • (?!=) makes sure that the next character is not =. This is known as a negative lookahead; be careful with these, as they are an easy way to make a regex inefficient. http://www.regular-expressions.info/lookaround.html
    • )+ closes the non-capturing group and matches it one or more times
  • )? closes group 2 and makes it optional using the ? character
  • (\s+|\/?>) makes sure the match ends with whitespace or the end of a tag, and captures this as 'group 3' - referenced in the replace as $3
    • \s+ whitespace or
    • \/? optional forward slash for self-closing tags
    • > end of tag

See it in action here: https://regex101.com/r/zYdzQB/2
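
If you would rather run the search and replace from a script than from an editor, here is a minimal Python sketch of the same idea. The pattern is the one above; `quote_attributes` and `sample` are hypothetical names chosen for illustration, and Python's `re` uses `\1` instead of `$1`, so a replacement function is used here to build the `$1="$2"$3` string and to turn a missing group 2 into an empty value.

```python
import re

# Same search pattern as in the answer above.
ATTR_PATTERN = re.compile(r'\b(\w+)=((?:\s*[^=>]+\b(?!=))+)?(\s+|\/?>)')


def quote_attributes(tag_text):
    """Wrap every unquoted attribute value in double quotes.

    Assumes that no attribute value contains '=' or '>'.
    """
    def add_quotes(match):
        name = match.group(1)
        value = match.group(2) or ""   # group 2 is None for empty values
        tail = match.group(3)          # trailing whitespace, '>' or '/>'
        return f'{name}="{value}"{tail}'

    return ATTR_PATTERN.sub(add_quotes, tag_text)


# Sample input from the question.
sample = ('<tool id=2 code=abc description=my description end here '
          'my_levels=$15,000/$30,000 individual_level= amount=0 status= '
          'my_code=P my_date=2017-02-21T00:00:00 points= />')

print(quote_attributes(sample))
```

Note that the regex only adds quotes and never changes a value, so `code=abc` comes out as `code="abc"`; the `"123abc"` in the question's expected result looks like a typo there.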

Some caveats:

  • You will need to check the results carefully.
  • You should not automate this; it is not an efficient way of solving the problem, but if you have a single broken file to fix then it may be suitable.
  • If you have any chance of reviewing how the data was generated and fixing it there, you would be much better off doing that.
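
As a rough first pass at checking the results, you could at least confirm that the repaired text parses as XML. This is only a sketch: `is_well_formed` is a hypothetical helper built on Python's standard `xml.etree.ElementTree`, and the two strings are made-up examples. The second one shows the kind of output you might be left with if a value did contain an equals sign (for instance `note=a=b`).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


repaired = '<tool id="2" code="abc" status="" />'
still_broken = '<tool note=a="b" />'   # an unquoted attribute value remains

print(is_well_formed(repaired))      # True
print(is_well_formed(still_broken))  # False
```

A successful parse only tells you the quoting is syntactically valid. It cannot tell you whether a value was split in the wrong place, so a manual review is still needed.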
Theo