Note: To clear up any confusion, I have already-parsed XML as a String that I would like to apply a regex against. Any mention of XML in my question simply refers to that parsed XML string.

I have an XML string processed (parsed) by Java 7's TransformerFactory with indentation enabled (indent amount of 4). I need to replace the whitespace before each XML tag with tabs, one tab per group of 4 spaces (4 spaces become 1 tab, 8 spaces become 2 tabs, and so on).

The regex must not match whitespace inside the attribute values of XML tags; some attributes contain one or more spaces. So far, every regex I have tried matches either all whitespace or none. I have even tried positive/negative lookahead and lookbehind with no luck (I am not good with regex).

The sample regex matches all whitespace (see the regex101 link below).

I have tried several regular expressions:

( {4})             // matches everywhere
^(\s{4})+          // for 12 spaces, the first 8 are the full match; not good
(?<![\d])( {4})    // negative lookbehind checks only 1 character; not enough

Here is the regex101 link: https://regex101.com/r/VR4Nbf/2

The TransformerFactory config is as follows:

transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

The above Transformer configuration works as expected. There is no way to tell the Transformer to use tabs instead of spaces unless you override the required methods, which seems like overkill if a regex can do the job.

While I understand that the XML specification treats this whitespace as acceptable, the way the XML file is used in my case requires it to be indented with tabs, not spaces.

An ideal regex would:

  1. Not match anything inside an XML tag
  2. Match each occurrence of 4 spaces so that it can be replaced with 1 tab (I used replaceAll)
  3. Be Java-based
  4. Preferably be usable with replaceAll
  5. Need to be applied only once rather than repeatedly (irrespective of the level of nesting)
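
For illustration, here is a minimal sketch of one way the requirements above might be met in a single pass without replaceAll. It matches each run of leading 4-space groups per line and emits one tab per group through a Matcher loop. The class and method names (LeadingIndentToTabs, convert) are hypothetical, and the sketch assumes the indentation consists only of spaces at the start of a line. A line that starts inside a multi-line attribute value or text node would be converted as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class LeadingIndentToTabs {
    // One or more groups of four spaces at the start of a line.
    // MULTILINE makes ^ match after every line break, not only at the start of the input.
    private static final Pattern LEADING =
            Pattern.compile("^(?: {4})+", Pattern.MULTILINE);

    static String convert(String xml) {
        Matcher m = LEADING.matcher(xml);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this usable on Java 7
        while (m.find()) {
            // Each group of four matched spaces becomes one tab.
            int tabs = m.group().length() / 4;
            StringBuilder replacement = new StringBuilder();
            for (int i = 0; i < tabs; i++) {
                replacement.append('\t');
            }
            m.appendReplacement(out, replacement.toString());
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<root>\n    <item name=\"a    b\">\n        <leaf/>\n    </item>\n</root>";
        System.out.println(convert(xml));
    }
}
```

Because the pattern is anchored to the start of each line, the four spaces inside the name attribute in the sample are never candidates for a match.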

Note: Making use of XSLT is not feasible at this stage.

Thanks.

Raf
  • `^(?: {4})+` doesn't work for you? – CAustin Oct 21 '17 at 00:42
  • Nope. I need a regex that matches 4 spaces before an opening tag. I feed such a pattern to `xmlStr.replaceAll("pattern", "\t");` so that I preserve the indentation and convert all pre-tag spaces to tabs. – Raf Oct 21 '17 at 02:22
  • Don't parse XML using regex; use a real XML parser. Matching attribute values via regex after XML is parsed is fine, but not before. See [**How to retrieve element value of XML using Java?**](http://stackoverflow.com/questions/4076910/how-to-retrieve-element-value-of-xml-using-java) or [**How to read XML using XPath in Java**](http://stackoverflow.com/questions/2811001/how-to-read-xml-using-xpath-in-java) – kjhughes Oct 21 '17 at 02:39
  • Where in the question have I stated that I am parsing XML using regex? I have a parsed XML as string that I would like to use regex against. I don't see how the two links you provided are relevant to my question? I don't see how this question is duplicate? Can you give some more details on why you marked it as duplicate please if you don't mind? Cheers. – Raf Oct 21 '17 at 04:10
  • I see the problem, and I think I've got something for you. Try ` {4}(?= *<)`. Note that the first character is a space. – CAustin Oct 22 '17 at 03:23
  • @CAustin thanks for the pattern. Works like a charm. I found this pattern, `(?:\G|^) {4}`, which does the job too, but yours is simpler and more readable. – Raf Oct 23 '17 at 13:55
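
For readers who want to try the two patterns from the comments above, here is a hedged sketch of both used with String.replaceAll. The class and method names (TabPatterns, viaLookahead, viaContinuation) are hypothetical. The `(?m)` flag in the second pattern is an assumption added here so that `^` matches at every line start in Java; the comment gives the pattern without it.

```java
// Hypothetical helper, not part of any library.
final class TabPatterns {
    // Lookahead pattern: four spaces followed by optional spaces and then '<'.
    static String viaLookahead(String xml) {
        return xml.replaceAll(" {4}(?= *<)", "\t");
    }

    // \G pattern: four spaces at a line start, or directly after the previous match.
    // (?m) is added here (an assumption) so that ^ applies to each line.
    static String viaContinuation(String xml) {
        return xml.replaceAll("(?m)(?:\\G|^) {4}", "\t");
    }

    public static void main(String[] args) {
        String xml = "<root>\n    <item name=\"a    b\">\n        <leaf/>\n    </item>\n</root>";
        System.out.println(viaLookahead(xml));
        System.out.println(viaContinuation(xml));
    }
}
```

Both leave the spaces inside the name attribute alone on this sample. They differ on mixed content: for an input such as `<p>text    <b/></p>`, the lookahead version replaces the four spaces before `<b/>` with a tab, while the `\G` version leaves them untouched because they are not at the start of a line.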

0 Answers