1

I'm trying to get the strings between two tags in an XML file by adapting a solution I found here.

This is the batch file I have:

@echo off
setlocal EnableDelayedExpansion

(for /F "delims=" %%a in ('findstr /I /L "<Name>" contacts.xml') do (
   set "line=%%a"
   set "line=!line:*<Name>=!"
   for /F "delims=<" %%b in ("!line!") do echo %%b
)) > list.txt

When the XML is formatted, I get all the names:

<List>
   <Contacts>
      <Row>
         <Name>Carlos</Name>
         <Path>\Some\path\1</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Fernando</Name>
         <Path>\Some\path\2</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Luis</Name>
         <Path>\Some\path\3</Path>
         <Hidden>False</Hidden>
      </Row>
      <Row>
         <Name>Daniel</Name>
         <Path>\Some\path\4</Path>
         <Hidden>False</Hidden>
      </Row>
   </Contacts>
</List>

Carlos

Fernando

Luis

Daniel

But when the XML is on a single line (which is how it's generated), I only get the first name:

<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>

Carlos

What changes should I make to the batch file so that it correctly parses unformatted XML files?

Carlos Escalera Alonso
  • Drop `for` and treat your file as a single line (it doesn't matter whether it is or not), using a single environment variable and `goto` to repeat until you find ``. BTW, after the debates about whether regex may/should be used to parse HTML/XML, we're moving to the next step: batch files! ;) LOL. To be serious, you can do this only in very well-controlled situations. An empty name may be ``, a string may contain `&character;`, or there may be `CDATA` (at least). – Adriano Repetti Mar 26 '15 at 12:13
  • Hi Adriano, thanks for the explanation, but I'm going to need a little more detail; I'm not very skilled with this topic. – Carlos Escalera Alonso Mar 26 '15 at 13:59
  • Here for splitting: http://stackoverflow.com/a/14621732/1207195 and to read the file: http://stackoverflow.com/q/15481078/1207195 (just two examples, not necessarily the best ones). Note that they don't address the other _issues_ I mentioned, so you may need to handle those by hand. As an option, did you consider a VBScript/JScript script invoked from your batch file? – Adriano Repetti Mar 26 '15 at 14:06
  • Thanks for the resources, Adriano. For now I can't use any tool other than native batch on this system. – Carlos Escalera Alonso Mar 27 '15 at 09:34

3 Answers

4

As Adriano implied in his comment, parsing XML via a powerful tool like regular expressions is frowned upon. Parsing XML with batch is far worse.

Pure, native batch cannot work with lines of text longer than 8191 bytes unless you use extraordinary techniques involving the FC command - trust me, you don't want to go there. There is no reason to expect an XML file to be smaller than 8191 bytes, so the short answer is essentially - you cannot parse unformatted XML that exists as one continuous line using native batch commands.

I have written a script-based regular expression utility for batch called JREPL.BAT. It is a hybrid JScript/batch script that runs natively on any Windows machine from XP onward. I recommend putting JREPL.BAT in a folder (I use c:\utils) and then including that folder in your PATH variable.

The following JREPL.BAT command can be used to parse out your names under most simple scenarios, assuming you never have nested <Name> elements. But like any regular expression "solution", this code is not robust for all situations.

jrepl "<Name>([\s\S]*?)</Name>" "$1" /m /jmatch /f "contacts.xml" /o "list.txt"

Since JREPL is a batch script, you must use CALL JREPL if you want to use the command within another batch script.
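
To make the behaviour of that regular expression easier to see, here is a minimal Python sketch of the same non-greedy pattern. It is only an illustration and not part of JREPL: the helper name `extract_names` and the sample string are assumptions, and it shares the limitations mentioned above (no nested <Name> elements, no handling of CDATA or entities).

```python
import re

# Same non-greedy pattern as in the JREPL command: capture everything
# between <Name> and the nearest following </Name>, across line breaks.
NAME_PATTERN = re.compile(r"<Name>([\s\S]*?)</Name>")


def extract_names(xml_text):
    """Return the captured text of every <Name>...</Name> match (hypothetical helper)."""
    return NAME_PATTERN.findall(xml_text)


if __name__ == "__main__":
    # Shortened, made-up sample in the shape of the question's one-line XML.
    sample = (
        "<List><Contacts>"
        "<Row><Name>Carlos</Name><Path>\\Some\\path\\1</Path></Row>"
        "<Row><Name>Fernando</Name><Path>\\Some\\path\\1</Path></Row>"
        "</Contacts></List>"
    )
    for name in extract_names(sample):
        print(name)
```

Because the quantifier is non-greedy, each match stops at the first closing tag, which is why several names on one line are returned separately.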

dbenham
  • Thanks for the excellent explanation. It seems that I won't be able to do it with pure batch. For now I can't add any utility to the system, but I'll try your solution as soon as I can. – Carlos Escalera Alonso Mar 28 '15 at 00:13
3

Before I answer, I should point out that your single-line XML is missing a </Row> close tag, and all <Name> elements contain Carlos. So, in testing my answer, I used the following XML:

<List><Contacts><Row><Name>Carlos</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Fernando</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Luis</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row><Row><Name>Daniel</Name><Path>\Some\path\1</Path><Hidden>False</Hidden></Row></Contacts></List>

Whenever you're manipulating or extracting data from XML or HTML, I think it's generally preferable to parse it as XML or HTML, rather than trying to scrape bits of text from it. Regardless of whether your XML is beautified or minified, if you parse XML as XML, your code still works. The same can't be said for regexp or token searches.

Pure batch doesn't handle XML all that well. But Windows Script Host does. Your best bet would be to employ JScript or VBScript, or possibly PowerShell. My solution is a batch + JScript hybrid script, employing the Microsoft.XMLDOM COM object and an XPath query to select the text child nodes of all the <Name> nodes -- basically, selectNodes('//Name/text()').

Save this with a .bat extension and salt to taste.

@if (@CodeSection == @Batch) @then

@echo off
setlocal

set "xmlfile=test.xml"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%xmlfile%"') do (
    echo Name: %%~I
)

rem // end main runtime
goto :EOF

@end
// end batch / begin JScript chimera

var DOM = WSH.CreateObject('Microsoft.XMLDOM');

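with (DOM) {
    // async must be disabled before load() so the document is fully
    // parsed (and parseError is meaningful) when load() returns
    async = false;
    load(WSH.Arguments(0));
    setProperty('SelectionLanguage', 'XPath');
}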

if (DOM.parseError.errorCode) {
   WSH.Echo(DOM.parseError.reason);
   WSH.Quit(1);
}

for (var d = DOM.documentElement.selectNodes('//Name/text()'), i = 0; i < d.length; i++) {
    WSH.Echo(d[i].data);
}
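
The point about parsing XML as XML can also be sketched outside Windows Script Host. The following Python snippet is an illustration only: it uses the standard `xml.etree.ElementTree` module in place of Microsoft.XMLDOM, and the helper name `names_from_xml` is an assumption.

```python
import xml.etree.ElementTree as ET


def names_from_xml(xml_text):
    """Parse the document and return the text of every <Name> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # './/Name' selects Name elements at any depth below the root, much like
    # the '//Name' part of the XPath query in the script above. Elements
    # without text are skipped, as they have no text node to select.
    return [node.text for node in root.iterfind(".//Name") if node.text]
```

Since the input goes through a real parser, the result should be the same whether the file is indented or written on one line.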
rojo
  • Very nice! A proper, robust way to do this. Much better than using regular expressions as I did. – dbenham Mar 26 '15 at 16:14
  • Great. There are a lot of questions regarding parsing/editing XML files. A tool for common usage would ease the life of a lot of people. I'm planning to create one, but lately I don't have much free time. `MSXML2.XMLHTTP` objects are also an option worth considering, but the DOM parser is better. – npocmaka Mar 26 '15 at 21:10
  • Thanks Rojo, I fixed the one-line XML. This seems like a good solution. I only know the basics about this, but I'll try it. Excellent description of the problem. – Carlos Escalera Alonso Mar 27 '15 at 09:32
  • @carlos: Just curious, did you try this script? What was it that ultimately led you to choose Aacini's solution? – rojo Mar 27 '15 at 12:22
  • Well, the problem is that I can't use anything but pure batch on this system, and Aacini's solution worked fine. I'll try yours as soon as I can, since it seems like a proper way to do it. – Carlos Escalera Alonso Mar 28 '15 at 00:16
  • I see. Yeah, give it a shot. Unless you're running Windows 95 or NT 4.0 without the Windows Script Host update, it ought to work. – rojo Mar 28 '15 at 00:19
2

Batch files are strongly tied to the format of the data they process. If the data changes, a new Batch file is usually required. The pure Batch file below extracts the names from your example unformatted XML file, as long as the line is shorter than 8190 characters.

@echo off
setlocal EnableDelayedExpansion

for /F "delims=" %%a in (contacts.xml) do (
   set "line=%%a"
   for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:>=%%~X!") do (
      if /I "!field!" equ "<Name" for /F "delims=<" %%c in ("%%b") do echo %%c
      set "field=%%b"
   )
)

EDIT: Some explanations added

This solution uses an interesting trick that consists of replacing a character in a string with a line feed (ASCII 10) character and then passing the result to a for /F command. In this way, the parts of the original string delimited by that character are processed as individual lines.

This is the simplest example of such a method:

@echo off
setlocal EnableDelayedExpansion

set "line=Line one|Line two|Line three|Line four"

for %%X in (^"^
% Do NOT remove this line %
^") do for /F "delims=" %%b in ("!line:|=%%~X!") do echo %%b

The first for %%X is the way to assign a line feed character to the %%X replaceable parameter. After that, the !line:|=%%~X! part replaces each | character with a line feed. Finally, the second for /F command processes the resulting lines in the usual way.
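
For readers less familiar with batch syntax, the logic of the name-extracting script can be roughly sketched in Python. This is an illustration under assumptions, not a replacement for the Batch file: the helper name `names_by_splitting` is made up, and the sketch ignores batch-specific details such as the line length limit and for /F's handling of special characters.

```python
def names_by_splitting(line):
    """Split on '>' and look at the previous piece, as the batch file does (hypothetical helper)."""
    names = []
    previous = ""
    # Splitting on '>' plays the role of replacing each '>' with a line feed
    # and letting for /F iterate over the resulting lines.
    for piece in line.split(">"):
        if not piece:
            continue  # for /F skips empty lines
        # Case-insensitive comparison, like `if /I "!field!" equ "<Name"`.
        if previous.lower() == "<name":
            # Keep only the text before the next '<', like "delims=<".
            names.append(piece.split("<", 1)[0])
        previous = piece
    return names
```

The key design point is that each piece is judged by what came immediately before it: a piece is a name only when the previous piece was the opening `<Name` tag.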

Aacini