I wrote the code below to read XML files (file_1.xml and file_2.xml), extract the string between the tags, and write it to a TXT file. The issue is that some strings include double quotation marks, and the program then treats these characters as batch syntax rather than as part of the strings...

Content of file_1.xml :

<AAA>C086002-T1111</AAA>
<AAA>C086002-T1222 </AAA>
<AAA>C086002-TR333 "</AAA>
<AAA>C086002-T5444  </AAA>

Content of file_2.xml :

<AAA>C086002-T5555 </AAA>
<AAA>C086002-T1666</AAA>
<AAA>C086002-T1777 "</AAA>
<AAA>C086002-T1888          "</AAA>

My code :

@echo off

setlocal enabledelayedexpansion

for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (

    for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (

        (for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (

            set code=%%a
            set code=!code:""=!
            set code=!code: =!
            echo !code!

        )) >> result.txt
    )
)

I get this in result.txt :

C086002-T1111
C086002-T1222
C086002-T5444
C086002-T5555
C086002-T1666

In fact, 3 out of the 8 lines are missing. These lines include double quotation marks or follow lines that include double quotation marks...

How can I make the script treat these characters as part of the strings?

wiltomap
  • Ouch. Why the down vote? It may not be good code. You can argue one shouldn't use batch to parse XML. But the question seems well thought out, and reasonably well stated. The OP obviously took time to self help and was able to diagnose the problem, but couldn't find a solution. It seems like a good question to me. – dbenham Nov 03 '14 at 16:46
  • @dbenham : because http://stackoverflow.com/questions/26676043/dos-batch-append-xml-tags-in-unique-txt-file – Magoo Nov 03 '14 at 16:51

2 Answers

Please note - parsing XML with batch is a risky business because XML generally ignores white space. Any script you write could probably be broken by simply reformatting the XML into another equivalent valid form. That being said...

I haven't traced the problem through to fully explain your observed behavior, but the unbalanced quote is causing a problem with this line:

(for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "%%z" ^| Findstr /r "<AAA>"') do (

You can eliminate that problem and get your code to sort of work by stripping all quotes beforehand:

@echo off

setlocal enabledelayedexpansion
del result.txt 2>nul
for /f "delims=;" %%f in ('dir /b D:\depart\*.xml') do (
  for /f "usebackq delims=;" %%z in ("D:\depart\%%f") do (
    set code=%%z
    set code=!code:"=!
    set code=!code: =!
    (for /f "delims=<AAA></AAA> tokens=2" %%a in ('echo "!code!" ^| Findstr /r "<AAA>"') do (
      echo %%a
    )) >> result.txt
  )
)

But you have a potential major problem. DELIMS does not specify a string - it specifies a list of characters. So your DELIMS=<AAA></AAA> is equivalent to DELIMS=<>/A. If your element value ever has an A or / in it, then your code will fail.
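
To make the character-list behavior concrete, here is a minimal sketch. The value C0A6/T1 is a made-up example (not from your files) chosen because it contains both an A and a /. Both loops should print C0, because the value is cut at the first delimiter character in either spelling of DELIMS:

```batch
@echo off
rem Hypothetical value "C0A6/T1" contains an A and a /, both of which are delimiters here.
rem DELIMS is read as a set of characters, so these two option strings behave the same.
for /f "tokens=1 delims=<AAA></AAA>" %%a in ("<AAA>C0A6/T1</AAA>") do echo %%a
for /f "tokens=1 delims=<>/A" %%a in ("<AAA>C0A6/T1</AAA>") do echo %%a
```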

There is a much better way:

First off, you can use FINDSTR to collect all your <AAA>----</AAA> lines from all files in one pass, without any loop:

findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"

Each matching line will be output as the file path, followed by a colon, followed by the matching line, as in:

D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>

The file path can never contain < or >, so you can use the following to iterate the result, capturing the appropriate token:

for /f "delims=<> tokens=3" %%A in ( ...
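
As a quick illustration of why token 3 is the one to capture, this sketch splits the sample FINDSTR output line shown above. The line is hard-coded as a literal string here purely for demonstration; in the real script it comes from FINDSTR:

```batch
@echo off
rem Split one sample FINDSTR output line on < and > to show what each token holds.
for /f "tokens=1-3 delims=<>" %%A in ("D:\depart\file_1.xml:<AAA>C086002-T1111</AAA>") do (
  echo token 1: %%A
  echo token 2: %%B
  echo token 3: %%C
)
```

Token 1 is the file path with its trailing colon, token 2 is the tag name AAA, and token 3 is the element value.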

Finally, you can put parentheses around the entire loop, and redirect just once. I'm assuming you want each run to create a new file, so I use > instead of >>.

@echo off
setlocal enabledelayedexpansion
>result.txt (
  for /f "delims=<> tokens=3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do (
    set code=%%A
    set code=!code:"=!
    set code=!code: =!
    echo(!code!
  )
)

Assuming that you only need to trim leading or trailing spaces/quotes, then the solution is even simpler. It does require odd syntax to specify a quote as a DELIM character. Note that there are two spaces between the last ^ and %%B. The first escaped space is taken as a DELIM character. The unescaped space terminates the FOR /F options string.

@echo off
>result.txt (
  for /f "delims=<> tokens=3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do for /f delims^=^"^  %%B in ("%%A") do echo(%%B
)

UPDATE in response to comment

I'm assuming your data value will never contain a colon.

If you want to append the source file name to each line of output, then you simply need to alter the first FOR /F to capture the first token (the source file) as well as the third token (the data value). The first token will contain the full path as well as a trailing colon. The second FOR /F appends the file name to the data string using the ~nx modifier to get just the name and extension (no drive or path), and a colon is added to the DELIMS option so the trailing colon is trimmed off.

@echo off
>result.txt (
  for /f "delims=<> tokens=1,3" %%A in (
    'findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"'
  ) do for /f delims^=:^"^  %%C in ("%%B;%%~nxA") do echo %%C
)
dbenham
  • Thank you very much @dbenham ! Your code looks pretty good and concise. I would just need to echo the name of each XML file parsed following the `echo (%%B)`. How could I include that to your code ? Example of content in result.txt : `C086002-T346;86002_2014_1.xml`. Thanks for your advice ! – wiltomap Nov 04 '14 at 08:20

If I keep @dbenham's suggestion and extend it to echo the filename:

@echo off
>result.txt (
    for /f %%f in ("D:\depart\*.xml") do (
        for /f "delims=<> tokens=3" %%A in ('findstr /r "<AAA>.*</AAA>" "D:\depart\*.xml"') do (
             for /f delims^=^"^  %%B in ("%%A") do (
               echo %%B;%%f
             )
         )
     )
 )

What do you think of this code? Thanks!

wiltomap
  • No, that can't work. You can either iterate all files and call FINDSTR for each individual file, or you can let FINDSTR search all files in one step. But you shouldn't iterate all the files, and then repeatedly have FINDSTR search all the files. – dbenham Nov 04 '14 at 12:07
  • OK. So how could I get filename in the final `echo` ? I'm looking for a solution but can't find it for now... – wiltomap Nov 04 '14 at 12:27
  • What do you think about replacing `D:\depart\*.xml` inside the `findstr` by `%%f` ? This should be more logical... – wiltomap Nov 04 '14 at 12:30
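
For reference, the first option mentioned in the comments above (iterate the files and call FINDSTR once per file) might look roughly like the sketch below. It is untested and rests on a few assumptions: each <AAA> tag starts at the very beginning of its line, as in the sample files, and file names contain no characters that are special to batch. Because FINDSTR is given a single file, it does not prefix the file name, so the value is token 2 rather than token 3. The simple FOR variable %%F holds the file path, and %%~nxF yields just the name and extension.

```batch
@echo off
rem Sketch only: one FINDSTR call per file, so the file name is available in %%F.
>result.txt (
  for %%F in ("D:\depart\*.xml") do (
    rem Single-file FINDSTR output has no path prefix: token 1 is AAA, token 2 is the value.
    for /f "tokens=2 delims=<>" %%A in ('findstr /r "<AAA>.*</AAA>" "%%~fF"') do (
      rem Trim quotes and spaces from the value, then append the file name.
      for /f delims^=^"^  %%B in ("%%A") do echo %%B;%%~nxF
    )
  )
)
```

Trimming the value in its own FOR /F before appending the file name keeps the space and quote delimiters from cutting off the ;filename part.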