I need to parse XML files that have a number of invalid characters in them. Here is the VB6/VBA code I use to parse a file and replace the invalid characters:

Dim xmldoc As MSXML2.DOMDocument
Dim xmlNode As MSXML2.IXMLDOMNode
Dim xmlNodeList As MSXML2.IXMLDOMNodeList
Dim XML As String
Dim fno As Integer

' get the XML file
fno = FreeFile
Open "input.xml" For Input As #fno
XML = Input(LOF(fno), fno)
Close #fno

TOP_OF_CODE:
Set xmldoc = New MSXML2.DOMDocument60
xmldoc.LoadXML XML
Set xmlNodeList = xmldoc.getElementsByTagName("*")
For Each xmlNode In xmlNodeList

    (a bunch of code to parse the XML)

Next xmlNode

If xmldoc.parseError.errorCode <> 0 And xmldoc.parseError.reason = "An invalid character was found in text content." & vbCrLf Then
    ' invalid character was found
    ptr = xmldoc.parseError.filepos
    XML = Left(XML, ptr - 1) & "x" & Mid(XML, ptr + 1)
    Set xmldoc = Nothing
    GoTo TOP_OF_CODE
End If

Much of the time the code works exactly as intended: each of the invalid characters is removed iteratively and then the parsing takes place. Sometimes, however, things seem to get "stuck": each time it detects an invalid character at the same position even after I've replaced the invalid character with a valid one. I have tried inserting various characters to replace the invalid one, and have also simply deleted that character position. I still get an invalid character error at the same place. Any clues?

  • That doesn't look like the real code. For example, you `set xmldoc = Nothing` and then `GoTo TOP_OF_CODE`. But the first thing that happens at TOP_OF_CODE is `xmldoc.LoadXML XML`, which would result in an "Object variable or With block variable not set" error. Please post a better sample. – tcarvin Aug 01 '12 at 19:50
  • Please replace `(a bunch of code to parse the XML)` with the code you are using to replace invalid characters. – JimmyPena Aug 01 '12 at 20:26
  • In paring down the sample for posting I left out some essential code. My apologies. – user1569251 Aug 01 '12 at 20:35
  • Okay, I've edited the code and I think everything is there now. – user1569251 Aug 01 '12 at 20:42
  • Where did you get this file? The way you are reading it assumes that it is ANSI data encoded for your current system locale and codepage. Have you considered that the actual encoding in the file might be UTF-8? If so then reading it the way you are will be unreliable, working "much of the time" but failing when non-ASCII symbols are encountered. – Bob77 Aug 02 '12 at 01:32
  • I'd also try hard to get out of the habit of using those slow Variant String functions. Doing the job right costs very little effort. – Bob77 Aug 02 '12 at 01:34
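
One way to sidestep the codepage issue raised in the comments is to let MSXML read the file itself, so that it can honour the encoding declared in the XML prolog. The following is a minimal sketch, not the asker's code: it assumes a project reference to Microsoft XML v6.0 and reuses the `input.xml` name from the question.

```vb
' Sketch only: let MSXML read the file and detect the encoding itself,
' instead of reading it into a VB String first.
Dim xmldoc As MSXML2.DOMDocument60

Set xmldoc = New MSXML2.DOMDocument60
xmldoc.async = False            ' load synchronously so parseError is populated on return

If Not xmldoc.Load("input.xml") Then
    ' Load returns False when parsing fails; parseError describes why.
    With xmldoc.parseError
        Debug.Print "Error " & .errorCode & " at line " & .Line & _
                    ", column " & .linepos & ": " & .reason
    End With
End If
```

This does not repair invalid characters; it only removes the ANSI conversion step as a possible source of the problem.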

1 Answer

I wouldn't open the file `For Input`. Instead, I would open it `For Binary`, allocate a buffer with `ReDim abytData(1 To LOF(fno))`, and use `Get #fno, , abytData()` to pull the data into the buffer. This means that VB won't do any processing on the data. You should then use the byte-based "B" versions of the string functions, such as `InStrB()`, to process the data.
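
As a rough sketch of that binary read (the function name `ReadFileBytes` and the `sPath` parameter are made up for illustration; only `Open ... For Binary`, `LOF`, `ReDim`, and `Get` come from the description above):

```vb
' Hypothetical helper: reads a whole file into a byte array with no
' ANSI-to-Unicode conversion and no end-of-file character handling.
Private Function ReadFileBytes(ByVal sPath As String) As Byte()
    Dim fno As Integer
    Dim abytData() As Byte

    fno = FreeFile
    Open sPath For Binary Access Read As #fno
    If LOF(fno) > 0 Then
        ' One array element per byte in the file
        ReDim abytData(1 To LOF(fno))
        Get #fno, , abytData
    End If
    Close #fno

    ' For an empty file the returned array is left unallocated
    ReadFileBytes = abytData
End Function
```

A caller might write `abytData = ReadFileBytes("input.xml")` and then inspect individual bytes by index.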

I would then do as much pre-processing as possible to remove the invalid characters before parsing the XML, rather than relying on the XML parser to find them one at a time, which is inefficient.
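
A pre-processing pass of that kind might look like the sketch below. It rests on assumptions the thread does not confirm: that the offending characters are control bytes (XML 1.0 allows only tab, line feed, and carriage return below `&H20`), and that the file is in a single-byte or UTF-8 encoding, where such byte values never occur inside a multi-byte sequence. The name `ScrubControlBytes` is hypothetical.

```vb
' Hypothetical helper: overwrites control bytes that XML 1.0 forbids
' (everything below &H20 except tab, LF, and CR) with a space, in place.
' Assumes a single-byte or UTF-8 encoding; NOT suitable for UTF-16 data.
' The caller must pass an allocated (non-empty) array.
Private Function ScrubControlBytes(ByRef abytData() As Byte) As Long
    Dim i As Long
    Dim lCount As Long

    For i = LBound(abytData) To UBound(abytData)
        Select Case abytData(i)
            Case 9, 10, 13
                ' Tab, line feed, carriage return: allowed, leave alone
            Case Is < 32
                abytData(i) = 32        ' replace with a space
                lCount = lCount + 1
        End Select
    Next i

    ' Number of bytes that were replaced
    ScrubControlBytes = lCount
End Function
```

Replacing rather than deleting keeps every byte offset unchanged, so positions reported by the parser still line up with the original file.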

Can you give an example of what invalid characters you are finding?

Mark Bertenshaw