
Hi, I am trying to parse an XML file using REXML. When there is an illegal character in my XML file, parsing just fails at that point.

Is there any way to replace or remove these kinds of characters?

REXML fails to parse the following with the error: Illegal character '&' in raw string

<head> Negative test for underlying BJSPRICEENG N4&N5
</head>


doc = REXML::Document.new(File.open(file_name,"r:iso-8859-1:utf-8"))

testfile.elements["head"].text





doc = REXML::Document.new(content)
dir_path = doc.elements["TestBed/TestDir"].attributes["path"].to_s
doc.elements.each("TestBed/TestDir") do |directory|
  directory.elements.each("file") do |testfile|
    t = testfile.elements["head"].text
  end
end




<file name="toptstocksensbybjs.m">
  <MCheck></MCheck>
  <TestExtension></TestExtension>
  <TestType></TestType>

  <fcn name="lvlTwoDocExample" linenumber="20">
    <head> P1><&
</head>
  </fcn>
</file>
Vinay

1 Answer


In your case, to escape the illegal & characters (rather than remove them), you could try:

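# Read the file as ISO-8859-1 and transcode it to UTF-8; File.read also closes the file.
content = File.read(file_name, mode: "r:iso-8859-1:utf-8")
# Escape every & that does not already start one of the five predefined XML entities.
content.gsub!(/&(?!(?:amp|lt|gt|quot|apos);)/, '&amp;')
doc = REXML::Document.new(content)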

However, the other illegal characters, especially unpaired <, >, ' or ", are much more difficult to handle.
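One gap in the regular expression above: it would also escape numeric character references such as &#38; or &#x26;, which are legal XML. A minimal sketch that leaves those alone, assuming the input contains no CDATA sections or comments with a raw & in them (the constant name, the helper name `escape_bare_ampersands`, and the sample string are made up for illustration):

```ruby
require "rexml/document"

# Matches an "&" that does NOT start a predefined entity (amp, lt, gt, quot, apos)
# or a numeric character reference such as &#38; or &#x26;.
BARE_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/

# Hypothetical helper: returns a copy of the string with bare ampersands escaped.
def escape_bare_ampersands(xml)
  xml.gsub(BARE_AMPERSAND, "&amp;")
end

# Sample based on the <head> element from the question, with two legal
# references appended to show that they are left untouched.
sample = "<head> Negative test for underlying BJSPRICEENG N4&N5 &lt; &#38;\n</head>"
fixed = escape_bare_ampersands(sample)

# REXML parses the escaped string without raising.
doc = REXML::Document.new(fixed)
puts fixed
```

Only the & in N4&N5 is rewritten here; &lt; and &#38; pass through unchanged.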

Arie Xiao
  • @samuil XML predefines only these 5 entities, unlike HTML. – Arie Xiao Jun 21 '13 at 14:34
  • @ArieShaw Could you please explain what exactly happens here? Will just the & be replaced by &amp;? And what about the other characters > < ' " within the string? – Vinay Jun 21 '13 at 14:51
  • @Vinay The regular expression detects illegal `&` character(s) only. An `&` is legal if it is followed by `amp;`(&), `lt;`(<), `gt;`(>), `quot;`(") or `apos;`('). The negative lookahead `(?!PATTERN)` excludes those valid `&`, so the expression matches only an `&` that isn't followed by `amp;`, `lt;` and so on. Note that a (negative) lookahead group is a zero-length assertion that does not consume characters. – Arie Xiao Jun 21 '13 at 15:00
  • @Vinay It will be difficult, or maybe impossible, to detect illegal `<`, `>`, `'`, or `"`. However, in some special cases this might be possible. – Arie Xiao Jun 21 '13 at 15:01
  • @ArieShaw Thanks so much for the explanation :) But is it possible to replace > or < at the step where I extract the head tag, i.e. at testfile.elements["head"].text, maybe using REXML::Text.new(string, false, nil, false)? – Vinay Jun 21 '13 at 15:06
  • I have updated my post with the code, please have a look at it now. What I am trying to ask is: is there any way to replace the < or > while fetching the head tag at this line --> t = testfile.elements["head"].text, like x = REXML::Text.new(t, false, nil, false)? – Vinay Jun 21 '13 at 15:33
  • @Vinay So, you still didn't paste the relevant part of the XML file. How would I know what `directory` and `testfile` are? – Arie Xiao Jun 21 '13 at 17:09
  • @ArieShaw Sorry, I have updated my post with the content of the XML that I am trying to parse. Please check it now! – Vinay Jun 21 '13 at 17:33
  • @Vinay As I have said, it's difficult or impossible. If you have a special case, find the pattern of the illegal `<` and `>`. – Arie Xiao Jun 21 '13 at 18:23
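As one possible instance of such a "special case", the illegal characters in the question all sit inside `<head>` elements. A rough sketch of escaping them before REXML sees the string, under loud assumptions that are not confirmed by the question: `<head>` has no attributes, holds only plain text (never child elements), and the literal text `</head>` never occurs inside it. The helper name `escape_head_text` is hypothetical.

```ruby
require "rexml/document"

# ASSUMPTIONS: <head> has no attributes, contains only plain text, and the
# literal string "</head>" never appears inside that text.
HEAD_ELEMENT = %r{(<head>)(.*?)(</head>)}m

# Hypothetical helper: escapes &, < and > inside every <head>...</head> body.
def escape_head_text(xml)
  xml.gsub(HEAD_ELEMENT) do
    # Copy the captures first; the nested gsub calls below overwrite $1..$3.
    open_tag, body, close_tag = $1, $2, $3
    escaped = body
      .gsub(/&(?!(?:amp|lt|gt|quot|apos);)/, "&amp;") # bare ampersands first
      .gsub("<", "&lt;")
      .gsub(">", "&gt;")
    "#{open_tag}#{escaped}#{close_tag}"
  end
end

# Fragment taken from the XML in the question.
raw = "<fcn name=\"lvlTwoDocExample\" linenumber=\"20\">\n <head> P1><&\n</head>\n</fcn>"
fixed = escape_head_text(raw)

doc = REXML::Document.new(fixed)
head_text = doc.root.elements["head"].text # REXML unescapes the entities again
puts head_text
```

The escaping has to happen on the raw string before parsing, because REXML rejects the document before any element text can be fetched. If the real files break any of the assumptions above, this approach will not work as written.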