I can't import my big XML file (1.5 GB) into a database. When I use XMLReader->read(), I get an error wherever an element contains an ampersand. Can you help me convert the invalid XML file into a valid one?

I tried tidy, xmllint (from xmlsoft) and sed on Windows 7, but these command-line tools fail with an out-of-memory error.

PHP:

$reader = new XMLReader();
$reader->open('sm.xml');

// Child elements of <article> whose text content is collected.
$fields = array('brand', 'number', 'descr', 'price', 'deadline', 'rest');
$data = array();

while ($reader->read()) {
    // Check that the node is an element, not an attribute or #text.
    if ($reader->nodeType == XMLReader::ELEMENT) {
        if (in_array($reader->localName, $fields)) {
            $field = $reader->localName;
            // Advance to the text node inside the element and keep its value.
            $reader->read();
            $data[$field] = $reader->value;
        }
    } elseif ($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'article') {
        // The closing </article> tag is reached: insert the collected row.
        $sql = 'INSERT INTO tec (brand_name,brand_art,name_tov,cena,srok,kolvo)
  VALUES ("'.$data['brand'].'","'.$data['number'].'","'.$data['descr'].'","'.$data['price'].'","'.$data['deadline'].'","'.$data['rest'].'");';
        $mysqli->query($sql);

        // Insert the content of array $data to database or some other action.
        //print_r($data);
    }
}

When this code reads an element such as <number>111&111</number>, I get an error. I could remove the ampersand with a command-line tool, but those tools run out of memory on a very big XML file.

The commands I tried:

xmllint.exe --recover --maxmem 10000000000 --noout --encode utf8 sm.xml -o smtt.xml
tidy.exe -m -utf8 -xml sm.xml
sed.exe 's/&/\&amp;/g; s/&amp;amp;/\&amp;/g; s/&amp;quot;/\&quot;/g;' sm.xml > smtt.xml <-- can't run

Is there perhaps another way, such as using PHP's XMLReader with validation skipped?

Mike
  • Can we see your PHP code as well? Is the problem that you have an ampersand, or are you running out of memory? What is the exact error you get? – halfer May 05 '13 at 19:49
  • The problem is the ampersand: can I escape or skip this character? The out-of-memory error occurs when I try to repair the big XML file to prepare it for PHP's XMLReader. The error is a PHP warning: XMLReader->read() 111&111 – Mike May 05 '13 at 19:55
  • Right, so you have two errors: one is that if you try to fix the invalid XML, you run out of memory, and the second is that if you don't fix the invalid XML, you get a reader error. The second one is expected, so you should try to fix your XML. 1. **Exactly** which command line utility ran out of memory? 2. Can you replace that with another utility? – halfer May 05 '13 at 21:46
  • xmllint and tidy ran out of memory; sed.exe did not run, although I have since found an option to start it. I have not found another utility. I thought these Unix ports would help me, but two of them run out of memory. If no other way to solve my problem turns up, I will write my own "XML reader" that reads the file line by line. – Mike May 05 '13 at 21:56
  • Who generated this bad XML? How bad is it: do you even know? Is fixing 1.5 GB of bad data really going to be easier than fixing the program that generated it? – Michael Kay May 05 '13 at 23:20
  • Maybe you could replace `xmllint` and `tidy` with a PHP script instead, by reading a bit of the file in, repairing that, writing the data out to a new file, etc. until the whole file is processed. Then that file will work fine with your XMLReader code. I imagine these two utilities are running out of memory because they are trying to load the whole file into RAM, plus a large working space to model the XML document. – halfer May 05 '13 at 23:30
  • halfer, Michael Kay, thank you for your support. I contacted the developer, who said the file will be validated, which may spare me any additional file processing. I hope that by tomorrow everything will be all right. Once again, thank you for your recommendations. I think my question can be considered closed and resolved. – Mike May 06 '13 at 12:44
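
The chunked repair suggested in the comments above could look roughly like the sketch below. It is not code from the thread: the function names escapeBareAmpersands and repairXmlFile and the 1 MB default chunk size are made up for illustration. It assumes PHP 7.1 or later, a UTF-8 (or other ASCII-compatible) file, and that the only defect is bare ampersands; it does not treat CDATA sections or comments specially, so a legal "&" inside those would be escaped too.

```php
<?php
// Escape every "&" that does not start one of the references XML allows
// without a DTD: &amp; &lt; &gt; &quot; &apos; and numeric references.
function escapeBareAmpersands(string $text): string
{
    return preg_replace(
        '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $text
    );
}

// Copy $source to $target in fixed-size chunks so memory use stays small.
function repairXmlFile(string $source, string $target, int $chunkSize = 1048576): void
{
    $in = fopen($source, 'rb');
    $out = fopen($target, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or output file');
    }

    $carry = '';
    while (!feof($in)) {
        $chunk = fread($in, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $buffer = $carry . $chunk;
        $carry = '';

        // A "&" close to the end with no ";" after it may be a reference
        // split across two chunks: hold it back and retry with the next chunk.
        $pos = strrpos($buffer, '&');
        if (!feof($in) && $pos !== false
            && strlen($buffer) - $pos < 12
            && strpos($buffer, ';', $pos) === false) {
            $carry = substr($buffer, $pos);
            $buffer = substr($buffer, 0, $pos);
        }
        fwrite($out, escapeBareAmpersands($buffer));
    }
    // Flush anything still held back (only possible after a read failure).
    fwrite($out, escapeBareAmpersands($carry));
    fclose($in);
    fclose($out);
}
```

Calling repairXmlFile('sm.xml', 'smtt.xml') and then opening smtt.xml with XMLReader would be the intended use. Because each chunk is written out before the next one is read, memory use depends on the chunk size rather than on the size of the file.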

1 Answer

XMLMax editor (from xponentsoftware) will locate the error and allow you to fix it in its virtual text editor. 1.5 GB should be no problem.

Disclaimer: I am affiliated with the vendor.

user204427