
I want to extract the words between ";" and ":" from an XML file, for example the word "Index" here:

bla bla bla ; Index : bla bla

The file is loaded from its URL using file_get_contents:

$output = file_get_contents("https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es");

preg_match_all('/\;.[a-zA-Z]+.\:/', $output, $matches, PREG_SET_ORDER, 0);
var_dump($matches);

The regex works fine on the same file content in regex101, and also when I copy the text into a string variable. But the code above does not work: it returns only the last match.

What am I doing wrong?

PS: I also tried loading the XML file using DOMDocument, with the same result.

lady_OC

1 Answer

Here is a way to do it with a low memory footprint. Several considerations:

  • the file is big (not enormous but big).
  • the fact that you are dealing with an XML file isn't very important in this case, since the text you are looking for follows its own line-based format (the wiki markup for definition lists), which is independent of the XML format. However, if you absolutely want to use an XML parser here to extract the content of the text tag, I suggest using XMLReader instead of DOMDocument.
  • the lines you are looking for are always single lines, start with ; (without indentation), and are always immediately followed by a line that starts with :.
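
To make the last point concrete, here is a minimal sketch on a made-up sample string (the sample text and the pattern are illustrative assumptions, not taken from the exported page). It uses a multiline pattern that only accepts a line starting with ; when the very next line starts with ::

```php
<?php
// Hypothetical sample imitating the line-based format described above.
$sample = "Intro text ; not a term : ignored\n"
        . "; Index\n"
        . ": structure that speeds up lookups\n"
        . "; Orphan term\n"
        . "Some other line\n"
        . "; Cl\u{E9} primaire\n"
        . ": identifies a row\n";

// m: ^ matches at the start of every line; u: treat subject and pattern as UTF-8.
// Group 1 captures the term; \R: requires the next line to start with ':'.
preg_match_all('~^;\h*(.+?)\h*\R:~mu', $sample, $matches);

$terms = $matches[1];
echo implode(', ', $terms), PHP_EOL; // prints "Index, Clé primaire"
```

The first sample line contains both ; and : but is skipped because it does not start with ;, and "Orphan term" is skipped because the following line does not start with :.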

Once you see that (right click, view source), you can read the file line by line (instead of loading the whole file with file_get_contents) and use a generator function to select the interesting lines:

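$url = 'https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Exporter/Base_de_donn%C3%A9es';

$handle = fopen($url, 'rb');
if ($handle === false) {
    exit("Unable to open $url" . PHP_EOL);
}

// Yield each line that starts with ';' and is immediately followed
// by a line that starts with ':'.
function filterLines($handle) {
    $temp = false; // pending ';' line, or false when there is none

    // fgets() returns false at end of file, so no separate feof() test is needed
    while (($line = fgets($handle)) !== false) {
        if ($line[0] === ';') {
            $temp = $line;
            continue;
        }
        if ($line[0] === ':' && $temp !== false) {
            yield $temp;
        }
        $temp = false;
    }
}

foreach (filterLines($handle) as $line) {
    // Runs of Latin-script letters, allowing single spaces inside a term
    if (preg_match_all('~\b\p{Latin}+(?: \p{Latin}+)*\b~u', $line, $matches)) {
        echo implode(', ', $matches[0]), PHP_EOL;
    }
}

fclose($handle);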
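
If you do go the XMLReader route mentioned in the considerations, a minimal sketch could look like the following. The sample XML is a simplified, hypothetical stand-in for the export (the real file has more elements), and termsFromXml is a name made up for this example. Note that readString() returns the whole content of the text element as one string, so this variant probably uses more memory than the fgets loop.

```php
<?php
// Hypothetical, simplified stand-in for the exported page.
$xml = <<<'XML'
<mediawiki>
  <page>
    <title>Sample</title>
    <revision>
      <text>Some intro
; Index
: speeds up lookups
; Orphan
plain line
; Vue
: stored query</text>
    </revision>
  </page>
</mediawiki>
XML;

// Illustrative helper (not part of any library): collect the terms found
// on ';' lines that are immediately followed by a ':' line.
function termsFromXml(string $xml): array {
    $reader = new XMLReader();
    $reader->XML($xml); // for a file or URL, $reader->open($uri) would be used instead
    $terms = [];

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT || $reader->localName !== 'text') {
            continue;
        }
        $pending = false; // term from the previous ';' line, if any
        foreach (preg_split('~\R~', $reader->readString()) as $line) {
            if (strncmp($line, ';', 1) === 0) {
                $pending = trim(substr($line, 1));
                continue;
            }
            if (strncmp($line, ':', 1) === 0 && $pending !== false) {
                $terms[] = $pending;
            }
            $pending = false;
        }
    }

    $reader->close();
    return $terms;
}

echo implode(', ', termsFromXml($xml)), PHP_EOL;
```

The line filtering is the same idea as in filterLines above, only applied to the text extracted by the parser rather than to the raw stream.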
Casimir et Hippolyte
  • "the lines you are looking for are always single lines, start with ; (without indentation) and are always immediately followed by : on the next line." - this doesn't appear to be correct. – Scott Weaver Jun 03 '17 at 18:34
  • @sweaver2112: you are not looking at the source code (what you see is the XML rendered with your browser's default style): right click and display the source code. – Casimir et Hippolyte Jun 03 '17 at 18:36
  • but even with a start-of-line anchor, all patterns I've tried timeout on regex101.com with python (all others work just fine). gotta be a bug in `re` here ?? – Scott Weaver Jun 03 '17 at 18:52
  • I don't know, I haven't seen your patterns, note also that the string is very large. This one works with pcre: https://regex101.com/r/CQy2wj/1/ – Casimir et Hippolyte Jun 03 '17 at 19:04
  • I don't have to use an XML parser; I just mentioned that I tried it, since I couldn't figure out the problem. Thank you so much for the clear answer :) it works fine. – lady_OC Jun 03 '17 at 21:33