
I'm looking for a regex that can help me parse an nquad file. An nquad file is a plain text file where each line represents a quad (s, p, o, c):

<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .
<http://mysubject> <http://mypredicate2> <http://myobject2> <http://mycontext> .

The objects can also be literals (instead of URIs), in which case they are enclosed in double quotes:

<http://mysubject> <http://mypredicate> "My object" <http://mycontext> .

I'm looking for a regex that, given one line of this file, will give me back a PHP array in the following format:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "http://myobject"
[3] => "http://mycontext"

...or in the case where the double quotes are used for the object:

[0] => "http://mysubject"
[1] => "http://mypredicate"
[2] => "My object"
[3] => "http://mycontext"

One final thing: in an ideal world, the regex will cater for the scenario where there may be one or more spaces between the various components, e.g.

<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .
robotrobot
  • I've added an answer that uses only a regex and `explode` to extract the necessary strings - http://stackoverflow.com/questions/7976411/regex-in-php-to-extract-components-of-nquad/7976708#7976708 – nickb Nov 02 '11 at 10:35

3 Answers


It seems this can be accomplished as follows (I do not know your character restrictions, so it may not fit your exact needs, but it worked for your test cases):

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

// Replace the whitespace between entries with a delimiter (change $line to $line2 for testing)
$delimiter = '---';
$result = preg_replace('/([">])\s+(["<])/', '$1' . $delimiter . '$2', $line);

// Explode on our delimiter
$array = explode($delimiter, $result);
foreach ($array as &$a)
{
    // Replace the characters we don't want with nothing
    // (note: this also strips these characters from inside the values)
    $a = str_replace(array('<', '.', '>', '"'), '', $a);
}
unset($a); // break the reference left over from the foreach

var_dump($array);
nickb
  • Hmm, it's possible that my literals may have '.' in, e.g. - will your assignment to $a using the str_replace replace these with nothing? – robotrobot Nov 02 '11 at 08:40
  • Yes, which is why I was not happy with this method, and formed a regex to do the entire thing in my other answer. – nickb Nov 02 '11 at 10:19
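The concern raised in these comments (dots inside a value being removed) could be avoided by stripping characters only from the ends of each piece. The following is a minimal sketch, not the answerer's code: the function name `splitQuadKeepingDots` and the `"\x1F"` delimiter are assumptions made for illustration, and it assumes a literal never contains a `>` or `"` followed by whitespace and then `<` or `"`. It uses an arrow function, so it needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper: same delimiter idea as the answer above, but only the
// wrapping characters are removed, so dots inside URIs and literals survive.
function splitQuadKeepingDots(string $line): array
{
    $delimiter = "\x1F"; // assumed never to occur in the input

    // Drop the trailing " ." statement terminator, if present.
    $line = preg_replace('/\s*\.\s*$/', '', trim($line));

    // Mark the whitespace gaps between a closing > or " and an opening < or ".
    $marked = preg_replace('/([">])\s+(["<])/', '$1' . $delimiter . '$2', $line);

    // trim() with a character list strips those characters from the ends only.
    return array_map(
        fn(string $part): string => trim($part, '<>"'),
        explode($delimiter, $marked)
    );
}

var_dump(splitQuadKeepingDots('<http://example.org/s> <http://example.org/p> "Version 1.5 released." <http://example.org/c> .'));
```

Because `trim()` only touches the start and end of each piece, the dots in `example.org` and in the literal are left alone.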

I'm going to add another answer as an additional solution using only a regex and explode:

$line = "<http://mysubject> <http://mypredicate> <http://myobject> <http://mycontext>";
$line2 = '<http://mysubject> <http://mypredicate> "My object" <http://mycontext>';

$delimiter = '---'; // Can't use space
$result = preg_replace('/<([^>]*)>\s+<([^>]*)>\s+["<]([^">]*)[">]\s+<([^>]*)>/', '$1' . $delimiter . '$2' . $delimiter . '$3' . $delimiter . '$4', $line);
$array = explode($delimiter, $result);
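As a possible variation, the same kind of pattern could be used with `preg_match`, which returns the captured groups directly and so avoids the intermediate delimiter. This is a sketch under a few assumptions: the function name `parseQuadLine` is hypothetical, the pattern is anchored and tolerates an optional trailing ` .`, and literals are assumed not to contain escaped double quotes.

```php
<?php
// Hypothetical helper: returns [s, p, o, c], or null when the line does not match.
function parseQuadLine(string $line): ?array
{
    // Subject and predicate are <...>; the object is either <...> or "...";
    // the context is <...>; the trailing " ." terminator is optional here.
    $pattern = '/^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s+<([^>]*)>\s*\.?\s*$/';

    if (!preg_match($pattern, $line, $m)) {
        return null;
    }

    // Group 3 holds a URI object and group 4 a quoted literal. The branch
    // that did not match is reported as an empty string.
    $object = $m[3] !== '' ? $m[3] : $m[4];

    return [$m[1], $m[2], $object, $m[5]];
}

var_dump(parseQuadLine('<http://mysubject>     <http://mypredicate>  "My object"       <http://mycontext> .'));
```

Returning `null` for a non-matching line lets the caller decide whether to skip or report malformed input.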
nickb

This regular expression would help:

/(\S+?)\s+(\S+?)\s+(\S+?)\s+(\S+?)\s+\./

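The (s, p, o, c) values will be in capture groups 1 to 4 (for example, $matches[1] to $matches[4] when used with preg_match), with the angle brackets still attached. Note that \S+ cannot span spaces, so a quoted literal containing a space, such as "My object", is not captured as a single value.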
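One way to keep the token-by-token idea while coping with literals that contain spaces is to match each component as either `<...>` or `"..."` instead of `\S+`. The sketch below is an illustration only: the function name `tokenizeQuad` is hypothetical, and it assumes literals contain no escaped double quotes and that no stray `<` or `"` appears outside a component.

```php
<?php
// Hypothetical helper: collects the contents of every <...> or "..." token.
function tokenizeQuad(string $line): array
{
    // Group 1 captures the inside of <...>, group 2 the inside of "...".
    preg_match_all('/<([^>]*)>|"([^"]*)"/', $line, $matches, PREG_SET_ORDER);

    $values = [];
    foreach ($matches as $m) {
        // Prefer group 2 when the quoted branch matched, else fall back to group 1.
        $values[] = ($m[2] ?? '') !== '' ? $m[2] : $m[1];
    }

    return $values;
}

var_dump(tokenizeQuad('<http://mysubject>  <http://mypredicate> "My object"   <http://mycontext> .'));
```

The whitespace between components and the trailing ` .` are simply skipped, so any number of spaces between tokens is tolerated. Unlike an anchored pattern, this does not check that a line has exactly four components, so the caller may want to verify `count()` on the result.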

Aziz Shaikh