
I'm having serious trouble with this and I'm not really experienced enough to understand how I should go about it.

To start off, I have a very long string stored in $VC. It is slightly different each time, but some parts are always the same. $VC has been passed through htmlspecialchars() and looks something like this:

<a href="example.com?continue=pid%3D6057413202557366578%26oid283447094297409">Example Link</a>... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on

In this case the <a> tag is always the same, so I take my information from there. The numbers listed after it, such as ,"3245697351286309258",[] and ,"6057413202557366578",[], will also always be in the same format, just with different numbers, and one of those numbers will always be a specific ID. The ID I want is always the number between pid%3D and %26oid:

$pid = explode("pid%3D", $VC, 2);
$pid = explode("%26oid", $pid[1], 2);
$pid = $pid[0];

In this case that number is 6057413202557366578. Next I want to explode $VC in a way that lets me put everything after ,"6057413202557366578",[] into a variable as its own string.

This is where things start to break down. What I want to do is the following

$vinfo = explode(',"'.$pid.'",[]',$VC,2);
$vinfo = $vinfo[1]; //Everything after the value I used to explode it.

Naturally I looked around and tried other things such as preg_split and preg_replace, but I've got to admit they are beyond me. As far as I can tell, they don't let you put your own variable in the middle of the pattern (e.g. ',"'.$pid.'",[]').

If I'm understanding the regular expression idea correctly, there might be another problem: if I search without the $pid variable (i.e. for just the surrounding characters), it will pick up the similar parts of the string before it gets to the one I want (e.g. the ,"3245697351286309258",[]).

I hope I've explained this well enough. The main question is: how can I get everything after that specific part of the string (',"'.$pid.'",[]') into a variable?

Zei
  • I'm not sure if I understood correctly, but does [this](http://regex101.com/r/vO5kS5/1) do what you want? It captures the ID in the named group `id`, and all text after `"id",[]` in group 2. – Aran-Fey Oct 05 '14 at 13:26
  • @Rawing Hmm I think that seems correct but it's not working with the actual code. [Here's](http://regex101.com/r/eL1rJ6/2) what it looks like with the actual sort of string I'm working with. Uh actually I think the string is too long for the website, here's a [pastebin](http://pastebin.com/qZJaG7yi). – Zei Oct 05 '14 at 13:40
  • That's quite different than the text you posted originally. All that text is inside a ` – Aran-Fey Oct 05 '14 at 14:17

2 Answers


The problem of capturing more than you want is solved with capture groups. You wrap part of a regular expression in parentheses to capture it.

You can use preg_match_all for more robust regular expression capture. It fills an array whose first element holds the strings that matched the entire pattern, followed by one element per capture group holding the partial matches for that group. We'll start by matching the parts of the string you want. There are no capture groups at this point:

$text = '<a href="example.com?continue=pid%3D6057413202557366578%26oid283447094297409">Example Link</a>... Lots of other stuff in between here... 80] ,[] ,"","3245697351286309258",[] ,["812750926... and it goes on ...80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on"';
$pattern = '/,"\\d+",\\[\\]/';
preg_match_all($pattern,
    $text,
    $out, PREG_PATTERN_ORDER);
echo $out[0][0]; // prints ,"3245697351286309258",[]

Now, to get just the pids into a variable, you can add a capture group to your pattern. A capture group is created by adding parentheses:

$text = ...;
$pattern = '/,"(\\d+)",\\[\\]/'; // the \d+ match will be captured
preg_match_all($pattern,
    $text,
    $out, PREG_PATTERN_ORDER);
$pids = $out[1];
echo $pids[0];  // echo 3245697351286309258

Notice the first (and only in this case) capture group is in $out[1] (which is an array). What we have captured is all the digits.

To capture everything else, assuming it is all between square brackets, you can match more and capture it. To address the question, we'll use two capture groups. The first captures the digits and the second captures the square-bracketed block that follows, including the brackets and everything in between:

$text = ...;
$pattern = '/,"(\\d+)",\\[\\] ,(\\[.+?\\])/';
preg_match_all($pattern,
    $text,
    $out, PREG_PATTERN_ORDER);
$pids = $out[1];
$contents = $out[2];
echo $pids[0] . "=" . $contents[0] ."\n"; 
echo $pids[1] . "=". $contents[1];
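
If only the text after one known ID is needed, the ID can be embedded in the pattern itself. The following is a minimal sketch rather than part of the answer above: it assumes `$pid` is extracted with `explode` as in the question, uses `preg_quote()` so that any regex metacharacters in the variable are escaped, and runs on a shortened, illustrative `$text`.

```php
<?php
// Shortened sample modelled on the string in the question (illustrative only).
$text = '<a href="example.com?continue=pid%3D6057413202557366578%26oid283447094297409">Example Link</a>'
      . ' 80] ,[] ,"","3245697351286309258",[] ,["812750926... 80] ,[] ,"","6057413202557366578",[] ,["103279554... and it continues on';

// Extract the ID between pid%3D and %26oid, as in the question.
$pid = explode('%26oid', explode('pid%3D', $text, 2)[1], 2)[0];

// Build the pattern around the variable; preg_quote() escapes regex metacharacters.
// The "s" modifier lets "." match newlines so (.*) runs to the end of the string.
$pattern = '/,"' . preg_quote($pid, '/') . '",\[\](.*)/s';

$vinfo = null;
if (preg_match($pattern, $text, $m)) {
    $vinfo = $m[1]; // everything after ,"<pid>",[]
    echo $vinfo, "\n";
}
```

Because the ID is part of the pattern, the earlier `,"3245697351286309258",[]` entry is skipped and only the text following the wanted ID ends up in `$vinfo`.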
Steve Clanton

I hope this does what you want:

pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);<\/script>

It captures the number after pid%3D in group id, and everything after "id",[] (until the next occurrence of });</script>) in group vinfo.

Here's a demo with shortened text.
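
In PHP, the pattern could be applied roughly as follows. This is a sketch under stated assumptions: the shortened `$VC` is made up for illustration and is assumed to end with `});</script>`, and the `~` delimiter and `s` modifier are choices made here, not something the pattern above specifies.

```php
<?php
// Hypothetical shortened input; the real page is assumed to end the data with "});</script>".
$VC = '<a href="example.com?continue=pid%3D6057413202557366578%26oid283447094297409">Example Link</a>'
    . ' ,"","3245697351286309258",[] ,["812750926"] ,"","6057413202557366578",[] ,["103279554"]});</script>';

// "~" is used as the delimiter so "/" needs no escaping; "s" lets "." span newlines.
$pattern = '~pid%3D(?P<id>\d+).*?"(?P=id)",\[\](?P<vinfo>.*?)}\);</script>~s';

$id = null;
$vinfo = null;
if (preg_match($pattern, $VC, $m)) {
    $id = $m['id'];       // the number after pid%3D
    $vinfo = $m['vinfo']; // text between "id",[] and });</script>
    echo $id, "\n", $vinfo, "\n";
}

// After htmlspecialchars(), " becomes &quot; and < becomes &lt;,
// so the same pattern no longer finds a match on the escaped string.
$escapedMatches = preg_match($pattern, htmlspecialchars($VC));
var_dump($escapedMatches);
```

The last two lines illustrate why the pattern has to run on the raw string: once the quotes and angle brackets are turned into entities, the literal `"` and `</script>` in the pattern have nothing to match.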

Aran-Fey
  • Thank you so much! This works, it took me a little while to realise that it doesn't work if I have VC under the effects of htmlspecialchars (which I shouldn't actually need anyway). Great stuff, really! – Zei Oct 05 '14 at 15:25