PHP: Scraping links and text output to file

Question

I have some video feeds available on a website, which I would like to open in XBMC, but can't.

So I was thinking of scraping the links and channel names and writing them to files that my media center can open (one file per channel). It must be done on a small Linux box, and since I know neither Bash nor Python but do know some PHP (not much), I figured I'd use PHP for the task. But I've run into some problems with the regex and the output from PHP.

The website containing the feeds looks something like this:

... Lots of HTML before this part

<a href="javascript:changeChannel('http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel</a><br>
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel</a><br>
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/3.abcdefg.m3u8'', 3);">Third Channel</a><br>

.... //  More channels and other html below here..

What I want to extract is the link URL and the link text:

Ex: http://live.provider.com/something/something_else/1.abcdefg.m3u8

Ex: First Channel

etc.

Currently my code looks like this:

<?php
$streamSite = "http://link.to/feed-website.html";

function writeFile($url, $channel) {
        $File = $channel.".strm";
        $Handle = fopen($File, 'w');
        fwrite($Handle, $url);
        fclose($Handle);
}

  // Report the URL that was actually requested ($url is not defined in this scope).
  $input = @file_get_contents($streamSite) or die("Could not access file: $streamSite");
  $regexp = "(((f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=]+)";

  if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        echo serialize($match);
        echo "\r\n";
    }
    unset($match);
  }
?>

The current regex is supposed to scrape only the URL. I've tested it on http://regexr.com/ and it works there.

At the moment I'm just printing the result to console.

The current output looks like this:

a:3:{i:0;s:97:"http://live.provider.com/something/something_else/1.abcdefg.m3u8";i:1;s:7:"http://";i:2;s:2:"ht";}
a:3:{i:0;s:97:"http://live.provider.com/something/something_else/2.abcdefg.m3u8";i:1;s:7:"http://";i:2;s:2:"ht";}
a:3:{i:0;s:97:"http://live.provider.com/something/something_else/3.abcdefg.m3u8";i:1;s:7:"http://";i:2;s:2:"ht";}

I can't figure out where the text before and after the links comes from. Is it my serializing that fails or is it the regex?

Could you help me with the regex, so I can scrape the URL and the text and put them into an array which I can loop through afterwards, writing the content to a .strm file using the function I've written?

Thanks in advance!

score 0 · Accepted Answer · edited May 23 '17 at 12:11


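In PHP regular expressions, '()' are capturing groups. Each one stores the sub-part of the text that it matched, in addition to the text matched by the entire regular expression. In contrast, non-capturing groups, written "(?:)", group a part of the pattern without storing anything. That is where the extra "http://" and "ht" entries in your output come from.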
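To make the difference concrete, here is a minimal sketch comparing a capturing and a non-capturing version of the URL pattern. It assumes a sample string taken from the question's example data and uses "/" as the delimiter; the variable names are illustrative only.

```php
<?php
// Sample text based on the example data in the question.
$text = "changeChannel('http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);";

// Capturing groups: index 0 is the full match, index 1 is "http://", index 2 is "ht".
preg_match('/((f|ht)tp:\/\/)[-a-zA-Z0-9@:%_+.~#?&\/=]+/', $text, $capturing);

// Non-capturing groups: the same groups written as (?:...) store nothing,
// so only index 0 (the full match) is present.
preg_match('/(?:(?:f|ht)tp:\/\/)[-a-zA-Z0-9@:%_+.~#?&\/=]+/', $text, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```

Both patterns match the same URL; only the number of stored sub-matches differs.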

In this case, capturing groups can be used to fetch the URL and the link text separately, even though the pattern has to match the whole text between them. This should work for scraping both:

<?php
$regexp = "/((?:(?:f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=]+).*?>(.*?)</";
if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        var_dump($match);
        echo "\r\n";
    }
    unset($match);
}
/*
    For the present set of inputs, the output is- 
    array
      0 => string 'http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel<' (length=86)
      1 => string 'http://live.provider.com/something/something_else/1.abcdefg.m3u8' (length=64)
      2 => string 'First Channel' (length=13)
    array
      0 => string 'http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel<' (length=87)
      1 => string 'http://live.provider.com/something/something_else/2.abcdefg.m3u8' (length=64)
      2 => string 'Second Channel' (length=14)
    array
      0 => string 'http://live.provider.com/something/something_else/3.abcdefg.m3u8'', 3);">Third Channel<' (length=87)
      1 => string 'http://live.provider.com/something/something_else/3.abcdefg.m3u8' (length=64)
      2 => string 'Third Channel' (length=13)

*/
?>

Here, in each $match, index 0 holds the whole matched string, index 1 holds only the URL, and index 2 holds only the link text.
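The question also asks how to loop over the results and write one .strm file per channel. The following is a rough sketch of that step, not a definitive implementation. It assumes the sample HTML from the question in place of the downloaded page, a temporary output directory, and a hypothetical helper strmFileName() that is not part of the original code.

```php
<?php
// Hypothetical helper: builds a filesystem-safe .strm file name from a channel name.
function strmFileName(string $channel): string {
    // Replace anything other than letters, digits, space, underscore or dash.
    $safe = preg_replace('/[^A-Za-z0-9 _-]+/', '_', trim($channel));
    return $safe . '.strm';
}

// Sample input from the question (stands in for file_get_contents($streamSite)).
$input = <<<'HTML'
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel</a><br>
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel</a><br>
HTML;

$regexp = '/((?:(?:f|ht)tp:\/\/)[-a-zA-Z0-9@:%_+.~#?&\/=]+).*?>(.*?)</';

// Assumed output directory; adjust to wherever the media center reads its files.
$outDir = sys_get_temp_dir() . '/strm_demo';
if (!is_dir($outDir)) {
    mkdir($outDir, 0777, true);
}

$channels = [];
if (preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[1] is the URL, $match[2] is the link text.
        $channels[$match[2]] = $match[1];
        file_put_contents($outDir . '/' . strmFileName($match[2]), $match[1]);
    }
}

print_r($channels);
```

Keeping the name-to-URL pairs in $channels makes it possible to post-process individual feeds before writing them, if some of them need adjusting.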

edited May 23 '17 at 12:11 by Community

answered Feb 02 '14 at 01:01 by Kamehameha

Fantastic! I've the script working using your code. Thank you very much! Could you help me with one last thing? Please edit the regex to only allow links with "m3u8" in the link – RazziaDK Feb 02 '14 at 03:04
I think this regex would work - **"/((?:(?:f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=]+.m3u8).*?>(.*?)"** – Kamehameha Feb 02 '14 at 03:21
It did! Thank you very much. Found out I need to modify some of the feeds for them to work in XBMC, but after playing around with stristr_array, strstr and some regex I finally have everything working :) Thanks again! – RazziaDK Feb 02 '14 at 05:40
Kamehameha: Can I bother you again? My provider changed the links - now they contain parentheses. Like this javascript:changeChannel('http://bs-live.provider.com/something/something_else/Channel(C9)/Stream(6)/index.m3u8', 54); – RazziaDK Apr 13 '14 at 00:03
@RazziaDK hi, Try this - `((?:(?:f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=\(\)]+).*?>(.*?)<`. (I've added \(\) to the list of characters allowed in a url.) This still assumes that the url bs-live.provider... is preceded by http or ftp... – Kamehameha Apr 13 '14 at 10:30
Sorry about the slow response, but fantastic! It works again, you sir are a true regex wizard :) – RazziaDK May 02 '14 at 21:42
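The two follow-up requests in the comments (only links ending in m3u8, and URLs containing parentheses) can be combined in one pattern. The sketch below is one possible way to do so, under some assumptions: the channel name "Some Channel" and the second, non-stream link are made up for the test input, and the dot before m3u8 is escaped so that it matches a literal "." rather than any character.

```php
<?php
// Test input: the first link is the new URL format from the comments,
// the second link is an invented non-stream link that should be skipped.
$input = <<<'HTML'
<a href="javascript:changeChannel('http://bs-live.provider.com/something/something_else/Channel(C9)/Stream(6)/index.m3u8', 54);">Some Channel</a><br>
<a href="http://example.com/about.html">About</a><br>
HTML;

// "(" and ")" are allowed inside the URL, and the URL must end in a literal ".m3u8".
$regexp = '/((?:f|ht)tp:\/\/[-a-zA-Z0-9@:%_+.~#?&\/=()]+\.m3u8).*?>(.*?)</';

preg_match_all($regexp, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[1] is the stream URL, $match[2] is the channel name.
    echo $match[2] . ' => ' . $match[1] . "\n";
}
```

Because quotes are not in the URL character class, a non-matching link cannot bleed into the next one, so links without ".m3u8" are simply skipped.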

score 0 · Answer 2 · answered Feb 02 '14 at 01:55

The following regex pulls pertinent info out of <a> elements that have href="javascript:changeChannel as in your example data:

~(?<=<a href="javascript:changeChannel\(')([^']+)',\s(\d+)\);">(.+?)</a>~

So do:

$str = <<<STR
  <a href="javascript:changeChannel('http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel</a><br>
  <a href="javascript:changeChannel('http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel</a><br>
  <a href="javascript:changeChannel('http://live.provider.com/something/something_else/3.abcdefg.m3u8', 3);">Third Channel</a><br>
STR;

$regex = <<<REGEX
  ~(?<=<a href="javascript:changeChannel\(')([^']+)',\s(\d+)\);">(.+?)</a>~
REGEX;

preg_match_all($regex, $str, $matches);

echo '<pre>' . print_r($matches, true) . '</pre>';

OUTPUT

Array
(
    [0] => Array
        (
            [0] => http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel
            [1] => http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel
            [2] => http://live.provider.com/something/something_else/3.abcdefg.m3u8', 3);">Third Channel
        )

    [1] => Array
        (
            [0] => http://live.provider.com/something/something_else/1.abcdefg.m3u8
            [1] => http://live.provider.com/something/something_else/2.abcdefg.m3u8
            [2] => http://live.provider.com/something/something_else/3.abcdefg.m3u8
        )

    [2] => Array
        (
            [0] => 1
            [1] => 2
            [2] => 3
        )

    [3] => Array
        (
            [0] => First Channel
            [1] => Second Channel
            [2] => Third Channel
        )

)

Hopefully it's what you're looking for :)
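If one record per channel is more convenient than the column-wise arrays shown in the output, the same pattern can be used with PREG_SET_ORDER and named groups. This is only a sketch: the group names url, id and name are illustrative and not part of the pattern above, and the input is a shortened copy of the example data.

```php
<?php
$str = <<<'STR'
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/1.abcdefg.m3u8', 1);">First Channel</a><br>
<a href="javascript:changeChannel('http://live.provider.com/something/something_else/2.abcdefg.m3u8', 2);">Second Channel</a><br>
STR;

// Same pattern as above, with the three groups given names.
$regex = <<<'REGEX'
~(?<=<a href="javascript:changeChannel\(')(?<url>[^']+)',\s(?<id>\d+)\);">(?<name>.+?)</a>~
REGEX;

$channels = [];
if (preg_match_all($regex, $str, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Each $m describes one link; named groups are available by key.
        $channels[] = [
            'id'   => (int) $m['id'],
            'name' => $m['name'],
            'url'  => $m['url'],
        ];
    }
}

print_r($channels);
```

Each entry of $channels can then be passed directly to a function such as writeFile($url, $channel) from the question.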
