I am using PHP 7.4.1.

I am trying to parse an RSS feed from Google.

My links look like the following:

https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm
https://www.google.com/url?rct=j&sa=t&url=https://www.politifact.com/factchecks/2020/oct/31/raphael-warnock/fact-checking-raphael-warnocks-claim-georgia-sen-k/&ct=ga&cd=CAIyGm
https://www.google.com/url?rct=j&sa=t&url=https://www.benzinga.com/news/20/10/18156683/last-weeks-notable-insider-buys-ibm-intel-raytheon-and-more&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-avino-silver-gold-mines-ltd-nyseasm-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5Y
https://www.google.com/url?rct=j&sa=t&url=https://www.businessinsider.co.za/who-received-an-sms-from-markus-jooste-2020-10&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&am
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veritone-inc-nasdaqveri-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2M
https://www.google.com/url?rct=j&sa=t&url=https://heavy.com/sports/las-vegas-raiders/jj-watt-stephon-gilmore-trade-targets/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MDY6Y29tOmVuOlVT&a
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-truecar-inc-nasdaqtrue-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU3MWY2MD
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-veeco-instruments-inc-nasdaqveco-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRl
https://www.google.com/url?rct=j&sa=t&url=https://stocksregister.com/2020/10/31/insider-trading-at-21vianet-group-inc-nasdaqvnet-what-did-we-note/&ct=ga&cd=CAIyGmM3Yjk5YjRlYWU

I would like to get the real link from url= and cut out the end /&ct=ga&cd=CAIyGjRm.

I tried str_replace; however, stripping the end is difficult because it differs from link to link.

Any suggestions on how to get just the link?

mickmackusa
Carol.Kar

2 Answers

3

Regex is appropriate when there isn't a legitimate / native / reliable technique to parse text.

PHP offers native functions to parse urls and query strings.

The following snippet chains several native functions and WILL run slower than regex, BUT it is far, far less likely to break when your external data source reconfigures its querystring data. For instance, if they add another parameter such as rawurl=, a regex is prone to matching the wrong one. Legitimate parsing technique versus regex (on urls, valid html, bbcode, etc.) is an all-too-common debate, but a developer's primary goal should always be data integrity. Only entertain sacrificing data integrity for execution speed if you are processing inordinately huge volumes of data and the speed boost actually provides a valuable benefit for your system / end users. If you find yourself leaning toward the micro-optimized solution without a sound reason, I'll advise that you not drink that kool-aid.

This is one way that a url can be parsed and the url value extracted.

Code: (Demo)

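$url = 'https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm';

parse_str(
    // decode HTML entities (e.g. &amp; => &) so the pairs split correctly
    htmlspecialchars_decode(
        // isolate the querystring: everything after the first "?"
        parse_url(
            $url,
            PHP_URL_QUERY
        )
    ),
    // parse_str() writes the key => value pairs into this array
    $parts
);
echo $parts['url'];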

Output:

https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
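
To run the same steps over every link in the feed, they can be wrapped in a small function. This is only a sketch: the function name `extractTargetUrl` is hypothetical, the `example.com` link is made up, and the null return for links without a `url` parameter is my assumption about what a caller would want.

```php
<?php
// Hypothetical helper (not part of the snippet above): returns the value of
// the `url` query parameter, or null when the link has none.
function extractTargetUrl(string $link): ?string
{
    // parse_url() gives null when there is no querystring, false when malformed.
    $query = parse_url($link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // Decode entities such as &amp; before splitting into key => value pairs.
    parse_str(htmlspecialchars_decode($query), $parts);

    return isset($parts['url']) && is_string($parts['url']) ? $parts['url'] : null;
}

$links = [
    // plain "&" separators
    'https://www.google.com/url?rct=j&sa=t&url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&ct=ga&cd=CAIyGjRm',
    // entity-encoded "&amp;" separators
    'https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm',
    // invented link without any querystring
    'https://example.com/no-query-string',
];

foreach ($links as $link) {
    var_dump(extractTargetUrl($link));
}
```

The first two links should both yield the timeslive.co.za address, and the third should yield NULL.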

I super-love regex, but not for every task. Avoiding regex here will make your script more readable, reliable, and easier to maintain.
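
The rawurl= scenario mentioned earlier can be sketched as follows. The link is invented for illustration (Google does not send a `rawurl` parameter as far as I know), and the pattern is a lookbehind-style regex of the kind commonly suggested for this task, not anything taken from the feed itself.

```php
<?php
// Hypothetical link: the `rawurl` parameter and the example.com targets are invented.
$link = 'https://www.google.com/url?rct=j&rawurl=http://example.com/raw&url=https://example.com/real&ct=ga';

// The lookbehind only checks for "url=", which is also the tail of "rawurl=",
// so the first match comes from the wrong parameter.
preg_match('~(?<=url=)https?:\S+?(?=&|$)~', $link, $m);
$fromRegex = $m[0];

// parse_str() keys each value by its full parameter name, so "url" and
// "rawurl" cannot be confused.
parse_str(parse_url($link, PHP_URL_QUERY), $parts);
$fromParser = $parts['url'];

var_dump($fromRegex);  // string(22) "http://example.com/raw"
var_dump($fromParser); // string(24) "https://example.com/real"
```

The regex could of course be tightened (for instance by requiring `?` or `&` before `url=`), but that is exactly the kind of maintenance the native functions spare you.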

mickmackusa
  • I don't know if you definitely need to trim the trailing slash. If so, then `rtrim($parts['url'], '/');` – mickmackusa Nov 01 '20 at 07:18
  • This one is much slower than regex... I just tested both on array of `10000000 urls`. Your method took `21.872284889221s`, regex method took `7.2494368553162s`. – Flash Thunder Nov 01 '20 at 07:31
  • Of course it is. Do you know that the OP is executing this on 10,000,000 urls? This is not about micro-optimization, this is about a reliable solution that is readable and maintainable and does not require everyone on the IT team to understand regex. – mickmackusa Nov 01 '20 at 07:32
  • But using something for semantics, that is 4 times slower ... well not a good idea in my opinion. – Flash Thunder Nov 01 '20 at 07:33
  • Now I would like you to run the same benchmark test on the 10 urls provided and tell me how many minutes the script will lag between the two techniques. If the OP is isolating the querystring values of 10,000,000 urls, then I will certainly support the use of regex to save 14 seconds. – mickmackusa Nov 01 '20 at 07:34
  • Correct, you are missing the point of my answer. On 10 urls, no human will ever notice the performance difference between these answers. I make the same argument about using regex-vs-dom-parser on valid HTML too -- parsers are far less likely to silently fail when the input suddenly changes its format/structure. This is going nowhere. You will not convince me of anything that I don't already know about these options, so I will disengage with your micro-optimization debate. – mickmackusa Nov 01 '20 at 07:36
  • I am just saying that you are wrong with "regex seems an inappropriate tool". And I gave the OP a clue why. You don't know the whole project, maybe he wants to parse years of data. These 10 records are for only 1 day and might be a sample. You may like it or not. Not saying that your answer is incorrect. Just not that efficient. – Flash Thunder Nov 01 '20 at 07:39
  • This is clearly the most appropriate answer. Using the dedicated tool for the job makes the code more solid, less obscure, easier to maintain. In 95% of real-life scenarios, micro-optimizations won't make any kind of visible difference. So in 95% of cases, this is the better answer. Since OP didn't mention how many URLs he was dealing with, might as well go the safe and solid way. – Jeto Nov 01 '20 at 09:24
  • @Jeto not really `appropriate`, check the question tags. – Flash Thunder Nov 01 '20 at 09:52
  • The OP does not always know the right tool for the job -- that's part of the benefit of asking for help here. I'll adjust the tags now. – mickmackusa Nov 01 '20 at 09:53
  • still saying that an answer is appropriate when it doesn't match the tags is not really true, changing tags to match the answer is somewhat weird, especially when someone did answer your question directly; but let's face the truth, you are making a big deal of nothing - was curious about the performance and posted my results, not even as an answer, but as a comment - comments are for commenting; regex answer is not mine, but I do think that performance is better than semantics (that statement can vary in different projects - but this one seems to be some data scraper) – Flash Thunder Nov 01 '20 at 10:01
  • @FlashThunder So if someone was asking for a regex to parse entire HTML documents, you wouldn't strongly advise to [consider not](https://stackoverflow.com/a/1732454/965834)? If someone were tagging their question `mysql-real-escape-string`, wouldn't you strongly suggest they consider prepared statements instead? Tags mostly just indicate which solutions OP was considering to begin with. If they had considered several of them, they would have stated that in their actual post and explained why they needed regular expressions specifically. But they didn't, so it probably doesn't matter. – Jeto Nov 01 '20 at 10:05
  • @Jeto it's a totally different thing, `mysql_real_escape_string()` is just not a valid way, it is dangerous. It's not about semantics at all. And here the only gain is that someone considers it more beautiful code, nothing more. And it's slower. Saying that this way is more valid than regex is simply wrong. Results are the same (not counting execution time). But let's not make any more mess on SO. We have different views, that's fine. It's up to OP to decide, it's up to us to give pros and cons, so he has the tools to decide. – Flash Thunder Nov 01 '20 at 10:08
  • @FlashThunder My point was only about tags and how they don't really mean that OPs are looking for a solution *specifically* based on them. And for the rest of your comment, you missed pretty much all of the main reasons to go for this solution, despite them being mentioned multiple times. But anyway, this comment thread has gone way too long already. – Jeto Nov 01 '20 at 10:12
  • @Jeto I didn't miss 'em, just don't agree with 'em. In fact I don't even find that code clearer than regex. But that's my personal view and many people wouldn't agree with that. Even adding a few milliseconds can add up to seconds or even minutes in bigger projects. – Flash Thunder Nov 01 '20 at 10:14
1

You may use this regex in preg_match_all:

(?<=url=)https?:\S+?(?=&amp;|$)

RegEx Demo

RegEx Details:

  • (?<=url=): If we have url= before current position
  • https?:\S+?: Match a URL starting with http: or https:
  • (?=&amp;|$): If we have &amp; or line end after current position

Code:

php > $s = "https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm";
php > preg_match_all('~(?<=url=)https?:\S+?(?=&amp;|$)~', $s, $m);
php > print_r($m[0]);
Array
(
    [0] => https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/
)
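
Note that the pattern has to be wrapped in delimiters (here `~`) when it is passed to a preg_* function. A minimal sketch of what goes wrong otherwise, using the sample link from above; the variable names are mine, and the lookahead is shortened to `&`, which also covers `&amp;`:

```php
<?php
$bare = '(?<=url=)https?:\S+?(?=&|$)'; // pattern without delimiters
$link = 'https://www.google.com/url?rct=j&amp;sa=t&amp;url=https://www.timeslive.co.za/sunday-times/news/2020-11-01-hawks-following-former-steinhoff-ceo-markus-joostes-money/&amp;ct=ga&amp;cd=CAIyGjRm';

// Without delimiters PHP takes the leading "(" ... ")" as the delimiter pair
// and reads the rest ("https?:...") as modifiers, hence "Unknown modifier 'h'".
// The @ only hides that warning for this demonstration.
$failed = @preg_match($bare, $link, $m);

// The same pattern wrapped in "~" delimiters compiles and matches.
$ok = preg_match('~' . $bare . '~', $link, $m);

var_dump($failed); // bool(false)
var_dump($ok);     // int(1)
echo $m[0], PHP_EOL;
```

Any character that is not alphanumeric, a backslash, or whitespace can serve as the delimiter; `~` is convenient for URL patterns because a `/` inside the pattern then needs no escaping.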
anubhava
  • When I run the regex in a `preg_match` or `preg_match_all` function I get an error: `preg_match(): Unknown modifier 'h'` Any suggestions why it does not take the pattern? – Carol.Kar Nov 01 '20 at 06:52
  • I have added a sample php code in my answer. Please check – anubhava Nov 01 '20 at 07:08