preg_replace_callback Matching URLs in HTML Paragraphs

Question

I am trying to take the URLs that are in single HTML paragraphs and extract them with PHP's preg_replace_callback. Right now, WordPress does this with:

preg_replace_callback( '|^\s*(https?://[^\s"]+)\s*$|im', 'callback_function', $string );

But that only matches a URL on its own line, with no HTML around it. What I need is to match the URL in something like this:

<p>http://youtube.com/</p>

I don't care about the space before or after the paragraph tag, all I want to do is extract that URL to replace it with more detailed information with preg_replace_callback.

Any help out there?

UPDATE: Okay, I have a post's text with a number of paragraphs like this:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis et nunc vel felis vulputate tincidunt. In dapibus tempus sollicitudin. Nullam quis ultricies tortor. Ut malesuada aliquet enim. Aliquam et lobortis urna. Sed commodo malesuada malesuada. Donec cursus nisi nec mauris venenatis pharetra. Curabitur ut leo purus.</p>

<p>http://youtube.com/</p>

<p>Etiam non odio tellus, vel imperdiet nunc. Praesent rutrum sagittis purus, quis pretium eros varius ut. http://google.com/ Ut id orci eu lacus aliquam luctus. Sed dolor quam, suscipit eu dapibus feugiat, lacinia vitae augue.</p>

From that text, all I want to extract is that http://youtube.com/ in the paragraph on its own. I see there is a Google.com link in another paragraph, but I don't want that. All I want is that link (or links) in their own paragraph alone. It would pass to my callback 'http://youtube.com/' as the argument.

Sean, could you post a couple example blurbs of the edge-cases that need to be considered? I'd imagine that the one you posted is fairly simple. — jwegner, Aug 24 '12 at 20:05
Agree. Please post examples. Do you want to pull URLs with the surrounding tags or without? I ask because your example *does* match the URL in the paragraph exactly as posted. Examples will help clarify exactly what you are trying to describe. — Nilpo, Aug 24 '12 at 20:07
Oh, guess I misunderstood - didn't realize you only needed <p>, thought you needed all tags :) – jwegner, Aug 24 '12 at 20:29

score 1 · Accepted Answer · answered Aug 24 '12 at 20:09

You could try this: http://regex101.com/r/rN4vB3

/<p>\s*(https?:\/\/(?:(?!<\/?p>).)+)\s*<\/p>/

The logic is that we look for a <p> tag whose content starts with http:// or https://, and then capture everything that follows until we hit a </p>. The first capture group will hold the URL.

This might not be an optimal solution, but should do what you asked.
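
For reference, here is a minimal sketch of how that pattern might be plugged into preg_replace_callback, run against a shortened version of the sample text from the question. The callback name embed_lone_url and the "[embed: ...]" placeholder are made up for illustration; a real callback would do the oEmbed lookup instead. The delimiter is switched to "~" so the slashes need no escaping.

```php
<?php
// Shortened version of the sample post text from the question.
$content = <<<HTML
<p>Lorem ipsum dolor sit amet.</p>

<p>http://youtube.com/</p>

<p>Praesent rutrum sagittis purus. http://google.com/ Ut id orci eu lacus.</p>
HTML;

// Same pattern as in the answer, with "~" as delimiter instead of "/".
$pattern = '~<p>\s*(https?://(?:(?!</?p>).)+)\s*</p>~';

// Hypothetical callback: $matches[1] is the first capture group (the URL).
function embed_lone_url(array $matches): string
{
    // The group runs up to </p>, so it can include trailing whitespace.
    $url = trim($matches[1]);
    return '<p>[embed: ' . $url . ']</p>';
}

$result = preg_replace_callback($pattern, 'embed_lone_url', $content);
echo $result, "\n";
```

Only the paragraph that starts with a URL is rewritten here; the google.com link in the middle of the last paragraph is left alone. One thing to watch: since the group takes everything up to </p>, a paragraph that starts with a URL and then continues with other text would match as well, so a stricter URL body such as [^\s<]+ may be worth testing if that matters.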

answered Aug 24 '12 at 20:09

Firas Dib


This is perfect for me. I wanted to take these URLs alone and use oEmbed to pull in YouTube videos, Flickr, etc. And I have to replace the URL with the Embed code -- perfect. – Sean Fisher Aug 24 '12 at 20:21

score 1 · Answer 2 · jwegner · 2012-08-24T20:28:00.153

I may be misunderstanding your question, but here's a REGEXP that (ideally) will match any URL in a block of text.

/<[A-Za-z0-9]+[^>]*>https?:\/\/([A-Za-z0-9-]\.)?[A-Za-z0-9][A-Za-z0-9-]+?\.[A-Za-z0-9]+[A-Za-z0-9-\._~:\/\?#\[\]@!$&'()\*+,;=]*<\/[A-Za-z0-9]+>/i

PLEASE bear in mind that regexp is incredibly complex, and there are almost certainly edge cases that I haven't considered here. If you can update your question with some examples that won't work here, or perhaps leave a comment, I will update the answer.

Update 2
The pattern above is the updated version and should be fairly resilient - it takes into consideration optional subdomains, https, and attributes on the HTML tag.
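
A rough sketch of how this pattern could be used from PHP, in case it helps with testing. Note that PHP's preg_* functions do not accept a g modifier (preg_replace_callback already replaces every match), so only i is used below. The pattern has no capture group around the URL itself, so the callback here strips the tag from the whole match instead; the sample HTML and the "[embed: ...]" placeholder are assumptions for illustration only.

```php
<?php
// The pattern is kept in a nowdoc so the quote and "$" inside the
// character class need no extra escaping. Only the "i" modifier is used.
$pattern = <<<'REGEX'
/<[A-Za-z0-9]+[^>]*>https?:\/\/([A-Za-z0-9-]\.)?[A-Za-z0-9][A-Za-z0-9-]+?\.[A-Za-z0-9]+[A-Za-z0-9-\._~:\/\?#\[\]@!$&'()\*+,;=]*<\/[A-Za-z0-9]+>/i
REGEX;

// Hypothetical input: one lone URL in a tag with attributes, one inline URL.
$html = '<p class="video">https://www.youtube.com/watch?v=abc123</p>' . "\n"
      . '<p>See http://google.com/ for more.</p>';

// Collect the URLs that matched, and swap each match for a placeholder.
$found = [];
$out = preg_replace_callback($pattern, function (array $m) use (&$found): string {
    $url = strip_tags($m[0]);   // whole match minus the surrounding tag
    $found[] = $url;
    return '<p>[embed: ' . $url . ']</p>';   // stand-in for real embed markup
}, $html);

print_r($found);
echo $out, "\n";
```

Because the pattern accepts any tag name, it would also match a URL that is the sole content of an <a> or <span>, which may or may not be what you want.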

edited Aug 24 '12 at 20:28

answered Aug 24 '12 at 20:14

jwegner


Timeout, SO is formatting my escapes weird. Don't copy that REGEX! – jwegner Aug 24 '12 at 20:16
Better - sorry for the mishap – jwegner Aug 24 '12 at 20:17
Ah, well I'm taking Markdown content and wanting to replace the lone URLs with oEmbeded things. Thank you anyway! HTML is terrible to parse but at least I know what it's going to look like. :) – Sean Fisher Aug 24 '12 at 20:22
That one should be a little better. Still makes me nervous - test heavily - but it should be close. – jwegner Aug 24 '12 at 20:28
