
I am trying to take the URLs that are in single HTML paragraphs and extract them with PHP's preg_replace_callback. Right now, WordPress does this with:

preg_replace_callback( '|^\s*(https?://[^\s"]+)\s*$|im', 'callback_function', $string );

But that matches a URL on its own line -- no HTML around it. What I need to do is to match the URL from something like this:

<p>http://youtube.com/</p>

I don't care about the space before or after the paragraph tag, all I want to do is extract that URL to replace it with more detailed information with preg_replace_callback.

Any help out there?


UPDATE: Okay, I have a post's text with a number of paragraphs like this:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis et nunc vel felis vulputate tincidunt. In dapibus tempus sollicitudin. Nullam quis ultricies tortor. Ut malesuada aliquet enim. Aliquam et lobortis urna. Sed commodo malesuada malesuada. Donec cursus nisi nec mauris venenatis pharetra. Curabitur ut leo purus.</p>

<p>http://youtube.com/</p>

<p>Etiam non odio tellus, vel imperdiet nunc. Praesent rutrum sagittis purus, quis pretium eros varius ut. http://google.com/ Ut id orci eu lacus aliquam luctus. Sed dolor quam, suscipit eu dapibus feugiat, lacinia vitae augue.</p>

From that text, all I want to extract is that http://youtube.com/ in the paragraph on its own. I see there is a Google.com link in another paragraph, but I don't want that. All I want is that link (or links) in their own paragraph alone. It would pass to my callback 'http://youtube.com/' as the argument.

Sean Fisher
  • Sean, could you post a couple example blurbs of the edge-cases that need to be considered? I'd imagine that the one you posted is fairly simple. – jwegner Aug 24 '12 at 20:05
  • Agree. Please post examples. Do you want to pull URLs with the surrounding tags or without? I ask because your example *does* match the URL in the paragraph exactly as posted. Examples will help clarify exactly what you are trying to describe. – Nilpo Aug 24 '12 at 20:07
  • Just posted an update with what I need to match! – Sean Fisher Aug 24 '12 at 20:13
  • Oh, guess I misunderstood - didn't realize you only needed <p>, thought you needed all tags :) – jwegner Aug 24 '12 at 20:29

2 Answers


You could try this: http://regex101.com/r/rN4vB3

/<p>\s*(https?:\/\/(?:(?!<\/?p>).)+)\s*<\/p>/

The logic is that we look for a <p> tag whose content starts with http, and then just get everything else in there until we hit a </p>. The first capture group will hold the URL.

This might not be an optimal solution, but should do what you asked.
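
For illustration, here is a minimal sketch of how that pattern could be plugged into preg_replace_callback. The callback name embed_placeholder and the "[embed: ...]" replacement text are made up for the example, the sample text is shortened from the question, and the delimiter is switched to ~ so the slashes need no escaping.

```php
<?php
// Sample post content, shortened from the question.
$content = <<<'HTML'
<p>Lorem ipsum dolor sit amet.</p>

<p>http://youtube.com/</p>

<p>Etiam non odio tellus. http://google.com/ Ut id orci eu lacus.</p>
HTML;

// Hypothetical callback: $m[0] is the whole <p>...</p> match, $m[1] the URL.
// trim() drops any whitespace the greedy group picked up before </p>.
function embed_placeholder(array $m): string
{
    return '<p>[embed: ' . trim($m[1]) . ']</p>';
}

// Same pattern as above, with '~' as the delimiter; 'i' also accepts <P>.
$pattern = '~<p>\s*(https?://(?:(?!</?p>).)+)\s*</p>~i';

$result = preg_replace_callback($pattern, 'embed_placeholder', $content);
echo $result, "\n";
```

Only the youtube.com paragraph is rewritten here; the google.com link in the middle of a sentence is left alone. One caveat: a paragraph that starts with a URL and continues with other text would also match, with the extra text ending up in the capture group. If that matters, replacing the inner group with something like [^\s<]+ should restrict it to the URL alone.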

Firas Dib
  • This is perfect for me. I wanted to take these URLs alone and use oEmbed to pull in YouTube videos, Flickr, etc. And I have to replace the URL with the Embed code -- perfect. – Sean Fisher Aug 24 '12 at 20:21

I may be misunderstanding your question, but here's a REGEXP that (ideally) will match any URL in a block of text.

/<[A-Za-z0-9]+[^>]*>https?:\/\/([A-Za-z0-9-]\.)?[A-Za-z0-9][A-Za-z0-9-]+?\.[A-Za-z0-9]+[A-Za-z0-9-\._~:\/\?#\[\]@!$&'()\*+,;=]*<\/[A-Za-z0-9]+>/gi

PLEASE bear in mind that regexp is incredibly complex, and there are almost certainly edge cases that I haven't considered here. If you can update your question with some examples that won't work here, or perhaps leave a comment, I will update the answer.

Update 2
Here's one that should be fairly resilient - takes into consideration optional subdomains, https, and attributes on the HTML tag.
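
A rough sketch of trying the pattern from PHP, assuming the /gi form above was written in JavaScript style: PHP's PCRE functions have no g modifier (preg_match_all already finds every match), so only i is kept. The class="video" attribute and the watch?v=abc URL are made-up test inputs.

```php
<?php
// The pattern from above, stored in a nowdoc so the quote and $ need no escaping.
// The trailing "g" is dropped: PCRE in PHP has no such modifier.
$pattern = <<<'REGEX'
/<[A-Za-z0-9]+[^>]*>https?:\/\/([A-Za-z0-9-]\.)?[A-Za-z0-9][A-Za-z0-9-]+?\.[A-Za-z0-9]+[A-Za-z0-9-\._~:\/\?#\[\]@!$&'()\*+,;=]*<\/[A-Za-z0-9]+>/i
REGEX;

// Hypothetical input: a bare paragraph, a tag with an attribute plus a
// subdomain and https, and a URL buried in running text.
$html = <<<'HTML'
<p>http://youtube.com/</p>
<p class="video">https://www.youtube.com/watch?v=abc</p>
<p>Some text http://google.com/ more text.</p>
HTML;

preg_match_all($pattern, $html, $matches);

// The only capture group is the optional subdomain, so take each full match
// and strip the surrounding tags to get the bare URL.
$urls = array_map('strip_tags', $matches[0]);

print_r($urls);
```

With this input the first two paragraphs match and the google.com link in the middle of a sentence does not, since the URL has to sit directly between an opening and a closing tag.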

jwegner