5

There are lots of answers to this question, but not a single complete one:

Using a single regular expression, how do you extract the page title from <title>Page title</title>?

Title tags can be written in several other ways, such as:

<TITLE>Page title</TITLE>

<title>
 Page title</title>
<title>
 Page title
</title>

<title lang="en-US">Page title</title>

...or any combination of the above.

And it can be on its own line or in between other tags:

<head>
  <title>Page title</title>
</head>

<head><title>Page title</title></head>

Thanks in advance for any help.

UPDATE: So, the regex approach might not be the best solution to this. Which PHP-based HTML parser could handle all of these scenarios, whether the HTML is well formed or not?

UPDATE 2: sp00m's regex (https://stackoverflow.com/a/13510307/1844607) seems to be working in all cases. I'll get back to this if needed.

Jari

3 Answers

11

Use an HTML parser instead. But if you have to use a regex:

<title[^>]*>(.*?)</title>
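
In PHP, the pattern above could be applied roughly as follows. This is only a minimal sketch: the function name `extract_title_regex` is hypothetical, and the `i` and `s` modifiers are assumptions added so that uppercase tags and titles spanning several lines also match.

```php
<?php
// Hypothetical helper: returns the trimmed title text, or null if no <title> is found.
function extract_title_regex(string $html): ?string
{
    // '#' is used as the delimiter so the '/' in </title> needs no escaping.
    // i = case-insensitive, s = '.' also matches newlines.
    if (preg_match('#<title[^>]*>(.*?)</title>#is', $html, $matches) === 1) {
        return trim($matches[1]);
    }
    return null;
}

echo extract_title_regex("<head><TITLE lang=\"en-US\">\n Page title\n</TITLE></head>");
```

The capture group holds the raw text between the tags, so `trim()` is what removes the surrounding newlines and spaces.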


sp00m
  • Turned into PHP, `preg_match("/<title[^>]*>(.*?)<\/title>/",$html,$title);` didn't get the title from http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:29
  • @Jari Use the `i` (case insensitive), the `m` (multiline) and the `s` (the dot metachar includes new lines) flags, i.e. `/<title[^>]*>(.*?)<\/title>/ims`. See [pattern modifiers](http://php.net/manual/en/reference.pcre.pattern.modifiers.php). – sp00m Nov 22 '12 at 10:37
  • This seems to be working, thanks! I'll get back to this if I find some new scenarios where it fails. – Jari Nov 22 '12 at 10:44
  • @Jari Moreover, double-escape the nested `/` char: `/<title[^>]*>(.*?)<\\/title>/ims` – sp00m Nov 22 '12 at 10:45
  • Or use something different as the delimiter: preg_match("#<title[^>]*>(.*?)</title>#Umsi", $html, $matches); – Alex Apr 17 '14 at 07:49
  • It is also a good idea to keep only the first x characters of the HTML content, e.g. mb_substr($urlContent, 0, 20000, "utf-8"); I prefer the first 20000. – muratgozel Jun 10 '15 at 15:37
2

Use the DOMDocument class:

$doc = new DOMDocument();
$doc->loadHTML($html);
$titles = $doc->getElementsByTagName("title");
// item() is a method on DOMNodeList, so call it with parentheses.
echo $titles->item(0)->nodeValue;
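
A slightly more defensive version of the DOMDocument approach might look like the sketch below. The function name `extract_title_dom` is hypothetical, and it assumes PHP's DOM extension is enabled. It uses `libxml_use_internal_errors()` so that malformed markup does not flood the output with warnings, and it checks that a title element exists before reading it.

```php
<?php
// Hypothetical helper: returns the trimmed title text, or null if none is found.
function extract_title_dom(string $html): ?string
{
    // loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return null;
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }
    return trim($titles->item(0)->nodeValue);
}

echo extract_title_dom("<head><TITLE lang=\"en-US\">\n Page title\n</TITLE></head><body><p>unclosed");
```

Whether this copes with a particular broken page still depends on how libxml recovers from that page's errors, so it is worth testing against real samples.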
Asad Saeeduddin
  • This approach could work, but it seems to fail in cases where the HTML isn't as "validated" as it should be. Such as this: http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:20
  • @Jari What does that link have to do with anything? Could you please provide some cases where a regex is a superior approach due to the html not being "validated" enough? The less well formed your HTML the more impossible it becomes to come up with one catchall regex. – Asad Saeeduddin Nov 22 '12 at 10:26
  • That's just one example of the many pages this should be able to get the page title from. If an HTML parser is better than any regex, that suits me fine, but then the real question is: which HTML parser handles unvalidated (broken) HTML well enough? – Jari Nov 22 '12 at 10:32
  • @Jari Could you please give me some sample HTML instead of pointing me to your website? – Asad Saeeduddin Nov 22 '12 at 10:34
  • I'll just copy-paste what's on there: ` The Top 10 Grand Theft Auto Characters of All Time - Features - www.GameInformer.com ` but it might not be exact match, since there might be some characters lost. (Edit: Yup, newlines and empty spaces are missing) – Jari Nov 22 '12 at 10:38
0

Use this regex:

<title>[\s\S]*?</title>
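
As written, this pattern matches the whole element including the tags, and it does not cover uppercase tags or attributes. A minimal PHP sketch of how it might be used is shown below; the capture group and the `trim()` call are additions made for illustration, not part of the pattern above.

```php
<?php
$html = "<head>\n<title>\n Page title\n</title>\n</head>";

// [\s\S] matches any character, newlines included, so the 's' modifier is not needed.
// The parentheses are an addition: they capture only the text between the tags.
$title = null;
if (preg_match('#<title>([\s\S]*?)</title>#', $html, $matches) === 1) {
    $title = trim($matches[1]);
}

echo $title;
```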
Alexis Pigeon
F11
  • Not sure if I got it entirely right in PHP `preg_match("/<head>[\s]*?<title>[\s\S]*?<\/title>[\s]*?<\/head>/",$html,$title);`, but this didn't get the page title from here: http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:27