Regular expression to get page title

Question

There are lots of answers to this question, but not a single complete one:

Using a single regular expression, how do you extract the page title from <title>Page title</title>?

Title tags can be written in several other ways, such as:

<TITLE>Page title</TITLE>

<title>
 Page title</title>
<title>
 Page title
</title>

<title lang="en-US">Page title</title>

...or any combination of the above.

And it can be on its own line or in between other tags:

<head>
  <title>Page title</title>
</head>

<head><title>Page title</title></head>

Thanks in advance for your help.

UPDATE: So the regex approach might not be the best solution to this. Which PHP-based HTML parser can handle all of these scenarios, whether the HTML is well formed or not?

UPDATE 2: sp00m's regex (https://stackoverflow.com/a/13510307/1844607) seems to be working in all cases. I'll get back to this if needed.

That's one of the reasons why regex is the wrong tool for this job. Why don't you use an HTML parser? – stema, Nov 22 '12 at 10:07

score 11 · Accepted Answer · edited May 23 '17 at 12:34

Use an HTML parser instead. But if you have to use a regex:

<title[^>]*>(.*?)</title>
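
As a minimal sketch of how this pattern could be applied in PHP: the helper name `extractTitle` is hypothetical, and the `i` and `s` modifiers are an assumption taken from the comments under this answer rather than from the pattern itself. Using `~` as the delimiter avoids having to escape the `/` in `</title>`.

```php
<?php
// Hypothetical helper (not part of the answer): returns the trimmed
// title text, or null when no <title> element is found.
function extractTitle(string $html): ?string
{
    // i: case-insensitive, so <TITLE> matches too.
    // s: "." also matches newlines, so multi-line titles match.
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) === 1) {
        return trim($m[1]);
    }
    return null;
}

// The variants listed in the question.
$samples = [
    '<title>Page title</title>',
    '<TITLE>Page title</TITLE>',
    "<title>\n Page title\n</title>",
    '<title lang="en-US">Page title</title>',
    "<head>\n  <title>Page title</title>\n</head>",
    '<head><title>Page title</title></head>',
];

foreach ($samples as $html) {
    var_dump(extractTitle($html));
}
```

The capture group keeps the surrounding whitespace, which is why the sketch trims the result before returning it.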

answered Nov 22 '12 at 10:10

sp00m

Turned into PHP `preg_match("/<title[^>]*>(.*?)<\/title>/",$html,$title);` didn't get the title from http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:29
@Jari Use the `i` (case insensitive), the `m` (multiline) and the `s` (the dot metachar includes new lines) flags, i.e. `/<title[^>]*>(.*?)<\/title>/ims`. See [pattern modifiers](http://php.net/manual/en/reference.pcre.pattern.modifiers.php). – sp00m Nov 22 '12 at 10:37
This seems to be working, thanks! I'll get back to this if I find some new scenarios where it fails. – Jari Nov 22 '12 at 10:44
@Jari Moreover, double-escape the nested `/` char: `/<title[^>]*>(.*?)<\\/title>/ims` – sp00m Nov 22 '12 at 10:45
Or use something different as the delimiter: `preg_match("#<title[^>]*>(.*?)</title>#Umsi", $html, $matches);` – Alex Apr 17 '14 at 07:49
It is also a good idea to keep only the first x characters of the HTML content, e.g. `mb_substr($urlContent, 0, 20000, "utf-8");`. I prefer the first 20000. – muratgozel Jun 10 '15 at 15:37
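
Two points from the comments above can be checked with a short sketch. First, PCRE's `U` modifier inverts greediness, so a `?` after a quantifier makes it greedy again; combining `U` with `(.*?)` therefore captures up to the last `</title>` instead of the first. Second, truncating the input before matching only helps if the title sits inside the kept prefix. The helper name `extractTitleFromPrefix` and the sample strings are hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
$html = '<title>A</title><p>x</p><title>B</title>';

// With U, the trailing "?" flips (.*?) back to greedy.
preg_match('#<title[^>]*>(.*?)</title>#Umsi', $html, $withU);

// Without U, (.*?) stays lazy and stops at the first </title>.
preg_match('#<title[^>]*>(.*?)</title>#si', $html, $withoutU);

var_dump($withU[1]);    // 'A</title><p>x</p><title>B'
var_dump($withoutU[1]); // 'A'

// Hypothetical helper: search only the first $limit characters.
function extractTitleFromPrefix(string $html, int $limit = 20000): ?string
{
    $prefix = mb_substr($html, 0, $limit, 'UTF-8');
    if (preg_match('#<title[^>]*>(.*?)</title>#si', $prefix, $m) === 1) {
        return trim($m[1]);
    }
    return null; // no title within the prefix
}
```

With `U` present, writing the group as `(.*)` would give the lazy behaviour; dropping `U` and keeping `(.*?)` is the more common spelling.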

score 2 · Answer 2 · answered Nov 22 '12 at 10:10

Use the DOMDocument class:

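$doc = new DOMDocument();
// Parse the HTML string into a DOM tree.
$doc->loadHTML($html);
$titles = $doc->getElementsByTagName("title");
// item() is a method call, so it takes parentheses, not brackets.
echo $titles->item(0)->nodeValue;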
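
For pages with malformed markup, `loadHTML()` usually still builds a tree but emits warnings along the way. A hedged sketch of one way to wrap the snippet above follows; the helper name `extractTitleDom` is hypothetical, and collecting the parser warnings through `libxml_use_internal_errors()` is an assumption about how you want to handle them, not part of the original answer.

```php
<?php
// Hypothetical helper: title text via DOMDocument, or null when the
// document has no <title> element.
function extractTitleDom(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() does not accept an empty string
    }

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    return $node === null ? null : trim($node->textContent);
}

// Unclosed tags, upper-case tag name, attribute, multi-line title.
$broken = "<html><head><TITLE lang=\"en\">\n Page title\n</TITLE></head><body><p>unclosed";
var_dump(extractTitleDom($broken));
```

Checking `item(0)` for null avoids an error on pages that have no title at all.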

answered Nov 22 '12 at 10:10

Asad Saeeduddin

This approach could work, but it seems to fail in cases where the HTML isn't as "validated" as it should be. Such as this: http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:20
@Jari What does that link have to do with anything? Could you please provide some cases where a regex is a superior approach due to the html not being "validated" enough? The less well formed your HTML the more impossible it becomes to come up with one catchall regex. – Asad Saeeduddin Nov 22 '12 at 10:26
That's just one example of the many pages this should be able to get the page title from. If an HTML parser is better than any regex, that suits me fine, but then the real question is: which HTML parser handles unvalidated (broken) HTML well enough? – Jari Nov 22 '12 at 10:32
@Jari Could you please give me some sample HTML instead of pointing me to your website? – Asad Saeeduddin Nov 22 '12 at 10:34
I'll just copy-paste what's on there: ` The Top 10 Grand Theft Auto Characters of All Time - Features - www.GameInformer.com ` but it might not be exact match, since there might be some characters lost. (Edit: Yup, newlines and empty spaces are missing) – Jari Nov 22 '12 at 10:38

score 0 · Answer 3 · edited Nov 22 '12 at 10:42

Use this regex:

<title>[\s\S]*?</title>
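
A small sketch of what this pattern does and does not match, using sample strings taken from the question. `[\s\S]` matches any character including newlines, so no `s` modifier is needed, but the pattern has no capture group (the match includes the tags), is case-sensitive as written, and does not allow attributes on the opening tag.

```php
<?php
$pattern = '~<title>[\s\S]*?</title>~';

// Multi-line title: matches, and $m[0] includes the tags themselves.
$multiline = preg_match($pattern, "<title>\n Page title\n</title>", $m);

// Upper-case tag: no match without the "i" modifier.
$upper = preg_match($pattern, '<TITLE>Page title</TITLE>');

// Attribute on the opening tag: no match, since "<title>" is literal.
$attr = preg_match($pattern, '<title lang="en-US">Page title</title>');

var_dump($multiline, $m[0], $upper, $attr);
```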

answered Nov 22 '12 at 10:20

F11

Not sure if I got it entirely right in PHP `preg_match("/<head>[\s]*?<title>[\s\S]*?<\/title>[\s]*?<\/head>/",$html,$title);`, but this didn't get the page title from here: http://www.gameinformer.com/b/features/archive/2012/11/21/the-top-10-grand-theft-auto-characters-of-all-time.aspx – Jari Nov 22 '12 at 10:27
