1

I want to parse html documents for links to twitter profiles using a regex and preg_match_all() in PHP. The twitter links are in this form:

http(s)://twitter.com/#!/twitter_name

I only want to grab links that point purely to the profile page (i.e., nothing after the twitter_name).

I would like to handle both http and https, since both are common in these links.

I would also like to handle //www.twitter.com and //twitter.com, which are also common.

How should I structure my regex?

T. Brian Jones

4 Answers

2

How about something like:

(https?:)*\/\/(www.)*twitter.com\/#!/([A-Za-z0-9_]*)

I'm not sure which characters are valid in a Twitter handle, but I'm assuming 0-9, letters, and underscores.

Probably best to run it in case-insensitive mode and get rid of the A-Z as well.

Mike Christensen
  • I'm pretty sure that `[(http:|https:)]*` doesn't match what you think it should. It matches `hhhhhhh` or `))::::hpph:|||` for example. – Toto Dec 15 '11 at 15:14
  • why kleene star? that would overmatch! – clyfe Dec 17 '11 at 07:17
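
To illustrate the two comments above: a character class treats its contents as a set of single characters, and `*` permits any number of repetitions, whereas an optional group with `?` accepts the scheme at most once. The following is a minimal sketch; the sample strings and variable names are made up for the demonstration.

```php
<?php
// Character class: any run of the single characters ( h t p : | s )
$classPattern = '~^[(http:|https:)]*$~';
// Starred group: "http:" or "https:" repeated any number of times
$starPattern  = '~^(?:https?:)*$~';
// Optional group: "http:" or "https:" at most once
$groupPattern = '~^(?:https?:)?$~';

// Hypothetical test strings, including the ones from the comment
$samples = ['https:', 'hhhhhhh', '))::::hpph:|||', 'http:http:'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = [
        'class' => preg_match($classPattern, $s) === 1,
        'star'  => preg_match($starPattern, $s) === 1,
        'group' => preg_match($groupPattern, $s) === 1,
    ];
}

var_dump($results);
```

Only `https:` passes all three patterns; the character class accepts every sample, and the starred group also accepts `http:http:`.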
2

Most general regex (that stops at "/" or space):

(https?:)?\/\/(www\.)?twitter\.com\/(#!\/)?([^\/ ]+)
clyfe
1

Try

preg_match_all('|https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)|im', $text, $matched);

I don't know exactly which characters can appear in a Twitter username, so I assumed [a-z0-9_]+. $matched[1] should contain the usernames.
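
Index 1 of the result array is only populated when the pattern contains a capturing group around the name. Here is a small sketch of the difference, assuming the default PREG_PATTERN_ORDER; the sample input and variable names are invented for illustration.

```php
<?php
// Made-up sample input
$text = '<a href="https://twitter.com/#!/some_user">profile</a>';

// No parentheses around the name: only the full match is returned (index 0)
preg_match_all('~https?://(?:www\.)?twitter\.com/#!/[a-z0-9_]+~i', $text, $noGroup);

// Parentheses around the name: index 1 holds the captured usernames
preg_match_all('~https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)~i', $text, $withGroup);

$hasGroup1 = isset($noGroup[1]);  // false, there is no group 1
$fullMatch = $noGroup[0][0];      // the whole URL
$username  = $withGroup[1][0];    // just the name

var_dump($hasGroup1, $fullMatch, $username);
```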

piotrekkr
1

Try the following:

preg_match_all('~https?://(?:www\.)?twitter\.com/#!/([a-z0-9_]+)~im', $html, $matches);

$matches[1] contains the matching user names.

EDIT: For more information on what characters can appear in the user name, see this answer and for more general info see this Twitter Engineering page.
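
As a possible variation that also covers the other requirements in the question (protocol-relative `//` links and nothing after the name), the scheme can be made optional and a negative lookahead can reject names that are followed by more name characters or a slash. This is a sketch under assumptions, not a definitive pattern: it assumes usernames are 1 to 15 letters, digits, or underscores, it treats a trailing `/` as "something after the name", and the sample HTML is invented.

```php
<?php
// Invented sample markup: three pure profile links and one status link
$html = <<<'HTML'
<a href="https://twitter.com/#!/alice">a</a>
<a href="http://www.twitter.com/#!/bob_99">b</a>
<a href="//twitter.com/#!/carol">c</a>
<a href="https://twitter.com/#!/dave/status/123">d</a>
HTML;

// (?:https?:)?        optional scheme, so //twitter.com also matches
// (?:www\.)?          optional www prefix
// ([a-z0-9_]{1,15})   the username (assumed character set and length)
// (?![a-z0-9_/])      must not be followed by more name characters or a slash
$pattern = '~(?:https?:)?//(?:www\.)?twitter\.com/#!/([a-z0-9_]{1,15})(?![a-z0-9_/])~i';

preg_match_all($pattern, $html, $matches);

print_r($matches[1]); // alice, bob_99, carol
```

The lookahead also lists the name characters so the engine cannot backtrack to a shorter name when a slash follows. Query strings after the name are not excluded by this sketch.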

cmbuckley