How to catch filetypes in malformed URLs

Question

How can I extract or match the file type in a URL when many of the URLs and directory names are malformed?

I need a regex that matches only the real file extensions.

http://domain.com/1/image.jpg <- match .jpg
http://domain.com/1/image_1.jpg/.gif <- match first .jpg
http://domain.com/1/image_1.jpg/image.png <- match first .jpg
http://domain.com/1/image_1.jpg <- match .jpg
http://domain.com/1/image.jpg.jpeg <- match only the first .jpg
http://domain.com/1/.jpg <- no match
http://domain.com/.jpg.jpg <- no match
/1/.jpg <- no match
/.jpg.png <- match the first .jpg
/image.jpg.png <- match the first .jpg

I'm trying with this piece of code:

preg_match_all('([a-zA-Z0-9.-_](jpg))i', $url, $matches);

Any ideas?

score 0 · Answer 1 · answered Apr 18 '13 at 02:27

preg_match('(^(http://domain\.com/\w.*?\.jpg))i', $url, $matches);

This matches everything from the start of the string up to the first .jpg. The first character after the domain must be a letter, digit, or _.
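
To see how the pattern behaves on the question's samples, here is a minimal sketch. The wrapper name matchUpToFirstJpg is hypothetical, and the dots in the host name are escaped so that they match literally. Because only the first character after the domain is checked, a URL such as http://domain.com/1/.jpg still matches, and relative paths never match because the pattern is anchored to the host.

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
// Returns the URL up to and including the first ".jpg", or null on no match.
function matchUpToFirstJpg(string $url): ?string
{
    // Parentheses serve as the regex delimiters; the trailing "i" makes it case-insensitive.
    if (preg_match('(^(http://domain\.com/\w.*?\.jpg))i', $url, $matches)) {
        return $matches[1];
    }
    return null;
}

// Sample URLs taken from the question.
var_dump(matchUpToFirstJpg('http://domain.com/1/image_1.jpg/image.png'));
var_dump(matchUpToFirstJpg('http://domain.com/.jpg.jpg'));
var_dump(matchUpToFirstJpg('http://domain.com/1/.jpg'));
```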

Explosion Pills

This works nicely, but I forgot that some URLs are incomplete and only reference directories, like /1/.jpg.gif – greenbandit Apr 18 '13 at 02:33

score 0 · Answer 2 · edited May 23 '17 at 12:04

Parsing URLs with regular expressions is usually a bad idea. See "Getting parts of a URL (Regex)" for a related question. In particular, look at the answer there, then consider that parse_url might be a good start. Take $result['path'] and use a file name parsing API, such as pathinfo, on it to extract the extension.
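
As a rough sketch of that approach for an absolute URL, something like the following could work. The helper name urlExtension is hypothetical. Note that pathinfo reports the extension of the last path component only, so this does not by itself give "the first .jpg" in the question's trickier samples.

```php
<?php
// Hypothetical helper: extension of the last path component, or null if there is none.
function urlExtension(string $url): ?string
{
    // Ask parse_url for the path component only.
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path component, or the URL could not be parsed
    }

    // pathinfo returns an empty string when the last component has no extension.
    $ext = pathinfo($path, PATHINFO_EXTENSION);
    return $ext === '' ? null : strtolower($ext);
}

var_dump(urlExtension('http://domain.com/1/image.jpg')); // string(3) "jpg"
```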

I'm not sure exactly what you are asking for though.

http://domain.com/1/image_1.jpg/.gif <-match first .jpg
http://domain.com/1/image_1.jpg/image.png <-match first .jpg

In both of these cases image_1.jpg is a perfectly valid directory name. You could split the path on '/' and check each segment for "validity".
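
A minimal sketch of the split-and-check idea, under some assumptions: the helper name firstFileExtension is hypothetical, only http/https URLs or bare paths are expected, and a segment counts as "valid" only if it starts with a word character before its first dot. The question's samples are not fully consistent on dot-leading names (/.jpg.png versus http://domain.com/.jpg.jpg); this sketch treats both as no match.

```php
<?php
// Hypothetical helper: returns the first extension of the first path segment
// that looks like "name.ext", or null when no segment qualifies.
function firstFileExtension(string $url): ?string
{
    // Drop the scheme and host if present (assumption: http or https only).
    $path = preg_replace('~^https?://[^/]+~i', '', $url);

    foreach (explode('/', $path) as $segment) {
        // A segment must start with a word character, so ".jpg" and ".jpg.png" are skipped.
        // The capture stops at the next dot, so "image.jpg.jpeg" yields "jpg".
        if (preg_match('~^\w[\w-]*\.([a-z0-9]+)~i', $segment, $m)) {
            return strtolower($m[1]);
        }
    }
    return null;
}

var_dump(firstFileExtension('http://domain.com/1/image_1.jpg/image.png')); // string(3) "jpg"
var_dump(firstFileExtension('/1/.jpg'));                                   // NULL
```

Because the host is stripped with a plain string operation rather than parse_url, the same function handles the relative paths from the question.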

Edit: I just noticed that you need this to work with relative URLs as well. parse_url does not work well in that case.
