0

I have to get the hostname & pathname from an URL using regular expression.

URL can be in the form of

a) http://test.com/mary/archive/project.aspx --> I want http::/test.com/mary
b) http://www.testmary.com/company --> I want http://www.testmary.com/
c) http://marytest.com/b/about/--> I want http://marytest.com/
d) http://mary.test.com/b/mary/project.aspx --> I want http://mary.test.com/b/mary/

Basically I am looking get records that have url "mary" and get only hostname or pathname depending on where "mary" is.

Any help is appreciated.

Thanks R

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
user2726975
  • 1,285
  • 3
  • 17
  • 26

1 Answers1

2

You don't need CLR and RegEx for this.

DECLARE @x TABLE(id INT, url VARCHAR(2048));

INSERT @x VALUES
(1,'http://test.com/mary/archive/project.aspx'), 
    --> I want http::/test.com/mary
(2,'http://www.testmary.com/company'), 
    --> I want http://www.testmary.com/
(3,'http://marytest.com/b/about/'),
    --> I want http://marytest.com/
(4,'http://mary.test.com/b/mary/project.aspx'),
    --> I want http://mary.test.com/b/mary/
(5,'mary.test.com');
    --> I want mary.test.com

SELECT ID, 
  [output] = SUBSTRING(url, 1, LEN(url) - CHARINDEX('yram', REVERSE(url)) 
    + COALESCE(NULLIF(CHARINDEX('/', SUBSTRING(url,  LEN(url) - CHARINDEX('yram',  
      REVERSE(url)) + 1, 2048)),0),2048))
FROM @x
WHERE url LIKE '%mary%';

Results:

ID  output
--  --------------------------------
1   http://test.com/mary/
2   http://www.testmary.com/
3   http://marytest.com/
4   http://mary.test.com/b/mary/
5   mary.test.com

The only thing I didn't understand is why the first row in the output should be missing a trailing slash, while the other rows include it. In my query this trailing slash is included in all rows. If it shouldn't be included only on that one row, you'll need to explain why.

I do recommend getting a better handle on data cleansing, though. Why would you allow some URLs without the http:// prefix?

Aaron Bertrand
  • 272,866
  • 37
  • 466
  • 490
  • @user2726975 do you have an edge case you did not include in the question? Why didn't you include it in the question? – Aaron Bertrand Aug 29 '13 at 14:03
  • I have a table with millions of records and I am trying to get all the corner cases so everything gets covered. When I ran it today, this one did not work.. – user2726975 Aug 29 '13 at 14:05
  • @user2726975 Try now. Please try to come up with the whole problem from the start, instead of changing the parameters on people after they've submitted an answer. – Aaron Bertrand Aug 29 '13 at 14:08
  • Works like magic:-). Iam a newbie..I had tried to solve on my own before posting the question..I saw you solved in less than 5 mins. How did you approach the problem – user2726975 Aug 29 '13 at 14:27
  • @user2726975 just a lot of experience with bad database design, data cleansing, etc. I started on SQL Server 6.5 many, many years ago. – Aaron Bertrand Aug 29 '13 at 14:47