I have data in a CSV file that contains the following column:

ARTICLE_URL
http://twitter.com/aviryadsh/statuses/528219883872337920
http://www.ibtimes.co.in/2014

I want to create another column next to this one that holds only the web address, such as twitter.com, team-bhp.com, ibtimes.co.in, broadbandforum.co.

I have tried

text$ne=str_extract(Brand$ARTICLE_URL, '\\w+(.com)')

but this returns only the URLs that end in .com. How can I fetch all the others as well?

JJJ
  • You could either use a complex regex, or two simple string replacements. The simple string replacements would look like this: `tmp <- str_replace(Brand$ARTICLE_URL, "http://(www.)?", ""); text$ne <- str_replace(tmp, "/.*", "")` – tblznbits Dec 11 '15 at 19:25
  • Thanks Marc for your reply. The problem is that in this particular column some cells contain http://, some contain https://, and some start with www. only. Is there any way to give an "or" condition here that covers all possible combinations, or any other way to do this? Please provide your valuable inputs. – Anurag Sharma Dec 13 '15 at 16:00
  • `str_replace` and `str_replace_all` can take regular expressions for the pattern to look for. So we can just slightly change the first part of the code: `str_replace_all(Brand$ARTICLE_URL, "https://|http://|www.", "")`. That should remove everything from the beginning of your URLs. – tblznbits Dec 13 '15 at 22:55

1 Answer


I'd recommend using string replacement as opposed to string extraction in this instance. It's possible to do with string extraction, but the regular expression is kind of messy and not as readable as a two-step string replacement method. Here's how I'd do it:

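# Example URLs covering the http, https and www variants
urls <- c("http://twitter.com/aviryadsh/statuses/528219883872337920", "http://www.ibtimes.co.in/2014", "https://www.ibtimes.co.in/2014")
# Step 1: remove the scheme (http:// or https://) and a literal "www."
tmp <- stringr::str_replace_all(urls, "https?://|www\\.", "")
# Step 2: remove everything from the first "/" onwards
domains <- stringr::str_replace_all(tmp, "/.*", "")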

And then looking at our output:

domains
# [1] "twitter.com"   "ibtimes.co.in" "ibtimes.co.in"
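
A slightly stricter variant of the same two-step idea, as a rough sketch rather than a definitive solution: it anchors the prefix pattern to the start of the string and escapes the dot, so that a "www" occurring elsewhere in a URL is left alone. It uses base R's `sub` so it runs without extra packages; `stringr::str_replace` should accept the same patterns. The third URL and the names `no_prefix` and `domains` are assumptions made for illustration.

```r
# Hypothetical sample input: http, https + www, and a bare www prefix.
urls <- c(
  "http://twitter.com/aviryadsh/statuses/528219883872337920",
  "https://www.ibtimes.co.in/2014",
  "www.team-bhp.com/forum"
)

# Strip an optional scheme and an optional literal "www.",
# but only when they appear at the very start of the string.
no_prefix <- sub("^(https?://)?(www\\.)?", "", urls)

# Drop everything from the first "/" onwards, leaving the host name.
domains <- sub("/.*$", "", no_prefix)

print(domains)
```

To add the result as a new column, something like `Brand$domain <- domains` computed from `Brand$ARTICLE_URL` should work, assuming `Brand` is the data frame read from the CSV.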
tblznbits
  • I want to express my sincere thanks to "brittenb" and "Marc B" for such great help. It is working fine and is exactly what I wanted to do. There is still one thing I would like to understand: the role of the "?" sign in "https?://|www.". Can you please help me understand it? – Anurag Sharma Dec 18 '15 at 16:48
  • Sure thing! The question mark signifies that there can be 0 or 1 of the preceding character. So, in this instance, it means that there can be an "s" or no "s" in the http portion. Basically, it allows for both http and https URLs. Does that make sense? – tblznbits Dec 18 '15 at 17:59
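
The effect of the `?` quantifier described above can be checked with a small sketch. The `schemes` vector and the `matches` name are made up for illustration, and the pattern is anchored with `^` and `$` only so that each test string has to match in full.

```r
# Hypothetical test strings: two valid schemes and two that should not match.
schemes <- c("http://", "https://", "httpss://", "ftp://")

# "s?" allows zero or one "s" after "http", so exactly the first two match.
matches <- grepl("^https?://$", schemes)

print(matches)
# [1]  TRUE  TRUE FALSE FALSE
```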