Extract string elements that possibly appear multiple times, or not at all

Question

Start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example" and "sample" in the example below.

urls <- c("http://grand.test.com/", "https://example.com/", 
          "http://.big.time.sample.com/")

Remove the ".com" and whatever might follow it and keep the first part:

urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=T), "[", 1) 

urls
# [1] "http://grand.test"    "https://example"      "http://.big.time.sample"

My next step is to remove the http:// and https:// portions with a chained gsub() call:

urls <- gsub("^http://", "",  gsub("^https://", "", urls))

urls
# [1] "grand.test"       "example"          ".big.time.sample"

But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the call below returns NA for the second string, since the "example" string has no period remaining. Or if I retain only the first part, I lose a company name.

urls  <- sapply(strsplit(urls, split = "\\."), "[", 2)
urls
# [1] "test" NA     "big"

urls  <- sapply(strsplit(urls, split = "\\."), "[", 1)
urls
# [1] "grand"   "example" ""

Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit if there is more than one period? Also note that there may be two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. But this didn't work:

strsplit(urls, split="(?=\\.)", perl=T)

Thank you for any suggestions.
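One way to sidestep counting periods, sketched here under the assumption that the company name is always the last dot-separated piece once ".com" and the scheme have been stripped: split on literal periods and keep the final element of each result. The names `stripped`, `pieces` and `company` are introduced only for this illustration.

```r
# Intermediate result after removing ".com" and the http(s):// prefix
stripped <- c("grand.test", "example", ".big.time.sample")

# Split on literal periods, then keep the last piece of each string,
# however many periods precede it (including none at all)
pieces <- strsplit(stripped, split = ".", fixed = TRUE)
company <- vapply(pieces, function(p) p[length(p)], character(1))

company
# [1] "test"    "example" "sample"
```

Because the last piece is selected by position from the end, no ifelse() on the number of periods is needed for these three strings.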

I am new at this. I like both answers, the simplicity of the user20650 but the wrapped in http(s) step in agstudy. Am I supposed to pick one and one only to click on the check mark for an answer? Should I wait for a while longer? — lawyeR, Jun 19 '14 at 22:22
Might be good to wait as there are a few regex users on site and you might get a simpler answer. [But yes click the arrow next to the answer you like best but you can upvote as you want] — user20650, Jun 19 '14 at 22:31

score 3 · Answer 1 · answered Jun 19 '14 at 22:12

I think there should be a simpler way, but this works:

sub('.*[.]','',sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls))
# [1] "test"    "example" "sample"

where "urls" is your first URL vector.

agstudy

score 3 · Answer 2 · answered Jun 19 '14 at 22:14

I think there should be a way to extract just the word before '.com', but maybe this gives you an idea:

sub(".com", "", regmatches(urls, gregexpr("(\\w+).com", urls)))

user20650

Josh O'Brien · Accepted Answer · 2014-06-20T13:32:25.407

3

Here's an approach that may be easier to understand and to generalize than some of the others:

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

It works by breaking each string up into three capture groups that together match the entire string, and substituting back in just capture group (2), the one that you want.

pat = "(.*?)(\\w+)(\\.com.*)"
#        ^    ^       ^
#        |    |       |
#       (1)  (2)     (3)

Edit (adding explanation of ? modifier):

Do note that capture group (1) needs to include the "ungreedy" or "minimal" quantifier ? (also sometimes called "lazy" or "reluctant"). It essentially tells the regex engine to match as many characters as it can ... without using up any that could otherwise become a part of the following capture group (2).

Without a trailing ?, repetition quantifiers are greedy by default. In this case a greedy capture group, (.*), matches any number of characters of any type, so it would "eat up" as much of the string as it could and give back only what the other two groups strictly need. Capture group (2) would be left with just the single character before ".com", which is not the behavior we want!
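The difference between the lazy and the greedy first group can be checked directly. The following is a minimal sketch rather than part of the original answer: the names `lazy_pat` and `greedy_pat` are made up for the comparison, and `perl = TRUE` is an assumption added so that the backtracking behaviour is that of the PCRE engine.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

lazy_pat   <- "(.*?)(\\w+)(\\.com.*)"  # group (1) takes as little as possible
greedy_pat <- "(.*)(\\w+)(\\.com.*)"   # group (1) takes as much as possible

# Keep only capture group (2) in each case
lazy   <- gsub(lazy_pat,   "\\2", urls, perl = TRUE)
greedy <- gsub(greedy_pat, "\\2", urls, perl = TRUE)

lazy
# [1] "test"    "example" "sample"
greedy
# [1] "t" "e" "e"
```

With the greedy version the engine backtracks only as far as it must for the rest of the pattern to match, so group (2) keeps just the last word character before ".com".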

edited Jun 20 '14 at 13:32

answered Jun 20 '14 at 01:12

Josh O'Brien

Nice one. Very nice. – Rich Scriven Jun 20 '14 at 01:15
Great Josh, I was trying and failing to get this. Could you give an wee explanation to how the first term is specified please `(.*?)`. Thanks – user20650 Jun 20 '14 at 08:18
@user20650 -- Done! Even if the added explanation doesn't get you all the way there on its own, it'll at least give you the search terms ("greedy" and "minimal") that should let you track this the rest of the way down. – Josh O'Brien Jun 20 '14 at 13:16
Thanks a lot, its good of you. Nice explanation and thanks also for the keywords. – user20650 Jun 20 '14 at 13:52

score 2 · Answer 4 · answered Jun 19 '14 at 22:37

Using strsplit might be worth a try too:

sapply(strsplit(urls,"/|\\."),function(x) tail(x,2)[1])
#[1] "test"    "example" "sample"

thelatemail

score 2 · Answer 5 · answered Jun 20 '14 at 17:39

This was a terrific example. Useful answers and some explanations generated very quickly.

Answering my own question does not describe what I am doing. I wanted to thank the contributors, give something back that might help others who look at this question, and explain why I chose one answer. A comment didn't seem right nor is it long enough.

The following assembles each answer along with my (modest, and glad to be corrected) explanations, several of which incorporate explanations from answerers. Poring over the answers taught me a lot, and helped me make the choice of a preferred answer. Others used non base-R functions, one a created function which may well be wonderful but is not as readily available. I liked the 2nd answer because it used only the sub function, but I gave the laurel wreath to the fifth one for its elegant use of two techniques that I was very pleased to learn. Thank you all.

ANS 1

sub(".com", "", regmatches(urls, gregexpr("(\\w+).com", urls)))

gregexpr finds every run of one or more word characters, written “\\w+”, that is followed by “.com”, and returns a list of match positions with attributes such as match.length and useBytes

regmatches takes what gregexpr found and returns just the identified strings

sub removes the first “.com” from each string [I am not sure why gsub would not have worked but perhaps the global sub is a risk when you just want the first instance]
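One detail worth noting about ANS 1: an unescaped period in a regex matches any character, not only a literal dot. The sketch below is a hedged variant, not the answerer's code. It escapes the period, uses regexpr() instead of gregexpr() on the assumption that each URL contains exactly one ".com", and shows that gsub gives the same result here. The names `hits` and `company` are illustrative.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# regexpr() finds the first match per string, so regmatches() returns a
# plain character vector (strings without a match would be dropped)
hits <- regmatches(urls, regexpr("\\w+\\.com", urls))
hits
# [1] "test.com"    "example.com" "sample.com"

# Escape the period so that only a literal ".com" at the end is removed
company <- sub("\\.com$", "", hits)
company
# [1] "test"    "example" "sample"

# gsub gives the same answer here, since each string has a single ".com"
identical(gsub("\\.com", "", hits), company)
# [1] TRUE
```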

ANS 2

sub('.*[.]','', sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls))

the inner sub handles both “http:” and “https:” by the question mark special character ?, which allows the “s” to be optional

the inner sub function then handles one or more “/” with a character class containing only a forward slash, extended by the “+”, i.e., the two slashes in http://

the next part of the inner sub regex, reading to the right, allows one optional leading period with “[.]?” and then captures any number of characters with “(.*)”

next, the period before “com” is put in brackets instead of escaping it

then “com” followed by a forward slash [I am not sure I understand that part]

the “\\1” replacement keeps only the first capture group, i.e. the part of each string that “(.*)” matched

all the above returns this:

[1] "grand.test"      "example"         "big.time.sample"

the left-most sub function takes the inner sub function’s result and removes all characters with “.*” preceding the bracketed period
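The two steps of ANS 2 can be run separately to see what each sub call contributes. This is only a sketch of the walk-through above; the names `inner` and `outer` are introduced for illustration.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: match the scheme, an optional leading period, the captured middle
# part, and the trailing ".com/"; keep only the captured middle part
inner <- sub("https?:[/]+[.]?(.*)[.]com[/]", "\\1", urls)
inner
# [1] "grand.test"      "example"         "big.time.sample"

# Step 2: drop everything up to and including the last remaining period
outer <- sub(".*[.]", "", inner)
outer
# [1] "test"    "example" "sample"
```

In step 2 the string "example" contains no period, so the pattern does not match and sub returns it unchanged.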

ANS 3

sapply(strsplit(urls, "/|\\."), function(x) tail(x,2)[1])

First, strsplit separates each string at every forward slash or period, with the two alternatives joined by the vertical pipe |, and returns a list

[[1]]
[1] "http:" ""      "grand" "test"  "com"  

[[2]]
[1] "https:"  ""        "example" "com"    

[[3]]
[1] "http:"  ""       ""       "big"    "time"   "sample" "com"

Next, an anonymous function finds the last two elements in each string, with the tail function, and picks the first, thus neatly eliminating each “.com”

Wrapping those two steps with the sapply function vectorizes the operation of the anonymous function to all three strings
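The same idea can be written with vapply, which additionally checks that every result is a single string. The sketch below also shows a limitation that follows from picking the second-to-last piece: it assumes nothing comes after ".com/". The URL in `with_path` is hypothetical and not part of the question's data.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Split at "/" or ".", then take the second-to-last piece of each URL
parts <- strsplit(urls, "/|\\.")
company <- vapply(parts, function(x) tail(x, 2)[1], character(1))
company
# [1] "test"    "example" "sample"

# Hypothetical URL with a path after the domain: the second-to-last
# piece is now "com", so the approach no longer returns the company name
with_path <- "http://grand.test.com/about"
path_result <- vapply(strsplit(with_path, "/|\\."),
                      function(x) tail(x, 2)[1], character(1))
path_result
# [1] "com"
```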

ANS 4

library(stringr)
word(basename(urls), start = -2, sep = "\\.")

the basename function returns

[1] "grand.test.com"       "example.com"          ".big.time.sample.com"

From the help to basename() we learn that “basename removes all of the path up to and including the last path separator (if any)”. This neatly removes the http:// and https:// elements.

Then, the word() function takes the second "word" from the end by using a negative index (start = -2), given that the separator is . (a period) (sep = "\\.").

ANS 5

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

The regex assigned to the object “pat” breaks each string into three capture groups that together match the entire string

with the gsub function, searching for “pat” strings, it substitutes back in just capture group (2), the desired part.

Note two techniques here. First, create an object with your expression, and then use it in the regex call. This keeps the code cleaner and easier to read, as demonstrated by the line with the gsub call. Second, note the use of capture groups, which are components of a regex enclosed in parentheses. They can be referred to later, as with the “\\2” in this example.

pat = "(.*?)(\\w+)(\\.com.*)"
#        ^    ^       ^
#        |    |       |
#       (1)  (2)     (3)

ANS 6

regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T))

This may be a good solution, but it relies on a function, regcapturedmatches, that is not in base R or in a package such as qdap, stringi or stringr

Mr. Flick makes the good point that “if you want just a simple vectors for the return value, you can unlist() the results.”

He explains that “The idea for the pattern is to grab everything that's not a dot or a "/" immediately before the ".com".” That is the expression in brackets, with the + sign to indicate it can be multiple.

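perl = TRUE selects Perl-compatible regular expressions, which are needed for features such as lookarounds, though this particular pattern does not seem to require them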
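For readers without the regcapturedmatches function, a similar result can be obtained in base R. The following is a sketch of an alternative and not Mr. Flick's method: it swaps in regexec(), which reports capture groups alongside the full match, and the names `m` and `company` are illustrative.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# regexec() records the full match and each capture group;
# regmatches() then returns one character vector per URL
m <- regmatches(urls, regexec("([^./]+)\\.com", urls))
m[[1]]
# [1] "test.com" "test"

# Element 2 of each vector is the first capture group
company <- vapply(m, function(x) x[2], character(1))
company
# [1] "test"    "example" "sample"
```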

Rich Scriven · Answer 6 · 2014-06-30T18:39:21.783

1

You could use stringr::word(), along with basename().

basename() is handy when working with URLs.

library(stringr)
word(basename(urls), start = -2, sep = "\\.")
# [1] "test"    "example" "sample"

basename(urls) gives

[1] "grand.test.com"       "example.com"          ".big.time.sample.com"

Then, in the word() function, we take the second word from the end ( start = -2 ), given that the separator is . ( sep = "\\." ).
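If stringr is not available, the word() step can be approximated in base R. This is a sketch of an equivalent under that assumption, not part of the original answer; `hosts`, `parts` and `company` are names chosen for illustration.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# basename() drops the scheme and the trailing slash, as shown above
hosts <- basename(urls)

# Split on literal periods and take the second piece from the end,
# mirroring start = -2 in word()
parts <- strsplit(hosts, ".", fixed = TRUE)
company <- vapply(parts, function(p) p[length(p) - 1], character(1))
company
# [1] "test"    "example" "sample"
```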

edited Jun 30 '14 at 18:39

answered Jun 19 '14 at 22:37

Rich Scriven

Could you elaborate on your answer some more? 6k; by now you should know code-only answers aren't acceptable. – Qix - MONICA WAS MISTREATED Jun 19 '14 at 23:04
nice answer. didnt know either of these functions – user20650 Jun 20 '14 at 00:39

score 1 · Answer 7 · answered Jun 19 '14 at 22:46

Because you can never have enough regular expression options, here's one using a regcapturedmatches.R function

regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T))

If you want just a simple vector for the return value, you can unlist() the results. The idea for the pattern is to grab everything that's not a dot or a "/" immediately before the ".com".
