This was a terrific example. Useful answers and some explanations generated very quickly.
"Answering my own question" does not really describe what I am doing. I wanted to thank the contributors, give something back that might help others who look at this question, and explain why I chose one answer. A comment didn't seem right, nor would it have been long enough.
The following assembles each answer along with my (modest, and glad to be corrected) explanations, several of which incorporate explanations from the answerers. Poring over the answers taught me a lot and helped me choose a preferred answer. Some answers used functions outside base R, and one relied on a custom function, which may well be wonderful but is not as readily available. I liked the 2nd answer because it used only the sub function, but I gave the laurel wreath to the fifth one for its elegant use of two techniques that I was very pleased to learn. Thank you all.
ANS 1
sub(".com", "", regmatches(urls, gregexpr("(\\w+).com", urls)))
gregexpr finds each run of one or more word characters, matched by “\\w+”, that comes just before “.com”, and returns a list of match positions carrying the attributes match.length and useBytes
regmatches takes what gregexpr found and returns just the matched strings
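sub removes “.com” from each matched string [gsub would give the same result here, because each match contains only one “.com”; sub replaces just the first occurrence, gsub every occurrence. Note also that the unescaped period in “.com” matches any character, not only a literal dot]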
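To see the three steps one at a time, here is a minimal sketch. The urls vector is my assumption, reconstructed from the outputs quoted elsewhere in this post, and I added unlist() so the result is a plain character vector; the answer itself passes the list straight to sub.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: gregexpr() returns a list with one element per URL,
# each holding the match positions plus attributes
pos <- gregexpr("(\\w+).com", urls)
names(attributes(pos[[1]]))   # includes "match.length" and "useBytes"

# Step 2: regmatches() turns those positions into the matched text
hits <- regmatches(urls, pos)

# Step 3: strip ".com" from each match; unlist() flattens the list
res1 <- sub(".com", "", unlist(hits))
res1
```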
ANS 2
sub('.*[.]','', sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls))
the inner sub handles both “http:” and “https:” through the question mark ?, which makes the preceding “s” optional
the inner sub then matches one or more “/” with a character class containing only a forward slash, extended by “+”, which covers the two slashes in http://
next, reading to the right, “[.]?” matches an optional literal period right after the slashes, and “(.*)” captures any number of characters that follow
next, the period before “com” is put in brackets, “[.]”, instead of being escaped with backslashes
then comes “com” followed by a forward slash, “[/]”, which consumes the slash after “.com” so that it is removed along with the rest of the match
the replacement “\\1” keeps only the text captured by the first (here, the only) pair of parentheses
all the above returns this:
[1] "grand.test" "example" "big.time.sample"
the outer (left-most) sub takes the inner sub’s result and, with “.*[.]”, removes everything up to and including the last period; a string with no period, such as "example", is left unchanged
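Splitting the nested call into two named steps makes the intermediate result visible. This is only a sketch; the urls vector is my assumption, reconstructed from the outputs quoted in this post, and the names inner and outer are mine.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Inner sub: drop the scheme, the slashes, an optional leading period,
# and the trailing ".com/", keeping only capture group 1
inner <- sub("https?:[/]+[.]?(.*)[.]com[/]", "\\1", urls)

# Outer sub: greedy ".*" followed by a literal period removes
# everything up to and including the last period
outer <- sub(".*[.]", "", inner)
outer
```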
ANS 3
sapply(strsplit(urls, "/|\\."), function(x) tail(x,2)[1])
First, strsplit splits each string at every forward slash or period (the vertical bar | means “or” in the regex) and returns a list
[[1]]
[1] "http:" "" "grand" "test" "com"
[[2]]
[1] "https:" "" "example" "com"
[[3]]
[1] "http:" "" "" "big" "time" "sample" "com"
Next, an anonymous function uses the tail function to take the last two elements of each vector and picks the first of them, thus neatly eliminating each “com”
Wrapping those two steps in the sapply function applies the anonymous function to each of the three list elements and simplifies the result to a character vector
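The same pipeline can be unrolled so each stage can be inspected. A minimal sketch, assuming the urls vector below (reconstructed from the strsplit output shown above); the intermediate names are mine.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Stage 1: one character vector of pieces per URL
parts <- strsplit(urls, "/|\\.")
lengths(parts)

# Stage 2: the last two pieces of each vector, e.g. "test" and "com"
last_two <- lapply(parts, tail, 2)

# Stage 3: the full one-liner from the answer, for comparison
res3 <- sapply(parts, function(x) tail(x, 2)[1])
res3
```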
ANS 4
library(stringr)
word(basename(urls), start = -2, sep = "\\.")
the basename function returns
[1] "grand.test.com" "example.com" ".big.time.sample.com"
From the help for basename() we learn that “basename removes all of the path up to and including the last path separator (if any)”. This neatly removes the http:// and https:// elements.
Then, the word() function takes the second "word" from the end by using a negative index (start = -2), given that the separator is a period (sep = "\\.").
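For readers without stringr installed, here is a sketch of the same two steps in base R only. It swaps word() for strsplit() plus indexing, which is my substitution and not the answerer's code; the urls vector is also my assumption, reconstructed from the outputs quoted in this post.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: basename() drops everything up to the last path separator
base <- basename(urls)

# Step 2: split on literal periods and take the second piece from the
# end, which is what word(..., start = -2, sep = "\\.") selects
pieces <- strsplit(base, ".", fixed = TRUE)
res4 <- vapply(pieces, function(x) x[length(x) - 1], character(1))
res4
```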
ANS 5
pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)
The regex assigned to the object “pat” breaks each string into three capture groups that together match the entire string. The gsub function then replaces the whole match with just capture group (2), the desired part.
Note two techniques here. First, store your regular expression in an object and then pass that object to the function; this keeps code cleaner and easier to read, as the line with the gsub call demonstrates. Second, note the use of capture groups, which are components of a regex enclosed in parentheses. They can be referred to later, as with the “\\2” in this example:
pat = "(.*?)(\\w+)(\\.com.*)"
#        ^     ^      ^
#        |     |      |
#       (1)   (2)    (3)
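To check what each capture group actually holds, one can substitute each backreference in turn. This is a sketch under two assumptions of mine: the urls vector is reconstructed from the outputs quoted in this post, and I pass perl = TRUE (the answer does not) so that the lazy “.*?” follows Perl-style matching rules.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

pat <- "(.*?)(\\w+)(\\.com.*)"

# Replace the whole match with one group at a time
g1 <- sub(pat, "\\1", urls, perl = TRUE)  # everything before the target
g2 <- sub(pat, "\\2", urls, perl = TRUE)  # the target word itself
g3 <- sub(pat, "\\3", urls, perl = TRUE)  # ".com" and whatever follows

# The three groups together cover the entire string
stopifnot(identical(paste0(g1, g2, g3), urls))
g2
```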
ANS 6
regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T))
This may be a good solution, but it relies on a function, regcapturedmatches, that is not in base R or in a widely used package such as qdap, stringi or stringr
Mr. Flick makes the good point that “if you want just a simple vectors for the return value, you can unlist() the results.”
He explains that “The idea for the pattern is to grab everything that's not a dot or a "/" immediately before the ".com".” That is the bracketed character class “[^.\\/]”, with the + sign indicating one or more such characters.
perl = TRUE (spelled out rather than T, which can be reassigned) seems to be a good argument for all regular expressions
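If the custom function is not at hand, base R's regexec() combined with regmatches() can also return capture groups. The sketch below is my substitution, not Mr. Flick's code: it simplifies the character class to “[^./]”, uses the default regex engine, and assumes the urls vector reconstructed from the outputs quoted in this post.

```r
# Assumed input, reconstructed from the outputs quoted in this post
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# regexec() records positions of the full match and of each group
m <- regexec("([^./]+)\\.com", urls)

# For each URL: element 1 is the full match, element 2 is group 1
caps <- regmatches(urls, m)

# Keep only the captured group as a plain character vector
res6 <- vapply(caps, function(x) x[2], character(1))
res6
```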