
I have the following pattern in my column:

xyz@gmail.com
abc@hotmail.com

Now I want to extract the text after @ and before the first ., i.e. gmail and hotmail. I am able to extract the text after @ with the following code:

sub(".*@", "", email)

How can I modify the above to fit my use case?

Neil

4 Answers


You:

  1. really need to read Section 3 of RFC 3696 (TLDR: the @ can appear in multiple places)
  2. seem not to have considered that an email can be "someone@department.example.com" or "someone.else@yet.another.department.example.com" (i.e. naively assuming the host is always a bare domain could come back to bite you at some point in this analysis)
  3. should be aware that if you're really looking for the email "domain name" then you also have to consider what really constitutes a domain name and a proper suffix.
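To make the first two points concrete, here is a minimal base R sketch. The addresses are made up for illustration (they are not from the question's data), and the variable names `host` and `first_label` are my own.

```r
# Illustrative addresses only; the third has a quoted "@" in the local part.
emails <- c("xyz@gmail.com",
            "someone@department.example.com",
            "\"odd@local\"@example.co.uk")

# Greedy ".*@" consumes everything up to the LAST "@",
# so the quoted "@" in the local part is skipped as well.
host <- sub(".*@", "", emails)
host

# Keeping only the text before the first dot loses information:
# "department" is a subdomain here, not the registered domain.
first_label <- sub("\\..*", "", host)
first_label
```

The first step is reasonably safe; the second is where the "simple address" assumption creeps in.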

So — unless you know for sure that you have and always will have simple email addresses — might I suggest:

library(stringi)
library(urltools)
library(dplyr)
library(purrr)

emails <- c("xyz@gmail.com", "abc@hotmail.com",
            "someone@department.example.com",
            "someone.else@yet.another.department.com",
            "some.brit@froodyorg.co.uk")

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_df(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract()
  })
##                         host    subdomain      domain suffix
## 1                  gmail.com         <NA>       gmail    com
## 2                hotmail.com         <NA>     hotmail    com
## 3     department.example.com   department     example    com
## 4 yet.another.department.com  yet.another  department    com
## 5            froodyorg.co.uk         <NA>   froodyorg  co.uk

Note the proper splitting of subdomain, domain & suffix, especially for the last one.

Knowing this, we can then change the code to:

stri_locate_last_fixed(emails, "@")[,"end"] %>%
  map2_chr(emails, function(x, y) {
    substr(y, x+1, nchar(y)) %>%
      suffix_extract() %>%
      mutate(full_domain=ifelse(is.na(subdomain), domain, sprintf("%s.%s", subdomain, domain))) %>%
      select(full_domain) %>%
      flatten_chr()
  })
## [1] "gmail"                   "hotmail"               
## [3] "department.example"      "yet.another.department"
## [5] "froodyorg"
hrbrmstr

We can use gsub to strip everything up to the @ and everything from the first dot onward:

gsub(".*@|\\..*", "", email)
#[1] "gmail"   "hotmail"
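A short sketch of how the two branches of that pattern behave, in case it helps. The third address and the name `domain_label` are added here for illustration and are not part of the answer above.

```r
email <- c("xyz@gmail.com", "abc@hotmail.com",
           "someone@department.example.com")

# Branch 1 (".*@")   deletes everything up to and including the last "@".
# Branch 2 ("\\..*") deletes from the first remaining "." to the end.
domain_label <- gsub(".*@|\\..*", "", email)
domain_label
# With a subdomain, only the first label after "@" survives ("department").
```

This is fine for addresses like the ones in the question, but it cannot tell a subdomain from the registered domain.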
akrun

You can use:

emails <- c("xyz@gmail.com", "abc@hotmail.com")
emails_new <- gsub(".*@(.+)$", "\\1", emails)  # leading ".*" drops the local part too
emails_new
# [1] "gmail.com"   "hotmail.com"

See a demo on ideone.com.
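If only the label between the @ and the first dot is wanted (as the question asks), the same capture-group idea can be narrowed. This is a sketch under the assumption that the addresses are simple ones like those in the question; the name `first_label` is hypothetical.

```r
emails <- c("xyz@gmail.com", "abc@hotmail.com")

# Capture one or more non-dot characters right after the last "@",
# then discard the rest of the string.
first_label <- sub("^.*@([^.]+)\\..*$", "\\1", emails)
first_label
```

An address that does not match the pattern at all is returned unchanged by `sub`, so it may be worth checking the input first.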

Jan

This is @hrbrmstr's function with stringr:

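library(magrittr)  # provides the %>% pipe

# Take the end position of the last "@" in each address
stringr::str_locate_all(email, "@") %>%
  purrr::map_int(~ .x[nrow(.x), "end"]) %>%
  purrr::map2_df(email, ~ {
    stringr::str_sub(.y, .x + 1, nchar(.y)) %>%
      urltools::suffix_extract()
  })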
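For comparison, the "locate the last @" step can also be sketched in base R without any packages. This only covers the host extraction, not the suffix splitting done by `urltools`. The example addresses and the names `at_pos` and `host` are assumptions for illustration.

```r
email <- c("xyz@gmail.com", "\"a@b\"@example.com")

# gregexpr() returns all match positions per string; max() picks the last "@".
at_pos <- vapply(gregexpr("@", email, fixed = TRUE),
                 function(m) max(m), integer(1))

# Everything after the last "@" is the host part.
host <- substring(email, at_pos + 1)
host
```

Note that `gregexpr` reports -1 when there is no match, so strings without an @ would need separate handling.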
xaviescacs