Extract alphanumeric words and words with more than 1 uppercase using R

Question

I am new to R programming and want to try extracting alphanumeric words AND words containing more than 1 uppercase.

Below is an example of the string and my desired output for it.

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS1"
    #[4] "H123a1 ab-cd1"  "al1234 al5678"

I have come across 2 separate solutions so far on Stack Overflow:

- Extracting all words that contain a number. This is not helpful to me because the column I'm applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B".
- Identifying strings that have more than one uppercase letter. Pierre Lafortune has provided the following solution (how-to-count-capslock-in-string-using-r):

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b")

His code counts, for each string, the words made up entirely of two or more uppercase letters. I want to extract the words that contain more than 1 uppercase letter, and the alphanumeric words as well.

Forgive me if my question or research is not comprehensive enough. I will post my researched solution for extracting all words containing a number in 12 hours, when I have access to my work station, which contains R and the dataset.

If you are wondering about the randomness of the strings, I translated the column of the dataset from German to English using the Google API in R. The next step is to extract equipment names, and that extraction is where I am stuck.
– hersh476, Aug 04 '17 at 02:28
Try `str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")` — Wiktor Stribiżew, Aug 04 '17 at 07:19

score 2 · Answer 1 · answered Aug 04 '17 at 03:56

This works:

    library(stringr)

    # split the strings into a vector with one word per element
    y <- unlist(str_split(x, ' '))

    # flag words with at least 2 uppercase letters
    uppers <- str_count(y, '[A-Z]') > 1

    # flag words with at least 1 letter
    alphas <- str_detect(y, '[:alpha:]')

    # flag words with at least 1 digit
    nums <- str_detect(y, '[:digit:]')

    # keep words that have 2+ uppercase letters OR (a letter AND a digit)
    y[uppers | (alphas & nums)]

    #  [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"
    #  [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"
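
The vector above pools the words from all five strings, so the link to the original rows is lost. A minimal sketch that runs the same three tests on one string at a time and pastes the surviving words back together; the helper name `extract_equipment` is hypothetical and not part of the answer:

```r
library(stringr)

x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: filter the words of ONE string with the same tests
extract_equipment <- function(s) {
  words  <- str_split(s, " ")[[1]]
  uppers <- str_count(words, "[A-Z]") > 1    # 2+ uppercase letters
  alphas <- str_detect(words, "[:alpha:]")   # at least 1 letter
  nums   <- str_detect(words, "[:digit:]")   # at least 1 digit
  paste(words[uppers | (alphas & nums)], collapse = " ")
}

# One result per input string, so it can be stored as a new column
out <- vapply(x, extract_equipment, character(1), USE.NAMES = FALSE)
out
```

Because `vapply` returns exactly one string per element of `x`, `out` has the same length as the input and lines up with the original column.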

Wiktor Stribiżew · Accepted Answer · 2017-08-04T13:47:33.283

A single regex solution will also work:

    res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
    unlist(res)
    #  [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"
    #  [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

This will also work with regmatches in base R using the PCRE regex engine:

    res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
    unlist(res2)
    #  [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"
    #  [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

Why does it work?

(?<!\\S) - matches a position at the start of the string or right after a whitespace char
(?: - start of a non-capturing group that contains two alternative patterns:
- (?=\\S*\\p{L})(?=\\S*\\d)\\S+ - a chunk with at least 1 letter and at least 1 digit, in any order:
  - (?=\\S*\\p{L}) - requires a letter after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\p{L}]*)
  - (?=\\S*\\d) - requires a digit after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\d]*)
  - \\S+ - matches 1 or more non-whitespace chars
- | - or
- (?:\\S*\\p{Lu}){2}\\S* - a chunk with at least 2 uppercase letters:
  - (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
  - \\S* - 0+ non-whitespace chars
) - end of the non-capturing group.
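
The performance hints in the list above can be applied all at once. The sketch below builds that variant and compares it with the original pattern on the sample vector in base R; the names `slow`, `fast`, and `same` are only illustrative, and no timing claim is made here:

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Pattern exactly as given in the answer
slow <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# Same pattern with each hinted \\S* replaced by its negated character class
fast <- paste0("(?<!\\S)(?:",
               "(?=[^\\s\\p{L}]*\\p{L})",     # a letter somewhere in the chunk
               "(?=[^\\s\\d]*\\d)\\S+",       # a digit somewhere in the chunk
               "|",
               "(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*",  # two uppercase letters
               ")")

res_slow <- regmatches(x, gregexpr(slow, x, perl = TRUE))
res_fast <- regmatches(x, gregexpr(fast, x, perl = TRUE))

# Check that both variants extract the same chunks from every string
same <- identical(res_slow, res_fast)
same
```

The negated classes stop each scan at the first character of the wanted kind, so the engine has less to backtrack over; whether that matters in practice depends on the data, so it is worth measuring on the real column.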

To join the matches that belong to each input string into a single string, you may use

    unlist(lapply(res, function(m) paste(unlist(m), collapse=" ")))

See an online R demo.

Output:

    [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"
    [4] "H123a1 ab-cd1"        "al1234 al5678"
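
The comments under this answer also ask how to keep track of the row each match came from when some rows are blank. A minimal base R sketch, assuming the blank-row input quoted in those comments; the names `pattern`, `long`, and `joined` are illustrative and not part of the answer:

```r
x <- c(" ", " ", " ",
       "123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B")

pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
res <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Long format: one row per match, tagged with the index of its source string.
# lengths(res) is 0 for strings without matches, so those contribute no rows.
long <- data.frame(row   = rep(seq_along(res), lengths(res)),
                   match = unlist(res),
                   stringsAsFactors = FALSE)

# Wide format: one string per input row; rows without matches become ""
joined <- vapply(res, paste, character(1), collapse = " ")
```

Building the index with `rep()` and `lengths()` avoids binding one data frame per row, which is the step that fails when a row has zero matches.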

edited Aug 04 '17 at 13:47

answered Aug 04 '17 at 09:16

Wiktor Stribiżew


Thank you for the detailed comments. I am getting the exact output that you have provided in your answer. While this is useful, can you also provide a solution where the extracted words are then merged back together to match the original string in the column they were extracted from? Currently, you will see that the number of elements increases so the output cannot be added as a new column which corresponds to the old column. Example: "Change channel for al1234 to al5678" --> "al1234" "al5678" --> Can this 2 element output be 1 element? – hersh476 Aug 04 '17 at 13:39
This link seems to have a solution --> https://stackoverflow.com/questions/43240594/unlist-multiple-values-in-dataframe-column-but-keep-track-of-the-row-number?rq=1 Is yours similar? – hersh476 Aug 04 '17 at 13:44
This works ---> https://ideone.com/5s1FAX Please add this to your original answer so anyone using your answer in future has both type of outputs. Thank you – hersh476 Aug 04 '17 at 13:48
Can you please provide a solution similar to the one provided by MrFlick: https://stackoverflow.com/questions/43240594/unlist-multiple-values-in-dataframe-column-but-keep-track-of-the-row-number?noredirect=1&lq=1 for your res or res2 queries so I can keep track of the rows? `res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"); file <- unlist(res); df <- as.data.frame(as.matrix(file),stringsAsFactors=F)` I would like a column in the above "df" that states the row of origin in the string vector (x) passed to "res". – hersh476 Aug 04 '17 at 15:23
The dataset I am applying your answer to contains 9000+ rows (including rows containing no equipment). After extracting the equipment names and unlisting them, there are 10,000+ elements for which I would like to know their row of origin. – hersh476 Aug 04 '17 at 15:29
I tried that exact do.call before you gave your solution but got this error: `Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 0, 1` The error occurs because I forgot to mention that the first 3 rows of the dataset contain no data. `x <- c(" ", " ", " ", "123AB123 Electrical CDe FG123-4 ...", "12/1/17 ABCD How are you today A123B")` If you run all the queries with the above vector as input, the do.call will produce an error – hersh476 Aug 05 '17 at 22:43
Yeah, and did you use `str_extract_all`? I am on a mobile now and I can't help more with it, I will be back home in some hours. – Wiktor Stribiżew Aug 06 '17 at 09:07
`x <- c(" ", " ", " ", "123AB123 Electrical CDe FG123-4 ...", "12/1/17 ABCD How are you today A123B", "20.9.12 Eat / Drink XY1234 for PQRS1", "Going home H123a1 ab-cd1", "Change channel for al1234 to al5678"); x <- as.data.frame(as.matrix(x),stringsAsFactors=F); res <- str_extract_all(x$V1, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"); df1 <- unlist(lapply(res, function(c) paste(unlist(c), collapse=" "))); df1 <- as.data.frame(as.matrix(df1),stringsAsFactors=F); df2 <- unlist(res); df2 <- as.data.frame(as.matrix(df2),stringsAsFactors=F)` – hersh476 Aug 07 '17 at 14:04
In the above lines of code, I want the unlisted elements of df2 to contain row tags referring to df1 which has the extracted names pasted back together after unlisting. Does that make sense? – hersh476 Aug 07 '17 at 14:04
Duplication of elements in both data frames will result in incorrect row tags, so let's forget the idea of row tagging. Thank you for your help – hersh476 Aug 07 '17 at 15:13
Sorry, I was at the doctor. Hope it will work well enough. – Wiktor Stribiżew Aug 07 '17 at 15:40
Can you please provide a modification of your regmatches call that keeps the alphanumeric string extraction it is currently doing (including the special character "-") but no longer picks up the uppercase-only strings? INPUT: "123AB123 CDe FG123-4" "ABCD A123B" .... OUTPUT: "123AB123 FG123-4" "A123B" – hersh476 Aug 07 '17 at 19:55

@hersh476: I think you want [this](https://ideone.com/UbMgE6). `"(?<!\\S)(?=\\S*\\p{L})(?=\\S*\\d)\\S+"` extracts any 1+ non-whitespace symbol chunks that contain at least 1 letter and at least 1 digit (in any order). – Wiktor Stribiżew Aug 07 '17 at 20:00
Yes. I went through all kinds of regex posts before finally asking you, the author, for an alternate version of your function. Thank you – hersh476 Aug 07 '17 at 20:20
This kind of regex is rare. I can't even remember a similar question. – Wiktor Stribiżew Aug 07 '17 at 20:21
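
For reference, the alphanumeric-only pattern quoted in the comments above can be used with the same base R calls as the accepted answer. A minimal sketch on the question's sample vector; the names `alnum_only`, `res3`, and `joined3` are illustrative:

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Chunks of non-whitespace chars with at least 1 letter AND at least 1 digit;
# the uppercase alternative from the accepted answer is left out
alnum_only <- "(?<!\\S)(?=\\S*\\p{L})(?=\\S*\\d)\\S+"

res3 <- regmatches(x, gregexpr(alnum_only, x, perl = TRUE))

# One space-separated string per input element
joined3 <- vapply(res3, paste, character(1), collapse = " ")
joined3
```

With this pattern "CDe" and "ABCD" are no longer extracted, since neither contains a digit, while hyphenated chunks such as "FG123-4" and "ab-cd1" are still kept.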
