This is regarding R programming. My text string is

Application Games|Real Time|Social Media

Objective: I want to keep everything before the first occurrence of the pipe symbol and discard everything after it. I have used

library(stringr)  
cat <- df$category  
matches_cat <- str_match(cat,"(\\w+)|")   

It works fine, but when it comes to this text string

E-Commerce|Cryogenesis|Real Estate    

the output is only the letter E.

gsub("\\|w+,"",cat)  

also does not replace anything. I am totally new to R, so what should I do in this case?

989
Pratik
  • Possible duplicate of [R remove part of string after "."](http://stackoverflow.com/questions/10617702/r-remove-part-of-string-after). Another possible duplicate: [Character extraction from string](http://stackoverflow.com/questions/14790253/character-extraction-from-string) – Jota Jul 10 '16 at 22:05

2 Answers

We can use sub to match the metacharacter | (escaped as \\|) followed by the rest of the characters to the end of the string (.*), and replace the match with "".

sub("\\|.*", "", str1)
#[1] "Application Games" "E-Commerce" 
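
As a side note on why the attempts in the question fall short, here is a minimal base R sketch. It assumes only the second example string from the question; the variable names are made up for illustration. An unescaped | is the regex alternation operator rather than a literal pipe, \\w does not match a hyphen, and in "\\|w+" the backslash escapes the pipe, leaving w+ to mean one or more literal "w" characters.

```r
x <- "E-Commerce|Cryogenesis|Real Estate"

# \\w matches only letters, digits and underscore, so the match stops at "-".
first_word <- regmatches(x, regexpr("\\w+", x))

# "\\|w+" is a literal pipe followed by one or more literal "w" characters;
# nothing in x matches, so gsub() returns x unchanged.
unchanged <- gsub("\\|w+", "", x)

# Escaping the pipe and consuming the rest of the string gives the wanted prefix.
prefix <- sub("\\|.*", "", x)

print(first_word)
print(unchanged)
print(prefix)
```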

This can also be done with a capture group: match one or more characters that are not | from the start of the string, capture them as a group, and use the backreference to that group (\\1) as the replacement.

sub("^([^|]+)\\|.*", "\\1", str1)
#[1] "Application Games" "E-Commerce"   
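
The two sub patterns agree on the sample data but can differ on edge cases. The sketch below uses made-up inputs (not from the question) to show what each pattern does when a string has no pipe or starts with one.

```r
# Hypothetical inputs: a normal value, one without a pipe, one with a leading pipe
x <- c("E-Commerce|Cryogenesis", "Standalone", "|Leading pipe")

# Deletes from the first pipe onward; strings without a pipe are left alone,
# and a leading pipe leaves an empty string.
drop_tail <- sub("\\|.*", "", x)

# Requires at least one non-pipe character before the pipe; when the pattern
# does not match at all, sub() returns the input unchanged.
keep_group <- sub("^([^|]+)\\|.*", "\\1", x)

print(drop_tail)
print(keep_group)
```

Which behaviour is preferable depends on whether an empty first field should come back as "" or be left untouched.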

If we need a package solution, str_extract can be used as well

library(stringr)
str_extract(str1, "[^|]+")
#[1] "Application Games" "E-Commerce"    

Or use word, also from stringr

word(str1, 1, sep="[|]")
#[1] "Application Games" "E-Commerce"       
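
To tie this back to the question, where the strings live in df$category, here is a small sketch using base R only. The data frame contents and the new column name first_category are assumptions for illustration; the NA row is added to show that sub passes missing values through.

```r
# Hypothetical data frame standing in for the df from the question
df <- data.frame(
  category = c("Application Games|Real Time|Social Media",
               "E-Commerce|Cryogenesis|Real Estate",
               NA),
  stringsAsFactors = FALSE
)

# sub() is vectorised, so the whole column is processed in one call;
# NA entries stay NA.
df$first_category <- sub("\\|.*", "", df$category)

print(df$first_category)
```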

NOTE: Here as well, I show compact code, including base R methods, that needs no splitting or looping.

Benchmarks

str2 <- rep(str1, 1e5)
system.time(sub("\\|.*", "", str2) )
#   user  system elapsed 
#   0.20    0.00    0.21 
system.time(str_extract(str2, "[^|]+") )
#   user  system elapsed 
#  0.08    0.00    0.08 

system.time({
  l <- strsplit(str2,"\\|")
  sapply(1:length(l), function(i) l[[i]][1])
})
#   user  system elapsed 
#   0.5     0.0     0.5 

data

str1 <-  c("Application Games|Real Time|Social Media", 
             "E-Commerce|Cryogenesis|Real Estate")
akrun

You can use strsplit() to separate the string, then select just the first part

s1 <- "Application Games|Real Time|Social Media"
strsplit(s1,"\\|")[[1]][1]
#[1] "Application Games"

To apply this to a vector of strings, you can use sapply to extract the first element of each split result.

l <- strsplit(str1,"\\|")
sapply(1:length(l), function(i) l[[i]][1])
#[1] "Application Games" "E-Commerce"   
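
A possible variation on the same idea, offered as a sketch rather than a replacement: iterate over the list directly with vapply, which checks that each result is a single string, and pass fixed = TRUE so the pipe is treated literally and needs no escaping. The names parts and first_parts are made up for illustration.

```r
str1 <- c("Application Games|Real Time|Social Media",
          "E-Commerce|Cryogenesis|Real Estate")

# fixed = TRUE treats "|" as a plain character instead of a regex operator
parts <- strsplit(str1, "|", fixed = TRUE)

# vapply() enforces that every result is a length-one character vector
first_parts <- vapply(parts, function(p) p[1], character(1))

print(first_parts)
```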
dww