41

I have a data set wherein a column looks like this:

ABC|DEF|GHI,  
ABCD|EFG|HIJK,  
ABCDE|FGHI|JKL,  
DEF|GHIJ|KLM,  
GHI|JKLM|NO|PQRS,  
BCDE|FGHI|JKL  

.... and so on

I need to extract the characters that appear before the first | symbol.

In Excel, we would use a combination of MID-SEARCH or a LEFT-SEARCH, R contains substr().

The syntax is - substr(x, <start>,<stop>)

In my case, start will always be 1. For stop, we need to search by |. How can we achieve this? Are there alternate ways to do this?

lmo
  • 37,904
  • 9
  • 56
  • 69
  • 1
    `?regexpr` returns the index of the first match that can be used as your "stop" argument -- `regexpr("|", x, fixed = TRUE) - 1` – alexis_laz Jul 10 '16 at 12:42

3 Answers3

67

We can use sub

sub("\\|.*", "", str1)
#[1] "ABC"

Or with strsplit

strsplit(str1, "[|]")[[1]][1]
#[1] "ABC"

Update

If we use the data from @hrbrmstr

sub("\\|.*", "", df$V1)
#[1] "ABC"   "ABCD"  "ABCDE" "DEF"   "GHI"   "BCDE" 

These are all base R methods. No external packages used.

data

str1 <- "ABC|DEF|GHI ABCD|EFG|HIJK ABCDE|FGHI|JKL DEF|GHIJ|KLM GHI|JKLM|NO|PQRS BCDE|FGHI|JKL"
akrun
  • 874,273
  • 37
  • 540
  • 662
25

Another option word function of stringr package

library(stringr)
word(df1$V1,1,sep = "\\|")

Data

df1 <- read.table(text = "ABC|DEF|GHI,  
ABCD|EFG|HIJK,  
ABCDE|FGHI|JKL,  
DEF|GHIJ|KLM,  
GHI|JKLM|NO|PQRS,  
BCDE|FGHI|JKL")
user2100721
  • 3,557
  • 2
  • 20
  • 29
5

with stringi:

library(stringi)

df <- read.table(text="ABC|DEF|GHI,1
ABCD|EFG|HIJK,2
ABCDE|FGHI|JKL,3  
DEF|GHIJ|KLM,4
GHI|JKLM|NO|PQRS,5
BCDE|FGHI|JKL,6", sep=",", header=FALSE, stringsAsFactors=FALSE)

stri_match_first_regex(df$V1, "(.*?)\\|")[,2]
## [1] "ABC"   "ABCD"  "ABCDE" "DEF"   "GHI"   "BCDE" 
hrbrmstr
  • 77,368
  • 11
  • 139
  • 205