20

I need to subset a df to include certain strings. Some of these are full column names, and the following works fine:

testData[,c("FullColName1","FullColName2","FullColName3")]

My problem is that I need to expand this to also include column names that contain specific strings that may partially match to some other column names. These strings include letters and symbols:

"PartString1()","PartString2()"

I tried putting wildcards around these. (I've indicated this below with the prefix "star" because the "*" symbol didn't render correctly.)

testData[ ,c("FullColName1","FullColName2","FullColName3",
             "starPartString1()star","starPartString2()star")]

But I'm getting an error message: undefined columns selected. I can't figure out if or how I need grep to make this work.

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
user3614783
  • 821
  • 6
  • 12
  • 20
  • 1
    What is "PartString1()"? The paired parentheses make me think you are using a function. It's NOT a valid R name. Try putting in code that constructs what you are working with. – IRTFM Jun 12 '14 at 04:49
  • "PartString1()" is a variable name. I was trying to use grep to match on partial column names that included specific char strings and a set of parentheses. Problem solved with df2<-df1[, grepl("FullColName1|FullColName2|FullColName3|PartSting1[[:punct:]]|PartString2[[:punct:]]",names(df1))] – user3614783 Jun 14 '14 at 04:08

3 Answers3

17

You mentioned you may be looking for symbols, so for this particular example we can use [[:punct:]] as our regular expression. This will find all the strings with punctuation symbols in the column names.

d <- data.frame(1:3, 3:1, 11:13, 13:11, rep(1, 3))
names(d) <- c("FullColName1", "FullColName2", "FullColName3",
              "PartString1()","PartString2()")

d[grepl("[[:punct:]]", names(d))]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1

This last part just illustrates another way to do this with other string processing functions from stringr

library(stringr)
d[str_detect(names(d), "[[:punct:]]")]
#   PartString1() PartString2()
# 1            13             1
# 2            12             1
# 3            11             1

ADD per OPs comment

d[grepl("ring[12()]", names(d))]

to get either of the substrings ring1() or ring2() from the names vector

Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • Thanks very much Richard. I did not know about [[:punct:]]. This almost solves my problem, but I need to select all columns that have the substring ring1() or the substring ring2(). Just searching for punctuation is too broad. – user3614783 Jun 12 '14 at 13:59
  • That would just be `d[, grepl("ring[12()]", names(d))]`. Would you like me to change that in my answer? – Rich Scriven Jun 12 '14 at 14:07
  • I came up with d[, grep("ring1[[:punct:]]|ring2[[:punct:]]",names(d))], but yours is more elegant. Thanks again. – user3614783 Jun 12 '14 at 14:12
9

You can use grep to find indices of column names with partial match to a particular pattern

require(PerformanceAnalytics)
data(managers)

colnames(managers)
#[1] "HAM1"        "HAM2"        "HAM3"        "HAM4"        "HAM5"       
#[6] "HAM6"        "EDHEC LS EQ" "SP500 TR"    "US 10Y TR"   "US 3m TR"

suppose the pattern you want to match is "HAM", along with some fixed column names ("SP500 TR" "US 10Y TR" "US 3m TR")

head(managers[,c("SP500 TR","US 10Y TR","US 3m TR",colnames(managers)[grep("HAM",colnames(managers))])])
#           SP500 TR US 10Y TR US 3m TR    HAM1 HAM2    HAM3    HAM4 HAM5 HAM6
#1996-01-31   0.0340   0.00380  0.00456  0.0074   NA  0.0349  0.0222   NA   NA
#1996-02-29   0.0093  -0.03532  0.00398  0.0193   NA  0.0351  0.0195   NA   NA
#1996-03-31   0.0096  -0.01057  0.00371  0.0155   NA  0.0258 -0.0098   NA   NA
#1996-04-30   0.0147  -0.01739  0.00428 -0.0091   NA  0.0449  0.0236   NA   NA
#1996-05-31   0.0258  -0.00543  0.00443  0.0076   NA  0.0353  0.0028   NA   NA
#1996-06-30   0.0038   0.01507  0.00412 -0.0039   NA -0.0303 -0.0019   NA   NA

you can specify multiple patterns using, grep("pattern1 | pattern2 ", colnames(data))

Silence Dogood
  • 3,587
  • 1
  • 13
  • 17
  • thanks for showing me your multiple pattern example. I had been putting each of the patterns in quotes, rather than putting quotes around the entire set of patterns. – user3614783 Jun 12 '14 at 14:21
  • In my application, I had to write ``grep("pattern1|pattern2 ", colnames(data))``, *without* a space either side of ``|`` – PatrickT Feb 03 '16 at 16:45
3

You can use grepl for a search by column name. It returns a logical vector indicating matches.

Here is an example:

d <- read.table(header=TRUE, check.names=FALSE,
                text="1PartString()2 1PartString()3 OtherCol
                1 2 3
                3 4 5")
d
##   1PartString()2 1PartString()3 OtherCol
## 1              1              2        3
## 2              3              4        5

d[,grepl("PartString\\(\\)", names(d))]
##   1PartString()2 1PartString()3
## 1              1              2
## 2              3              4

grepl check to see if the pattern is present anywhere in the name, so a wildcard is not required.

Matthew Lundberg
  • 42,009
  • 6
  • 90
  • 112
  • Or just use patt="PartString". The notion of putting paired parentheses in dataframe column names just seems so wrong. – IRTFM Jun 12 '14 at 06:37