
I use R, and I want to use regular expressions to calculate, for each row, the total number of occurrences of a string pattern across all columns of a data frame containing epigenetic information. There are 40 columns, 15 of which may or may not contain the pattern of interest. The code that has got me closest to what I'm looking for is:

# Looking to match following exact pattern ',.,' which will always be 
# preceded and followed by a sequence of characters or numbers.
# Note: the full stop in the pattern above signifies any character

df$rowsum <- rowSums(apply(df, 2, grepl, pattern = ".*,.,.*"))

For each row, this gives a count of the columns that contain the pattern. The issue is that any individual cell can contain the pattern more than once. I've tried several combinations of functions and realise that grep/grepl is probably not the solution: it only reports whether a cell matches, so it can count at most one match per cell. I need a solution that counts every occurrence of the pattern within each cell of a row and adds these counts to give a row total, which is then stored in the rowsum column for that row.

For context a typical individual occurrence of the contents of a particular cell could be:

2212(AATTGCCCCACA,-,0.00)

If there are multiple occurrences, they appear in the cell as one continuous string with the entries separated by commas. For example, for two entries:

144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)

I'm using the ,., as the unique identifier of each entry, as everything else for each entry is variable.

Here is some toy data:

df <- data.frame(NAMES = c('A', 'B', 'C', 'D'), 
            GENE1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "NA", "NA"), 
            GENE2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "NA"),  
            stringsAsFactors = FALSE)

The ideal code would return the data frame with a rowsum column attached, containing these totals:

# Omitted GENE column contents to save space

NAMES    GENE1     GENE2     rowsum
  A       ...       ...         4
  B       ...       ...         2
  C       ...       ...         1
  D       ...       ...         0

Been stumped on this for 48 hrs. Any help would be greatly appreciated.

Darren

1 Answer


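We can use str_extract_all from stringr to extract every match in each cell, then count the matches with lengths and add the per-column counts together: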

library(stringr)
df$rowsum <- Reduce(`+`, lapply(df[-1], 
        function(x) lengths(str_extract_all(x, "\\d+\\("))))
df$rowsum
#[1] 4 2 1 0
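For comparison, here is a minimal base R sketch that counts the asker's `,.,` pattern directly rather than the digits before `(`. The name `count_pattern` is a hypothetical helper, not part of any package. The sketch assumes each entry contains exactly one `,.,` triple, as described in the question, and it treats real NA cells as zero matches.

```r
# Toy data from the question
df <- data.frame(
  NAMES = c("A", "B", "C", "D"),
  GENE1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)",
            "2(TGTGAGTCAC,+,0.00)", "NA", "NA"),
  GENE2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),",
            "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "NA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of non-overlapping matches of `pattern`
# in each element of the character vector `x`.
count_pattern <- function(x, pattern = ",.,") {
  hits <- gregexpr(pattern, x)
  # gregexpr() gives -1 for "no match" and NA for a missing input,
  # so count only positive start positions and drop NAs.
  vapply(hits, function(h) sum(h > 0, na.rm = TRUE), integer(1))
}

# One column of counts per gene column, then add across each row
counts <- sapply(df[-1], count_pattern)
df$rowsum <- rowSums(counts)
df$rowsum
```

Because the helper works on one column at a time, the same call should extend to the full 40-column data by changing `df[-1]` to whichever columns hold the entries.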
akrun
  • Hi @akrun, I'm not an advanced R user, so I'm finding it difficult to interpret your response. Your code works for the toy data but not for my full data set, where it seems to overestimate the row sums. – Darren Nov 22 '16 at 12:43
  • @Darren I am using the regex to match the numbers before the `(`. So, in the example, the first row has 4 such instances, the second has 2, and so on. – akrun Nov 22 '16 at 12:47
  • Messed around with this for a while; it turned out the overestimation was caused by NAs that R had assigned when I imported the file. The NAs in the toy data were strings, so they didn't cause the same problem. When I changed the NAs to blank strings, the overestimated row sums vanished, giving me the answers I expected. Any idea why this occurred? Thank you for your help, much appreciated!! – Darren Nov 22 '16 at 14:43
  • @Darren I think NAs will be automatically read as missing values. Check whether there are any leading/trailing spaces or other characters. In that case, you can specify that string in `na.strings`, i.e. `read.csv("yourfile.csv", na.strings = "NA")` – akrun Nov 22 '16 at 14:45
  • Thanks @akrun! Your code is working fine for me now. – Darren Nov 22 '16 at 15:15
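The difference discussed in the comments, between the string "NA" in the toy data and the real missing values created on import, can be sketched in base R as follows. A likely explanation for the overestimate (not confirmed in the thread) is that str_extract_all returns a single NA for a missing cell, which lengths then counts as one match. The inline CSV text and the names `csv_text`, `imported`, `toy` and `was_missing` are made up for illustration.

```r
# Hypothetical two-row file; the unquoted NA field becomes a real
# missing value because read.csv() uses na.strings = "NA" by default.
csv_text <- 'NAMES,GENE1
A,"144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)"
B,NA
'
imported <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Real missing value in the second row
was_missing <- is.na(imported$GENE1)
was_missing

# In the toy data "NA" is an ordinary two-character string, not missing
toy <- c("2(TGTGAGTCAC,+,0.00)", "NA")
is.na(toy)

# The workaround from the comments: turn missing cells into empty
# strings before counting, so they contribute zero matches.
imported$GENE1[was_missing] <- ""
```

Running `is.na()` on the imported columns before counting is a quick way to check whether this applies to a given data set.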