How to avoid code duplication in R

Question

I've been using R for a while now, but I find myself writing the same kind of expression over and over again. I guess there must be an easier way than to write:

filtered <- names(table(col))[!match.something(names(table(col)))]

Where match.something returns a logical (TRUE/FALSE) vector.

Of course I could do the following to make it more concise and remove some duplication:

x <- names(table(col))
filtered <- x[!match.something(x)]

But it still feels like there should be a different/easier way to select like this. Something like subset(x, fun), to avoid having to type "x" twice. Which function am I looking for?

Example:

> col <- c("AA", "AA", "GG", "GG", "AA", "AG")
> x <- names(table(col))
> x[match.bases(x)]
[1] "AA" "GG"

This is the function I'm using right now:

BASES = c("AA", "GG", "CC", "TT")

match.bases <- function(df) {
  df %in% BASES
}

Thanks, I tried `select(x, match.bases)` but got: Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character". `filter` doesn't seem to work either. I should probably start with a dplyr tutorial to get there? — exic, Oct 08 '17 at 10:27
What exactly are you trying to do? What is `col`, why use `table`? Could you post a complete example? — AkselA, Oct 08 '17 at 10:38
Related: [*Difference between `%in%` and `==`*](https://stackoverflow.com/questions/15358006/difference-between-in-and) — Jaap, Oct 08 '17 at 10:42
Perhaps your `match.bases` function is much more complicated, but just seeing this example I would suggest something like `intersect(x, c("AA", "GG", "CC", "TT"))` - not only shorter, but also much more efficient. — Patrick Roocks, Oct 08 '17 at 10:48
Thanks for giving an alternative, I was also using the `match.bases` for something like `data[!match.bases(data)] <- NA` but it might make sense to define a variable like `BASES` and use that in both cases. — exic, Oct 08 '17 at 10:54
@AkselA I extended my example; there is probably a better way to aggregate the values than what I'm doing with table/names, but that is only a side problem. @Jaap thanks, I rewrote my match.bases function. I wrote it quite a while ago, which is why I wasn't using `%in%` yet. — exic, Oct 08 '17 at 10:59
There's also no need for `return()`. In the real data is `col` the column names of a data.frame or something similar? — AkselA, Oct 08 '17 at 11:05
col is a column of a data frame, containing SNP data produced by genetic markers (don't ask me about details, I'm programming it for my wife who is doing the actual interpretation). — exic, Oct 08 '17 at 11:10
If you have defined a match.bases function, why not go further and define other subsetting functions to be reused? E.g. something like `only.bases <- function(df){ df[match.bases(df)] }` and `allexcept.bases <- function(df){ df[!match.bases(df)] }` — digEmAll, Oct 08 '17 at 11:11
Ok, so you want to return rows where `dtf$col %in% c("AA", "GG", "CC", "TT")` is `TRUE`? — AkselA, Oct 08 '17 at 11:27
In this example yes, and I want to avoid having to type "dtf$col" more than once. I usually find myself writing a condition like `dtf$col %in% c("AA", "GG", "CC", "TT")` and then surrounding it with `dtf$col[ ... ]` to get the actual data, which is even more cumbersome if `dtf$col` is a variable with a longer name. — exic, Oct 08 '17 at 11:33

Paul · Accepted Answer · 2017-10-08T11:24:20.327

2

In this case, you want purrr::keep, which keeps only the elements that satisfy a predicate.

purrr::keep(x, match.bases)
# [1] "AA" "GG"

Edit

As digEmAll commented, the base R equivalent is Filter(match.bases, x).
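
For the negated case from the question (keeping everything that is not a base), one possible base R sketch combines Filter with Negate. BASES and match.bases are copied from the question; the names kept and dropped are only illustrative.

```r
# Definitions copied from the question
BASES <- c("AA", "GG", "CC", "TT")
match.bases <- function(v) v %in% BASES

col <- c("AA", "AA", "GG", "GG", "AA", "AG")
x <- names(table(col))   # "AA" "AG" "GG"

kept    <- Filter(match.bases, x)          # elements where the predicate is TRUE
dropped <- Filter(Negate(match.bases), x)  # elements where the predicate is FALSE

kept     # [1] "AA" "GG"
dropped  # [1] "AG"
```

Negate wraps a predicate so that it returns the opposite logical value, which avoids writing a second helper just for the complement.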

edited Oct 08 '17 at 11:24

answered Oct 08 '17 at 10:38

That does the job, however I was hoping for something that doesn't need an external library. Can this be done with something more basic or at least a library that is more widespread (e.g. I haven't heard of purrr before, but have encountered dplyr quite often)? – exic Oct 08 '17 at 11:06

AkselA · Answer 2 · 2017-10-08T12:45:26.990

Assuming the goal is to subset rows based on the values in col, I don't know of a more concise way to do it in base R than this:

col <- c("AA", "AA", "GC", "TG", "GG", "GG", "AA", "AG")
dtf <- data.frame(col, val=1:8)

dtf[dtf$col %in% c("AA", "GG", "CC", "TT"),]

# or using your function
dtf[match.bases(dtf$col),]

#   col val
# 1  AA   1
# 2  AA   2
# 5  GG   5
# 6  GG   6
# 7  AA   7

If you only want the matching values of val, with() lets you do it without repeating the data frame name:

with(dtf, val[col %in% c("AA", "GG", "CC", "TT")])

# or using your function
with(dtf, val[match.bases(col)])

# [1] 1 2 5 6 7
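
Base R's subset() is another option that evaluates the condition inside the data frame, so dtf is named only once. This is a sketch using the same example data; BASES is the vector from the question, and rows and vals are illustrative names.

```r
BASES <- c("AA", "GG", "CC", "TT")
col <- c("AA", "AA", "GC", "TG", "GG", "GG", "AA", "AG")
dtf <- data.frame(col, val = 1:8)

# Matching rows, all columns; `col` is looked up inside dtf
rows <- subset(dtf, col %in% BASES)

# Only the val column of the matching rows, extracted as a vector
vals <- subset(dtf, col %in% BASES, select = val)$val

vals  # [1] 1 2 5 6 7
```

The documentation for subset() describes it as a convenience function for interactive use, so inside functions or packages the explicit `[` form may still be the safer choice.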

Also, if you want a function like subset(x, FUN), you can just make one:

sbst <- function(x, FUN, ...) {
    x[FUN(x, ...)]
}

sbst(x, match.bases)
# [1] "AA" "GG"
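
The `...` argument means extra arguments are forwarded to the predicate, so the helper is not tied to match.bases. A short sketch of how that could be used; the vector x and the result names are only illustrative.

```r
sbst <- function(x, FUN, ...) {
    x[FUN(x, ...)]
}

BASES <- c("AA", "GG", "CC", "TT")
x <- c("AA", "AG", "GG")

# FUN(x, BASES) is the same as x %in% BASES
in_bases  <- sbst(x, `%in%`, BASES)          # [1] "AA" "GG"

# Negate() flips the predicate to get the complement
not_bases <- sbst(x, Negate(`%in%`), BASES)  # [1] "AG"

# Any vectorised predicate works, e.g. base R's startsWith()
starts_a  <- sbst(x, startsWith, "A")        # [1] "AA" "AG"
```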

score 0 · Answer 3 · answered Oct 08 '17 at 11:19

First, you can index a table object directly by name, so you do not need your match function at all:

col   <- c("AA", "AA", "GG", "GG", "AA", "AG")
bases <- c("AA", "GG", "CC", "TT")
table(col)[bases]

To avoid the NAs that appear for bases that do not occur in col, you can use intersect:

col   <- c("AA", "AA", "GG", "GG", "AA", "AG")
bases <- c("AA", "GG", "CC", "TT")
table(col)[intersect(col, bases)]
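
If only the distinct values are needed and not the counts, the table step may be unnecessary. A sketch assuming the predicate is plain set membership, as in the example; the result names are illustrative.

```r
col   <- c("AA", "AA", "GG", "GG", "AA", "AG")
bases <- c("AA", "GG", "CC", "TT")

# Distinct values of col that are in bases
in_bases  <- intersect(col, bases)  # [1] "AA" "GG"

# Distinct values of col that are not in bases
not_bases <- setdiff(col, bases)    # [1] "AG"
```

Unlike names(table(col)), these return values in order of first appearance rather than sorted, which happens to give the same result for this data.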
