I have two data frames in R (`meta` stands for some redundant info in the ids):

df1:

                id  value1  value2  value3  value4
id1_meta_meta-meta  4.93    13.93   16.8    35.39
id2_meta_meta-meta  28.63   45.43   30.52   61.71
id3_meta_meta-meta  3.35    1.26    7.98    4.43
id4_meta_meta-meta  16.78   50.47   32.48   55.52
id5_meta_meta-meta  474.23  807.71  664.45  442.55
id6_meta_meta-meta  26.26   32.83   24.64   41.58
id7_meta_meta-meta  230.1   202.93  166.71  295.48
id8_meta_meta-meta  651.21  1282.71 1012.28 2650.21

df2:

V1
id1
id2
id3
id4
id5

Question

I am trying to keep only the rows of df1 whose id prefix (the part before the first `_`) appears in df2$V1.

Code

library(dplyr)
library(stringr)
df.common = df1 %>%
  filter(str_detect(id, '*_') %in% df2$V1)

Error

Error in filter_impl(.data, quo) : 
  Evaluation error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX).

Desired output

df.common:

                id  value1  value2  value3  value4
id1_meta_meta-meta  4.93    13.93   16.8    35.39
id2_meta_meta-meta  28.63   45.43   30.52   61.71
id3_meta_meta-meta  3.35    1.26    7.98    4.43
id4_meta_meta-meta  16.78   50.47   32.48   55.52
id5_meta_meta-meta  474.23  807.71  664.45  442.55
www
sbradbio
  • Your original code will work if you change the `filter` condition to `filter(str_detect(id, df2$V1))` – Jake Kaupp Aug 17 '17 at 16:24
  • @JakeKaupp I get this error `Warning message: In stri_detect_regex(string, pattern, opts_regex = opts(pattern)) : longer object length is not a multiple of shorter object length` – sbradbio Aug 17 '17 at 16:27
  • It's a warning, not an error, and results in your desired output. – Jake Kaupp Aug 17 '17 at 16:33
  • True, rookie mistake, apologies. But I do not get what I expected: `> dim(df.common) [1] 2 13` – sbradbio Aug 17 '17 at 16:36
  • `str_detect` detects strings and returns TRUE or FALSE, so your code is looking for TRUE or FALSE in `df2`. Instead, use `str_extract` to pull out the ID part and then test with that: `str_extract(id, "id[0-9]+") %in% df2$V1`. – Gregor Thomas Aug 17 '17 at 16:46
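
The difference described in the last comment can be seen on plain vectors. This is a minimal sketch, assuming a short hypothetical vector `id` shaped like `df1$id` and a vector `key` standing in for `df2$V1`; neither name comes from the original post.

```r
library(stringr)

# Hypothetical stand-ins for df1$id and df2$V1
id  <- c("id1_meta_meta-meta", "id6_meta_meta-meta", "id3_meta_meta-meta")
key <- c("id1", "id2", "id3", "id4", "id5")

# str_detect() returns a logical vector (one TRUE/FALSE per element),
# so testing it with %in% against character ids never matches
detected <- str_detect(id, "_")
detected %in% key

# str_extract() returns the matched text itself, which can be compared to the keys
extracted <- str_extract(id, "id[0-9]+")
keep <- extracted %in% key
keep
```

Inside `filter()`, the `keep` expression is what selects the rows. Note also that the pattern `'*_'` from the question is not a valid regular expression, because `*` is a quantifier with nothing in front of it to repeat.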

2 Answers

If you are using dplyr and stringr, you can also consider this approach. str_replace_all works like gsub. semi_join is a "filtering join": it keeps only the rows of df1 that have a match in df2, without adding any columns from df2.

library(dplyr)
library(stringr)

df3 <- df1 %>%
  mutate(id2 = str_replace_all(id, "_.*", "")) %>%
  semi_join(df2, by = c("id2" = "V1")) %>%
  select(-id2)

df3
                  id value1 value2 value3 value4
1 id1_meta_meta-meta   4.93  13.93  16.80  35.39
2 id2_meta_meta-meta  28.63  45.43  30.52  61.71
3 id3_meta_meta-meta   3.35   1.26   7.98   4.43
4 id4_meta_meta-meta  16.78  50.47  32.48  55.52
5 id5_meta_meta-meta 474.23 807.71 664.45 442.55
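
To show why a filtering join fits here, the following minimal sketch compares `semi_join` with `inner_join`. The data frames `x` and `y` and their contents are hypothetical, chosen so that `y` contains a repeated id; they are not from the original question.

```r
library(dplyr)

# Hypothetical data: y lists "id1" twice
x <- data.frame(key = c("id1", "id2", "id6"),
                value = c(4.93, 28.63, 26.26),
                stringsAsFactors = FALSE)
y <- data.frame(V1 = c("id1", "id1", "id2"), stringsAsFactors = FALSE)

# semi_join() only filters x: each matching row of x is kept once
semi <- semi_join(x, y, by = c("key" = "V1"))

# inner_join() returns one row per matching pair, so "id1" appears twice
inner <- inner_join(x, y, by = c("key" = "V1"))

nrow(semi)
nrow(inner)
```

With ids that are unique in df2, as in the question, both joins would keep the same rows; the difference only matters when the lookup table has repeats.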
www
  • I will try this, but correct me if I am wrong: @PoGibas's answer is a concise one-liner. – sbradbio Aug 17 '17 at 16:29
  • Well... if you only want to see the most concise answer, I will delete my answer shortly. If you want to learn more about the use of `dplyr` and `stringr` since you are using these packages, I will keep my answer here as an optional approach. What do you say? – www Aug 17 '17 at 16:32
  • Sure, I have accepted it. You are absolutely correct, it can be an optional way. – sbradbio Aug 17 '17 at 16:35
  1. Use gsub to trim id in df1

    • gsub("_.*", "", df1$id) removes the first _ and everything after it
  2. Check which trimmed ids are in df2$V1 (%in% returns a logical vector, one TRUE/FALSE per row of df1)

  3. Extract those rows from df1

    df1[gsub("_.*", "", df1$id) %in% df2$V1, ]
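
The three steps can be run end to end on the question's data. This is a minimal sketch in base R, assuming the lookup column is `V1` as shown in the question; only `value1` is reproduced for brevity, and the intermediate names `trimmed` and `keep` are introduced here for illustration.

```r
# Rebuild a reduced df1 (ids and value1 only) and df2 from the question
df1 <- data.frame(
  id = paste0("id", 1:8, "_meta_meta-meta"),
  value1 = c(4.93, 28.63, 3.35, 16.78, 474.23, 26.26, 230.1, 651.21),
  stringsAsFactors = FALSE
)
df2 <- data.frame(V1 = paste0("id", 1:5), stringsAsFactors = FALSE)

# Step 1: drop the first "_" and everything after it
trimmed <- gsub("_.*", "", df1$id)

# Step 2: logical vector marking which trimmed ids occur in df2$V1
keep <- trimmed %in% df2$V1

# Step 3: subset the rows of df1 with that logical vector
df.common <- df1[keep, ]
df.common
```

Because `keep` is computed from the trimmed ids but used to index the original `df1`, the full ids (including the `meta` parts) are preserved in the result.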
    
pogibas