3
 datetime                                label            option_title option_value  lead difference

 1 2016-07-22                                   GE  3 - Commercial Review            3     2         -1
 2 2017-02-20                                   GE    2 - Solution Review            2     1         -1
 3 2017-02-20                                   GE 1 - Opportunity Review            1     2          1
 4 2017-04-18                                   GE    2 - Solution Review            2     3          1
 5 2017-04-19                                   GE  3 - Commercial Review            3     4          1
 6 2017-04-19                                   GE    4 - Submit Proposal            4     5          1
 7 2017-08-08                                   GE   5 - Proposal Awarded            5    NA         NA
 8 2016-08-02                                 HSBC   5 - Proposal Awarded            5     6          1
 9 2016-12-13                                 HSBC   6 - Delivery Phase 1            6     7          1
10 2017-08-07                                 HSBC       7 - Phase 1 Live            7    NA         NA
11 2016-07-22                                Lowes      Pre-Qualification            0    NA         NA
12 2016-08-02                          Danske Bank   6 - Delivery Phase 1            6    NA         NA
13 2016-07-22 AP Moller Maersk (IT Transformation)  3 - Commercial Review            3    NA         NA
14 2016-07-22                   BHP Billiton - APJ      Pre-Qualification            0     2          2
15 2016-07-26                   BHP Billiton - APJ    2 - Solution Review            2     0         -2
16 2016-07-26                   BHP Billiton - APJ      Pre-Qualification            0     2          2

I would like to create a new data frame off of this one that selects only the "labels" that have a negative "difference" value. However, I want to select ALL of the similar "labels" like so:

 datetime                                label            option_title option_value  lead difference

 1 2016-07-22                                   GE  3 - Commercial Review            3     2         -1
 2 2017-02-20                                   GE    2 - Solution Review            2     1         -1
 3 2017-02-20                                   GE 1 - Opportunity Review            1     2          1
 4 2017-04-18                                   GE    2 - Solution Review            2     3          1
 5 2017-04-19                                   GE  3 - Commercial Review            3     4          1
 6 2017-04-19                                   GE    4 - Submit Proposal            4     5          1
 7 2017-08-08                                   GE   5 - Proposal Awarded            5    NA         NA
 8 2016-07-22                   BHP Billiton - APJ      Pre-Qualification            0     2          2
 9 2016-07-26                   BHP Billiton - APJ    2 - Solution Review            2     0         -2
10 2016-07-26                   BHP Billiton - APJ      Pre-Qualification            0     2          2

I'm not sure how to do this in dplyr....would SQL be better for this? (I haven't used the sql package in R much)

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
rfortin
  • 184
  • 8

5 Answers5

2

Another possible way you could do this is with dplyr:

library(dplyr)

df %>% group_by(label) %>% filter(any(difference < 0))

#> # A tibble: 10 x 6
#> # Groups:   label [2]
#>      datetime              label           option_title option_value  lead
#>        <date>              <chr>                  <chr>        <int> <int>
#>  1 2016-07-22                 GE  3 - Commercial Review            3     2
#>  2 2017-02-20                 GE    2 - Solution Review            2     1
#>  3 2017-02-20                 GE 1 - Opportunity Review            1     2
#>  4 2017-04-18                 GE    2 - Solution Review            2     3
#>  5 2017-04-19                 GE  3 - Commercial Review            3     4
#>  6 2017-04-19                 GE    4 - Submit Proposal            4     5
#>  7 2017-08-08                 GE   5 - Proposal Awarded            5    NA
#>  8 2016-07-22 BHP Billiton - APJ      Pre-Qualification            0     2
#>  9 2016-07-26 BHP Billiton - APJ    2 - Solution Review            2     0
#> 10 2016-07-26 BHP Billiton - APJ      Pre-Qualification            0     2
#> # ... with 1 more variables: difference <int>

Data

library(readr)
df <- read_csv("rowid, datetime,                                label,            option_title, option_value,  lead, difference
1, 2016-07-22,                                   GE,  3 - Commercial Review,            3,     2,         -1
2, 2017-02-20,                                   GE,    2 - Solution Review,            2,     1,         -1
3, 2017-02-20,                                   GE, 1 - Opportunity Review,            1,     2,          1
4, 2017-04-18,                                   GE,    2 - Solution Review,            2,     3,          1
5, 2017-04-19,                                   GE,  3 - Commercial Review,            3,     4,          1
6, 2017-04-19,                                   GE,    4 - Submit Proposal,            4,     5,          1
7, 2017-08-08,                                   GE,   5 - Proposal Awarded,            5,    NA,         NA
8, 2016-08-02,                                 HSBC,   5 - Proposal Awarded,            5,     6,          1
9, 2016-12-13,                                 HSBC,   6 - Delivery Phase 1,            6,     7,          1
10, 2017-08-07,                                 HSBC,       7 - Phase 1 Live,            7,    NA,         NA
11, 2016-07-22,                                Lowes,      Pre-Qualification,            0,    NA,         NA
12, 2016-08-02,                          Danske Bank,   6 - Delivery Phase 1,            6,    NA,         NA
13, 2016-07-22, AP Moller Maersk (IT Transformation),  3 - Commercial Review,            3,    NA,         NA
14, 2016-07-22,                   BHP Billiton - APJ,      Pre-Qualification,            0,     2,          2
15, 2016-07-26,                   BHP Billiton - APJ,    2 - Solution Review,            2,     0,         -2
16, 2016-07-26,                   BHP Billiton - APJ,      Pre-Qualification,            0,     2,          2")

df <- df[-1]
markdly
  • 4,394
  • 2
  • 19
  • 27
1

If your dataframe is called df then this should do the trick:

aux <- df$label[df$difference < 0]
df2 <- df[df$label %in% aux,]

aux contains all labels where df$difference < 0. So df2 contains all rows from df with the labels of aux. Of course this could also be put as a single command:

df2 <- df[df$label %in% df$label[df$difference < 0],]

or

df <- df[df$label %in% df$label[df$difference < 0],]

A quick test:

> df
  label difference
1   test          2
2  test2          3
3  test2         -1
4  test3         -1
5  test4          4
6  test4          5

turns into the following df2:

> df2
  label difference
2  test2          3
3  test2         -1
4  test3         -1

As you can see the row numbering is wrong now. this is fixed using row.names(df2) <- 1:NROW(df2)

> df2
  label difference
1  test2          3
2  test2         -1
3  test3         -1
f.lechleitner
  • 3,554
  • 1
  • 17
  • 35
0

You could use an in clause on a subselect

select * from my_table
where label in (
  select label form my_table 
  where difference <0
)

or a join on a subselect

select * from my_table m
INNER JOIN (
  select label form my_table 
  where difference <0
) t on m-label = t.lable 
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
0

You can do it with R, no need to use the SQL package for this.

Sample data

difference <- c(1, -2, 3, -5)
labels <- c("a", "b", "c", "d")
df <- data.frame(a, b)

You can do a simple subset, in which you select the values with a negative difference:

minus_df <- subset(df, difference<0)

Finally, you create a list of the labels (you could do this directly in the previous step but it's always better to check whether the data is correct.

m_labels <- minus_df$labels 
Barbara
  • 1,118
  • 2
  • 11
  • 34
0

Try subset function

df <- subset(df, sign(df$diff) == -1)
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 1
    This would only select rows where the difference is negative, but OP wants all the rows of a label which contain at least one negative difference. – f.lechleitner Nov 14 '17 at 09:55