I have a data frame like this:

         ds        y
1   2015-12-31 35.59050
2   2016-01-01 28.75111
3   2016-01-04 25.53158
4   2016-01-06 17.75369
5   2016-01-07 29.01500
6   2016-01-08 29.22663
7   2016-01-09 29.05249
8   2016-01-10 27.54387
9   2016-01-11 28.05674
10  2016-01-12 29.00901
11  2016-01-13 31.66441
12  2016-01-14 29.18520
13  2016-01-15 29.79364
14  2016-01-16 30.07852

I'm trying to write a loop that removes the rows whose values in the 'y' column are above 34 or below 26, because that is where my outliers are:

for (i in grupo$y){if (i < 26) {grupo$y[i] = NA}}

I tried this to remove the values below 26. I don't get any errors, but the rows are not removed.

Any suggestions on how to remove those outliers?

Thanks in advance

merv
Miguel 2488

2 Answers

Here are a base R solution and a tidyverse solution. Part of R's strength is that it works across whole vectors by default, so a problem like this one often doesn't need a for loop. The issue with your loop is that it assigns NA to values. That doesn't remove the rows; it only replaces those values with NA.
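To make the "working across vectors" point concrete, here is a minimal sketch that rebuilds the question's data frame (values copied from the post) and filters it with a logical vector instead of a loop. The names `keep` and `grupo_clean` are only illustrative and do not come from the original post.

```r
# Rebuild the question's data frame (values copied from the post).
grupo <- data.frame(
  ds = as.Date(c("2015-12-31", "2016-01-01", "2016-01-04", "2016-01-06",
                 "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10",
                 "2016-01-11", "2016-01-12", "2016-01-13", "2016-01-14",
                 "2016-01-15", "2016-01-16")),
  y = c(35.59050, 28.75111, 25.53158, 17.75369, 29.01500, 29.22663,
        29.05249, 27.54387, 28.05674, 29.00901, 31.66441, 29.18520,
        29.79364, 30.07852)
)

# A single comparison over the whole column gives one TRUE/FALSE per row.
keep <- grupo$y >= 26 & grupo$y <= 34

# Indexing the rows with that logical vector keeps only the TRUE rows.
grupo_clean <- grupo[keep, ]
nrow(grupo_clean)
```

With the 14 rows above, three values fall outside the range (35.59050, 25.53158, 17.75369), so `grupo_clean` has 11 rows.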

In base R, you can use subset to get the rows or columns of a data frame that meet certain criteria:

subset(grupo, y >= 26 & y <= 34)
#> # A tibble: 11 x 2
#>    ds             y
#>    <date>     <dbl>
#>  1 2016-01-01  28.8
#>  2 2016-01-07  29.0
#>  3 2016-01-08  29.2
#>  4 2016-01-09  29.1
#>  5 2016-01-10  27.5
#>  6 2016-01-11  28.1
#>  7 2016-01-12  29.0
#>  8 2016-01-13  31.7
#>  9 2016-01-14  29.2
#> 10 2016-01-15  29.8
#> 11 2016-01-16  30.1

Or using dplyr functions, you can filter your data similarly, and make use of dplyr::between. between(y, 26, 34) is a shorthand for y >= 26 & y <= 34.

library(dplyr)

grupo %>%
  filter(between(y, 26, 34))
#> # A tibble: 11 x 2
#>    ds             y
#>    <date>     <dbl>
#>  1 2016-01-01  28.8
#>  2 2016-01-07  29.0
#>  3 2016-01-08  29.2
#>  4 2016-01-09  29.1
#>  5 2016-01-10  27.5
#>  6 2016-01-11  28.1
#>  7 2016-01-12  29.0
#>  8 2016-01-13  31.7
#>  9 2016-01-14  29.2
#> 10 2016-01-15  29.8
#> 11 2016-01-16  30.1
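
If you would rather keep the two-step idea from the question (mark the outliers as NA first, then drop them), that can be done without a loop as well. Note that `for (i in grupo$y)` iterates over the values of `y`, not over row positions, so `grupo$y[i]` ends up using a value as an index. The sketch below is only an illustration: it uses the first five rows from the question, and the names `is_outlier` and `grupo_clean` are made up for the example.

```r
# First five rows from the question, rebuilt here so the sketch runs on its own.
grupo <- data.frame(
  ds = as.Date(c("2015-12-31", "2016-01-01", "2016-01-04",
                 "2016-01-06", "2016-01-07")),
  y = c(35.59050, 28.75111, 25.53158, 17.75369, 29.01500)
)

# Flag every out-of-range value in one vectorized comparison.
is_outlier <- grupo$y < 26 | grupo$y > 34

# Step 1: overwrite the flagged values with NA (the rows are still present).
grupo$y[is_outlier] <- NA

# Step 2: drop the rows whose y is now NA.
grupo_clean <- grupo[!is.na(grupo$y), ]
```

After step 1 the data frame still has five rows, three of them with `y` set to NA; only step 2 actually removes them, leaving the two rows with 28.75111 and 29.01500.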
camille
  • Thanks a lot for your very complete solution Camille, it works just fine, and helped to understand better the use of these functionalities in R. :D – Miguel 2488 Jun 11 '18 at 13:59

With dplyr you could do:

library(dplyr)

# grupo is the data frame from the question
grupo %>%
  filter(y >= 26 & y <= 34)

       ds        y
1  2016-01-01 28.75111
2  2016-01-07 29.01500
3  2016-01-08 29.22663
4  2016-01-09 29.05249
5  2016-01-10 27.54387
6  2016-01-11 28.05674
7  2016-01-12 29.00901
8  2016-01-13 31.66441
9  2016-01-14 29.18520
10 2016-01-15 29.79364
11 2016-01-16 30.07852
Lennyy