I have the following dataframe in R:

  
library(dplyr)
library(tsibble)
library(fpp3)

usda <- read.csv("https://raw.githubusercontent.com/rhozon/datasets/master/usda_data_stovwflw.csv", head = TRUE, sep = ";")  |> 
  mutate(
    Dates = case_when(
      CalendarYear - MarketYear == 1 ~ paste(CalendarYear,"-", Month),
      CalendarYear - MarketYear == 0 ~ paste(CalendarYear,"-", Month)
    ),
    Dates = gsub(" ", "", Dates)
  ) |>
  drop_na() |>
  mutate(
    Dates = yearmonth(Dates)
  ) |> arrange(Dates) |>
  select(
    Dates,
    AttributeDescription,
    Value
  ) |>
  glimpse()

Rows: 3,898
Columns: 3
$ Dates                <mth> 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 2010 jan, 20…
$ AttributeDescription <chr> "Production", "Area Harvested", "Yield", "Imports", "Exports", "Ending Stocks", "Total Distribution", "Beginning Stocks", "FSI Consumption", "TY Imports…
$ Value                <int> 334052, 32225, 10, 254, 52072, 44817, 376810, 42504, 138945, 300, 279921, 0, 376810, 52000, 140976, 334052, 32225, 42504, 43674, 50802, 282334, 376810, …

Now I'm trying to filter the data by one variable:

usda |> filter(AttributeDescription == "Production")
       Dates AttributeDescription  Value
1   2010 jan           Production 334052
2   2010 feb           Production 334052
3   2010 mar           Production 333533
4   2010 apr           Production 333533
5   2010 may           Production 339614
6   2010 may           Production 333011
7   2010 jun           Production 339614
8   2010 jun           Production 333011
9   2010 jul           Production 336438
10  2010 jul           Production 333011
...

As we can see, May 2010 appears more than once, and the values differ.

How can I filter this data frame so that, for each value of AttributeDescription, only the first occurrence of each month (reading top to bottom) is kept and the later repeats are discarded?

1 Answer

Solution:

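usda_filtered <- usda |>
  arrange(Dates) |>
  group_by(Dates, AttributeDescription) |>
  slice_head(n = 1) |>  # keep the first row of each month/attribute pair
  ungroup()             # drop the grouping so later steps are not affected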
  • If you are using dplyr 1.1.0+ you can use the `by` argument in the `slice_head` function and skip use of `group_by`. – LMc Mar 08 '23 at 23:32
  • If you use `group_by` though I recommend adding a pipe to `ungroup` since leaving a grouped data frame often yields unwanted results. – LMc Mar 08 '23 at 23:32
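
The `by` variant suggested in the comments could look roughly like the sketch below. It assumes dplyr 1.1.0 or later, and it uses a small made-up data frame (`toy`, with values copied from the output shown in the question) in place of the full `usda` data so that it runs on its own; `toy` and `first_rows` are illustrative names only.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which added `by` to slice_head()

# Hypothetical stand-in for `usda`: two months, each reported twice
toy <- data.frame(
  Dates = c("2010 may", "2010 may", "2010 jun", "2010 jun"),
  AttributeDescription = "Production",
  Value = c(339614L, 333011L, 339614L, 333011L)
)

# Keep the first row per (Dates, AttributeDescription) pair.
# Per-operation grouping via `by` returns an ungrouped result,
# so no ungroup() call is needed afterwards.
first_rows <- toy |>
  slice_head(n = 1, by = c(Dates, AttributeDescription))

print(first_rows)
```

With the real data, the same call would be applied to `usda` directly. Since "first" here means first in row order, the data should already be sorted the way you want before slicing.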