0

I have 16 dataframes I am trying to quality check and delete poor quality rows in R. I already know of lapply() and have used it for simpler wrangling problems to apply the same thing to all my dataframes at once, but for whatever reason I'm having a mental block currently.

The format of each individual dataframe is like so, where every other column contains a "flags" column. The flags column contains strings of values. If any of the values in the string are a 4, I want to filter those rows out of the dataframe.

head(df)

timestamp    wind_speed_max wind_speed_max_flags   wind_speed_mean
1            UTC meters per second                  NAN meters per second
2    data logger   Airmar WS-200WX                  NAN   Airmar WS-200WX
3 6/2/2015 15:46               7.6              1 1 4 1              5.12
4 6/2/2015 16:01               7.2              1 1 1 1              5.16
5 6/2/2015 16:16               8.1              1 1 1 1              5.97
6 6/2/2015 16:31               8.5              1 1 1 1             5.909
  wind_speed_mean_flags wind_direction_mean wind_direction_mean_flags
1                   NAN             degrees                       NAN
2                   NAN     Airmar WS-200WX                       NAN
3               1 1 1 1               57.14                   1 2 1 2
4               1 1 1 1               61.64                   1 2 1 4
5               1 1 1 1                  68                   1 2 1 2
6               4 1 1 1               73.14                   1 2 1 2

I know I can try to grep("flags") for the column names, and I also think I could use a similar grep method to filter out the strings containing a 4? Perhaps using some Boolean operators. But I am struggling to piece all of this together to retain the rest of the data, and to ideally perform this at the same time for all 16 dataframes for example lapply(df_list, function(x) <insert code that can filter out flags with 4s for each x dataframe>)

abby23
  • 3
  • 1
  • 1
    Just posted an answer, but I didn't test as your sample data is in an annoying-to-import format due to all the spaces and such in the data. If you share `dput(head(df))` that would provide copy/pasteable code to provide an exact replica of the head of your `df`, and make testing my answer much easier :) – Gregor Thomas Feb 23 '23 at 19:16

1 Answers1

1

Let's start by writing code to filter one data frame - we'll look at the columns that include "flags" in the name and grep for "4". Then we'll use rowSums to count the number of 4s in each row, keeping only rows with 4 count == 0.

# count the number of 4s in each row of "flag" cols of `df`
count_4 = df[grepl("flags", names(df))] |>
  sapply(grepl, pattern = "4") |>
  rowSums(na.rm = TRUE)

Putting it in lapply:

modified_data_list = lapply(data_list, function(df) {
  count_4 = df[grepl("flags", names(df))] |>
    sapply(grepl, pattern = "4") |>
    rowSums(na.rm = TRUE)
  df[count_4 == 0, ]
})
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294