0

I am trying to work on a for loop to make running a function I've developed more efficient. However, when I put it in a for loop, it is overwriting columns that it should not be and returning incorrect results.

Edit: The error is that in the resulting dataframe MiSeq_Bord_Outliers_table0, the resulting columns containing label Outlier_type is returning incorrect outputs.
As per the Outlier_Hunter function, when Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their respective Q1 Thresholds their respective Outlier_type columns should read "Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3 Threshold then "Upper_outlier". But when the for loop is executed, only "Upper_outlier" is shown in the Outlier_type columns.

Edit: The inputs have been simplified and tested on the different computer with a clean console. If there were any artifacts there before, they should have been eliminated now, and there should be no errors here now. It is important to run the outlier_results_1var part first. If you test run this code and get errors, please let me know which part failed.

Edit: MiSeq_Bord_Outliers_table0_error is the error that is being reproduced. This is the error result, not an input.

Can someone please tell me why is it returning these incorrect results and what I can do to fix it? I will upload the relevant code below. Or is there another way to do this without a for loop?

#libraries used

library(tidyverse)
library(datapasta)
library(data.table)
library(janitor)
library(ggpubr)
library(labeling)

#2.) Outlier_Hunter Function
#Function to Generate the Outlier table
#Outlier Hunter function takes 4 arguments: the dataset, column/variable of interest, 
#Q1 and Q3. Q1 and Q3 are stored in the results of Quartile_Hunter.
#Input ex: MiSeq_Bord_final_report0, Avg_Trim_Cov, MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1]
#Usage ex: Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov, 
#MiSeq_Bord_Quartiles_ATC$First_Quartile[1], MiSeq_Bord_Quartiles_ATC$Third_Quartile[1])

#Here is the Function to get the Outlier Table
Outlier_Hunter <- function(Platform_Genus_final_report0, my_col, Q1, Q3) {
  #set up and generalize the variable name you want to work with
  varname <- enquo(my_col)
  #print(varname) #just to see what variable the function is working with
  #get the outliers
  Platform_Genus_Variable_Outliers <- Platform_Genus_final_report0 %>%
    select(ReadID, Platform, Genus, !!varname) %>%
  #Tell if it is an outlier, and if so, what kind of outlier 
    mutate(
      Q1_Threshold = Q1,
      Q3_Threshold = Q3,
      Outlier_type = 
        case_when(
          !!varname < Q1_Threshold ~ "Lower_Outlier",
          !!varname >= Q1_Threshold & !!varname <=  Q3_Threshold ~ "Normal",
          !!varname > Q3_Threshold ~ "Upper_Outlier"
        )
    )
}
#MiSeq_Bord_Quartiles entries
MiSeq_Bord_Quartiles <- data.frame(
  stringsAsFactors = FALSE,
         row.names = c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed"),
          Platform = c("MiSeq", "MiSeq"),
             Genus = c("Bord", "Bord"),
               Min = c(0.03, 295),
    First_Quartile = c(80.08, 687613.25),
            Median = c(97.085, 818806.5),
    Third_Quartile = c(121.5625, 988173.75),
               Max = c(327.76, 2836438)
)
#Remove the hashtag below to test if what you have is correct
#datapasta::df_paste(head(MiSeq_Bord_Quartiles, 5))
#dataset entry
MiSeq_Bord_final_report0 <- data.frame(
               stringsAsFactors = FALSE,
                                    ReadID = c("A005_20160223_S11_L001","A050_20210122_S6_L001",
                                               "A073_20210122_S7_L001",
                                               "A076_20210426_S11_L001",
                                               "A080_20210426_S12_L001"),
                                  Platform = c("MiSeq","MiSeq",
                                               "MiSeq","MiSeq","MiSeq"),
                                     Genus = c("Bordetella",
                                               "Bordetella","Bordetella",
                                               "Bordetella","Bordetella"),
                           Avg_Raw_Read_bp = c(232.85,241.09,
                                               248.54,246.99,248.35),
                       Avg_Trimmed_Read_bp = c(204.32,232.6,
                                               238.56,242.54,244.91),
                              Avg_Trim_Cov = c(72.04,101.05,
                                               92.81,41.77,54.83),
                 Genome_Size_Mb = c(4.1, 4.1, 4.1, 4.1, 4.1),
                            S1_Input_reads = c(1450010L,
                                               1786206L,1601542L,710792L,925462L),
                      S1_Contaminant_reads = c(12220L,6974L,
                                               7606L,1076L,1782L),
                    S1_Total_reads_removed = c(12220L,6974L,
                                               7606L,1076L,1782L),
                           S1_Result_reads = c(1437790L,
                                               1779232L,1593936L,709716L,923680L),
                     S2_Read_Pairs_Written = c(712776L,882301L,
                                               790675L,352508L,459215L),
             S2_Total_Read_Pairs_Processed = c(718895L,889616L,
                                               796968L,354858L,461840L)
 )

MiSeq_Bord_final_report0

#Execution for 1 variable
outlier_results_1var <- Outlier_Hunter(MiSeq_Bord_final_report0, Avg_Trim_Cov,
                                       MiSeq_Bord_Quartiles$First_Quartile[1], MiSeq_Bord_Quartiles$Third_Quartile[1])
#Now do it with a for loop
col_var_outliers <- row.names(MiSeq_Bord_Quartiles)
#col_var_outliers <- c("Avg_Trim_Cov", "S2_Total_Read_Pairs_Processed")
#change line above to change input of variables few into Outlier Hunter Function
outlier_list_MiSeq_Bord <- list()
for (y in col_var_outliers) {
  outlier_results0 <- Outlier_Hunter(MiSeq_Bord_final_report0, y, MiSeq_Bord_Quartiles[y, "First_Quartile"], MiSeq_Bord_Quartiles[y, "Third_Quartile"])
  outlier_results1 <- outlier_results0
  colnames(outlier_results1)[5:7] <- paste0(y, "_", colnames(outlier_results1[, c(5:7)]), sep = "")
  outlier_list_MiSeq_Bord[[y]] <- outlier_results1
}

MiSeq_Bord_Outliers_table0 <- reduce(outlier_list_MiSeq_Bord, left_join, by = c("ReadID", "Platform", "Genus"))

#the columns containing label Outlier_type is where the code goes wrong.  
#When Avg_Trim_Cov and S2_Total_Read_Pairs_Processed are below their 
#respective Q1 Thresholds their respective Outlier_type columns should read 
#"Lower_Outlier", if between Q1 & Q3 Threshold, "Normal" and if above Q3 
#Threshold then "Upper_outlier".  But when the for loop is executed, only 
"Upper_outlier" is shown in the Outlier_type columns.

datapasta::df_paste(head(MiSeq_Bord_Outliers_table0, 5))
MiSeq_Bord_Outliers_table0_error <- data.frame(
                            stringsAsFactors = FALSE,
                                      ReadID = c("A005_20160223_S11_L001",
                                                 "A050_20210122_S6_L001",
                                                 "A073_20210122_S7_L001","A076_20210426_S11_L001",
                                                 "A080_20210426_S12_L001"),
                                    Platform = c("MiSeq",
                                                 "MiSeq","MiSeq","MiSeq",
                                                 "MiSeq"),
                                       Genus = c("Bordetella","Bordetella","Bordetella",
                                                 "Bordetella","Bordetella"),
                                Avg_Trim_Cov = c(72.04,
                                                 101.05,92.81,41.77,54.83),
                   Avg_Trim_Cov_Q1_Threshold = c(80.08,
                                                 80.08,80.08,80.08,80.08),
                   Avg_Trim_Cov_Q3_Threshold = c(121.5625,
                                                 121.5625,121.5625,121.5625,
                                                 121.5625),
                   Avg_Trim_Cov_Outlier_type = c("Upper_Outlier","Upper_Outlier",
                                                 "Upper_Outlier","Upper_Outlier",
                                                 "Upper_Outlier"),
               S2_Total_Read_Pairs_Processed = c(718895L,
                                                 889616L,796968L,354858L,
                                                 461840L),
  S2_Total_Read_Pairs_Processed_Q1_Threshold = c(687613.25,
                                                 687613.25,687613.25,
                                                 687613.25,687613.25),
  S2_Total_Read_Pairs_Processed_Q3_Threshold = c(988173.75,
                                                 988173.75,988173.75,
                                                 988173.75,988173.75),
  S2_Total_Read_Pairs_Processed_Outlier_type = c("Upper_Outlier","Upper_Outlier",
                                                 "Upper_Outlier","Upper_Outlier",
                                                 "Upper_Outlier")
)

Diego
  • 23
  • 6
  • 2
    I've looked through this very long question and do not find answers the questions brought up by you asking " it is overwriting columns that it should not be and returning incorrect results". Which columns should have been altered and were not and which ones were altered shouldn't have been; and what is "incorrect"? – IRTFM Feb 12 '22 at 00:55
  • Also: `Error in enquo(my_col) : could not find function "enquo"`. So at lest one missing `library` call. Also you show output from a variety of function calls on `MiSeq_Bord_Outliers_table0` and `MiSeq_Bord_Quartiles` but never show dput on those objects themselves. So you should be trying to make this reproducible. Make it show something will happen after a cut-paste operation into an R console session. – IRTFM Feb 12 '22 at 00:56
  • Hello, I've just made an edit to the top. I think this will make more explicit what the problem is. I made a mention of it at the last section. I've also added the libraries used. As for the dput on the MiSeq_Bord_Quartiles and MiSeq_Bord_Outliers_table0, do you need me to put those in a ready made dataframe to make it easier? – Diego Feb 12 '22 at 03:02
  • 1
    Typically, data objects are best presented with dput. If they are large then sometimes it’s adequate to just present the head. If the head strategy is used, you may add clarity by also including the output of str. I’m not likely to return to this for another 12 hours, but maybe someone else will be monitoring active questions. ) – IRTFM Feb 12 '22 at 04:29
  • Ok, for the last 3 sections of code I used datapasta. The dataframes should now be saving to their respective dataframe names. Let me know if this works for you now. – Diego Feb 12 '22 at 14:23
  • I tried another way to do it. `MiSeq_Bord_final_report0_test <- MiSeq_Bord_Quartiles %>% split(rownames(.)) %>% Outlier_Hunter(MiSeq_Bord_final_report0, ., ~.[".", "First_Quartile"], ~.[".", "Third_Quartile"])` This might do the for loop too. It uses the purrr library. *Error: Must subset columns with a valid subscript vector. x Subscript has the wrong type `list`. ℹ It must be numeric or character. Run `rlang::last_error()` to see where the error occurred. * I think I'm calling getting the syntax wrong but Im not sure what I'm doing wrong. – Diego Feb 16 '22 at 13:18
  • 1
    When I run the recude operations I get the error: "Error in UseMethod("left_join") : no applicable method for 'left_join' applied to an object of class "NULL"". When I look at `outlier_list_MiSeq_Bord` I get `[[1]] NULL` as first elevemnt. – IRTFM Feb 16 '22 at 18:15
  • 1
    So you don't have a reproducible example, and it is surely not "minimal". If you wnat help here yopu will need to distill this down so it is smaller and reproducible, i.e. people can cut-and-paste code inot their console session and get it to run. You need to create a clean console session yourself with no `.Rdata` files in the background to pollute your global environment with lurking data objects that you have forgotten to include in your presentation. – IRTFM Feb 16 '22 at 18:24
  • You may find this vignette useful: https://rlang.r-lib.org/reference/topic-data-mask-programming.html#subsetting-the-data-pronoun – Mikko Marttila Feb 16 '22 at 19:08
  • @IRTFM I've just re-tested this whole code on a different computer and with a clean console. There should be no artifacts or .Rdata objects here. If you try it again, could you please let me know if you are able to generate the MiSeq_Bord_Quartiles (this is a 2 x 7 df) and MiSeq_Bord_final_report0 (this is a 5 x 13 df) dataframes, and if you can get the outlier_results_1var to run without any errors, or if you do get errors, where they are? – Diego Feb 16 '22 at 19:22
  • 1
    I get no errors. With the Outlier_Hunter call I get `outlier_results_1var$Outlier_type` retruning `[1] "Lower_Outlier" "Normal" "Normal" "Lower_Outlier" "Lower_Outlier"` – IRTFM Feb 16 '22 at 22:14

1 Answers1

1

For use in a loop like you do, it would be more useful to write your Outlier_Hunter() function to take the target column as a character string rather than an expression.

To do that, try replacing all instances of !!varname in your function with .data[[my_col]], and remove the enquo() line altogether.

Note that with these changes, you also need to change how you call the function when you don't have the column name in a variable. For example, your single execution would become:

Outlier_Hunter(
  MiSeq_Bord_final_report0,
  "Avg_Trim_Cov",
  MiSeq_Bord_Quartiles$First_Quartile[1], 
  MiSeq_Bord_Quartiles$Third_Quartile[1]
)

For more info about programming with tidy evaluation functions, you may find this rlang vignette useful.

Mikko Marttila
  • 10,972
  • 18
  • 31
  • Thanks, I've just run it with the edits and I get an error: 'Error: object 'Avg_Trim_Cov' not found' – Diego Feb 16 '22 at 19:39
  • 1
    @Diego ah yes I forgot to mention that with these changes, you can't use a bare name anymore for single execution. You'll need to use `"Avg_Trim_Cov"` there. – Mikko Marttila Feb 16 '22 at 19:45
  • Ok, great. Now it works for the outlier_results_1var stage. I'll test it on the loop and see how it goes. – Diego Feb 16 '22 at 19:51
  • Yes!!! It seems to work now. I'll keep testing it and double check it for QC, and comment again if something doesn't seem right but this seems to have done the trick. Thank you! – Diego Feb 16 '22 at 19:55
  • I'm glad to hear that! You're welcome. – Mikko Marttila Feb 16 '22 at 19:58