Issues spreading data after removing outliers

Question

I put my data into long format, in order to remove outliers (I grouped by grade and condition, and then removed by 1.5 * IQR), however, I'm having issues getting it back into wide format. The final two columns (condition and variable (36, 37) are what I want to spread the data by (so that BOXED_Conjunction_12 becomes a variable). Variables 1:35 should remain as they are. (There will be NAs introduced given that outliers were removed; however, outliers were only removed per condition and not completely). I think I'm having issues because of removing the outliers, but I would think that fill = NA would solve this issue. Can't figure it out.

I've tried

dat%>%spread(condition,pid.avg_rw, fill = NA)

I've also tried using reshape2:

dat%>%dcast((1:35)~ condition, value.var = "pid.avg_rw")

and I get the error

number of rows of result is not a multiple of vector length (arg 1)Aggregation function missing: defaulting to length

Here's a dput of the first ten lines.

Thanks much,

James

structure(list(pid = c("ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", 
"ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", 
"ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002", "ADMIN-UCSF-bo002"
), timepoint = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1), District.ID = c(175420L, 
175420L, 175420L, 175420L, 175420L, 175420L, 175420L, 175420L, 
175420L, 175420L), School = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L), .Label = c("Bowers", "Bracher", "Cabrillo", 
"Central Park", "Laurelwood", "Millikin", "Peterson"), class = "factor"), 
    Ethnicity = structure(c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
    6L), .Label = c("American Indian or Alaskan Native", "Asian", 
    "Black or African American", "Blank on Purpose", "Filipino", 
    "Hispanic or Latino", "Pacific Islander", "Two or More Races", 
    "White"), class = "factor"), Age.2018 = c(10L, 10L, 10L, 
    10L, 10L, 10L, 10L, 10L, 10L, 10L), Sex = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("F", "M"), class = "factor"), 
    Language.Fluency = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L), .Label = c("English Learner", "English Only", 
    "IFEP-Initially Fluent", "RFEP-Redesignated"), class = "factor"), 
    Parent.Ed.Lvl = structure(c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    5L, 5L), .Label = c("College Graduate", "Declined to state/Unknown", 
    "Grad School/post grad trng", "High School Graduate", "Not HS Graduate", 
    "Some College"), class = "factor"), SpEd = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("No", "Yes"
    ), class = "factor"), SpEd.Dis = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("", "Autism (AUT)", 
    "Emotional Disturbance (ED)", "Hard of Hearing (HH)", "Intellectual Disability (ID)", 
    "Other Health Impairment (OHI)", "Specific Learning Disability (SLD)", 
    "Speech or Language Impairment (SLI)", "Visual Impairment (VI)"
    ), class = "factor"), Low.Income = structure(c(2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    grade = c("3", "3", "4", "3", "3", "4", "3", "3", "4", "3"
    ), gender = c("F", "F", "2", "F", "F", "2", "F", "F", "2", 
    "F"), Teacher = c("Keith, Susan", "Keith, Susan", "Lourdes Martin", 
    "Keith, Susan", "Keith, Susan", "Lourdes Martin", "Keith, Susan", 
    "Keith, Susan", "Lourdes Martin", "Keith, Susan"), time = structure(c(17113, 
    17263, 17417, 17113, 17263, 17417, 17113, 17263, 17417, 17113
    ), class = "Date"), ela.score = c(2424, 2424, NA, 2424, 2424, 
    NA, 2424, 2424, NA, 2424), School.Year = c("2017", "2017", 
    "2018", "2017", "2017", "2018", "2017", "2017", "2018", "2017"
    ), math.score = c(2440, 2440, NA, 2440, 2440, NA, 2440, 2440, 
    NA, 2440), basc = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
    ), cohort = c("3", "3", "3", "3", "3", "3", "3", "3", "3", 
    "3"), attendance = c(96.1, 96.1, 100, 96.1, 96.1, 100, 96.1, 
    96.1, 100, 96.1), tme4 = structure(c(NA, 17655, 17655, NA, 
    17655, 17655, NA, 17655, 17655, NA), class = "Date"), t4.minus = c(6.39692965521615, 
    5.97126183979046, 5.47227067367148, 6.39692965521615, 5.97126183979046, 
    5.47227067367148, 6.39692965521615, 5.97126183979046, 5.47227067367148, 
    6.39692965521615), median_grade = c(1536.4, 1536.4, 1372.4, 
    1192, 1192, 1054, 986.6, 986.6, 871.6, 958.4), mad_grade = c(377.17344, 
    377.17344, 278.13576, 167.5338, 167.5338, 161.89992, 139.66092, 
    139.66092, 116.23584, 143.21916), lowerq = c(1323.7, 1323.7, 
    1226.2, 1102.2, 1102.2, 960.6, 902.9, 902.9, 804, 873.5), 
    upperq = c(1964.8, 1964.8, 1655.6, 1329.3, 1329.3, 1181.6, 
    1091.9, 1091.9, 964.2, 1074.1), iqr = c(641.1, 641.1, 429.4, 
    227.1, 227.1, 221, 189, 189, 160.2, 200.6), grade.threshold.upper = c(3888.1, 
    3888.1, 2943.8, 2010.6, 2010.6, 1844.6, 1658.9, 1658.9, 1444.8, 
    1675.9), grade.threshold.lower = c(-599.6, -599.6, -61.9999999999995, 
    420.9, 420.9, 297.6, 335.9, 335.9, 323.4, 271.7), mad = c(377.17344, 
    377.17344, 278.13576, 167.5338, 167.5338, 161.89992, 139.66092, 
    139.66092, 116.23584, 143.21916), z_rw = c(0.350390238376874, 
    0.0417183791440274, 0.171148318277673, -0.108910138097997, 
    -0.497500239197831, -0.365723152941879, 0.512731829784946, 
    -0.588322005081869, -0.0970981769116109, -0.290844134905211
    ), condition = c("BOXED_Conjunction_12", "BOXED_Conjunction_12", 
    "BOXED_Conjunction_12", "BOXED_Conjunction_4", "BOXED_Conjunction_4", 
    "BOXED_Conjunction_4", "BOXED_Feature_12", "BOXED_Feature_12", 
    "BOXED_Feature_12", "BOXED_Feature_4"), pid.avg_rw = c(2140, 
    1845.6, 1884.4, 1242.8, 1088.4, 973.6, 1160.4, 887.6, 910.8, 
    929.2), avg_rw_grade = c(1805.81052631579, 1805.81052631579, 
    1686.41503416856, 1286.07368421053, 1286.07368421053, 1148.48656036446, 
    1033.36421052632, 1033.36421052632, 982.933485193622, 1001.18526315789
    ), sd_grade = c(953.763652869694, 953.763652869694, 1156.80345459324, 
    397.333847576144, 397.333847576144, 478.193844053012, 247.762635541793, 
    247.762635541793, 742.892271389251, 247.504606484003)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L))

I can't confirm the issue. Based on your sample data `df %>% spread(condition, pid.avg_rw, fill = NA)` works fine. — Maurits Evers, Sep 12 '19 at 23:47
Changed the data; this should give a better idea of what's going on. This dataset should only be three rows, but it is ten. — James, Sep 13 '19 at 03:09
*"This dataset should only be three rows, but it is ten."* What makes you think that? The dataset you posted has 10 rows; spreading `pid.avg_rw` with respect to the `condition` as `key` will not give you 3 rows, as e.g. columns `median_grade`, `mad_grade`, `lowerq` have all different values. I have a feeling the issue is with a misunderstanding of how `spread` (and long-to-wide conversion in general) works. Are you perhaps after something like `dat %>% select(pid:t4.minus, condition, avg_rw_grade) %>% spread(condition, avg_rw_grade)`? — Maurits Evers, Sep 13 '19 at 03:26
Change the "avg_rw_grade" to "pid.avg_rw" then yes, that is exactly what I was looking for. Thanks! — James, Sep 13 '19 at 04:10

Issues spreading data after removing outliers

0 Answers0