Q-Q Plot - Split one plot into 2 groups

Question

I have a data set with 35 rows and I am trying to fit a regression model for the Y variable. Before fitting the regression, I am running a Q-Q plot to check whether the data are normal, but the points follow two separate trends in the same plot, which suggests that there are two groups. How can I split the existing Q-Q plot by group?

qqnorm(sqrt(Total_Crime))
qqline(sqrt(Total_Crime))

Above is the code I am using now.

Expectation:

qqnorm(sqrt(Total_Crime **where crime count is >500**))
qqline(sqrt(Total_Crime ** where crime count is >500**))

This depends on the structure of `Total_Crime`. Please show a part of this object, ideally using `dput(head(Total_Crime))`. — Martin Gal, Oct 18 '21 at 14:09
c(6370, 1515662, 25546, 576090, 970440, 54252) This is what it looks like. — Sibangi Bhowmick, Oct 18 '21 at 14:18
Maybe I am misunderstanding something, but for a linear regression, aren't the data points themselves ideally uniformly distributed? The residuals from the fit, however, should be normally distributed around 0 in order to fulfill one of the validity conditions for an lm. — dario, Oct 18 '21 at 14:26
@dario A linear model assumes normality of the error (approximated by the residuals) and not the input data. — danlooo, Oct 18 '21 at 14:33
I did try `qqnorm(sqrt(Total_Crime[Total_Crime > 500]))`, but it did not work out. I ended up splitting the data frame into two according to the group and then plotting a Q-Q plot for each. But thank you for the help! Appreciate it. — Sibangi Bhowmick, Oct 18 '21 at 22:16

danlooo · Answer 1 · 2021-10-18T14:31:52.837

Let's assume you want to do a qq-plot for every Species as a sample group (subset) to assess normality of the variable Sepal.Length. Then you can use ggplot2:

library(tidyverse)

data <-
  iris %>%
  group_by(Species) %>%
  transmute(Sepal.Length = Sepal.Length %>% scale())
data
#> # A tibble: 150 x 2
#> # Groups:   Species [3]
#>    Species Sepal.Length[,1]
#>    <fct>              <dbl>
#>  1 setosa            0.267 
#>  2 setosa           -0.301 
#>  3 setosa           -0.868 
#>  4 setosa           -1.15  
#>  5 setosa           -0.0170
#>  6 setosa            1.12  
#>  7 setosa           -1.15  
#>  8 setosa           -0.0170
#>  9 setosa           -1.72  
#> 10 setosa           -0.301 
#> # … with 140 more rows

data %>%
  ggplot(aes(sample = Sepal.Length)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Species) +
  coord_fixed()

Created on 2021-10-18 by the reprex package (v2.0.1)
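The same per-group idea can be sketched in base R for a threshold split like the one in the question. This is only an illustration under stated assumptions: the `crime` data frame below is simulated (35 made-up values, not the asker's data), the `group` column and its labels are hypothetical names, and the cutoff of 500 is taken from the question's pseudocode.

```r
# Simulated stand-in for the asker's data: 35 rows in two clearly separated ranges.
# These numbers are made up for illustration only.
set.seed(1)
crime <- data.frame(
  Total_Crime = c(round(runif(15, 50, 450)), round(runif(20, 1000, 1e6)))
)

# Hypothetical grouping column based on the cutoff of 500 from the question
crime$group <- ifelse(crime$Total_Crime > 500, "above 500", "500 or below")

# Draw one Q-Q plot per group, side by side
op <- par(mfrow = c(1, 2))
for (g in unique(crime$group)) {
  x <- sqrt(crime$Total_Crime[crime$group == g])
  qqnorm(x, main = paste("Q-Q plot:", g))
  qqline(x)
}
par(op)  # restore the previous plotting layout
```

If the groups are defined by an existing column rather than a cutoff, that column can be used in place of `group`.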

Please keep in mind that a linear model assumes the error (approximated by the residuals) to be normally distributed and not any covariate.
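A minimal sketch of checking the residuals instead of the raw variable, again using `iris`. The model formula `Sepal.Length ~ Species` is an assumption chosen only for illustration; the asker's actual regression would go in its place.

```r
# Fit an illustrative linear model (the formula is an assumption, not the asker's model)
fit <- lm(Sepal.Length ~ Species, data = iris)

# Extract the residuals, which approximate the model's error term
res <- residuals(fit)

# Q-Q plot of the residuals against the normal distribution
qqnorm(res, main = "Q-Q plot of residuals")
qqline(res)
```

Points lying close to the reference line would be consistent with the normality assumption for the errors.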
