
I have a data set with 35 rows and I am trying to fit a regression model for the response (Y). Before fitting, I draw a Q-Q plot to check whether the data are normally distributed. Instead of following a single line, the points follow two separate trends in the same plot, which suggests that the data contain two groups. How can I split the existing Q-Q plot by group?

qqnorm(sqrt(Total_Crime))
qqline(sqrt(Total_Crime))

The code above is what I am using now.

What I would like to do (pseudocode):

qqnorm(sqrt(Total_Crime **where crime count is >500**))
qqline(sqrt(Total_Crime ** where crime count is >500**))
Martin Gal
  • This depends on the structure of `Total_Crime`. Please show part of this object, ideally using `dput(head(Total_Crime))`. – Martin Gal Oct 18 '21 at 14:09
  • `c(6370, 1515662, 25546, 576090, 970440, 54252)` This is what it looks like. – Sibangi Bhowmick Oct 18 '21 at 14:18
  • So try `qqnorm(sqrt(Total_Crime[Total_Crime > 500]))`. – Martin Gal Oct 18 '21 at 14:21
  • Maybe I am misunderstanding something, but for a linear regression, aren't the data points themselves supposed to be uniformly distributed (best case)? The residuals from the fit, however, should be normally distributed around 0 to fulfill one of the validity conditions for an lm, shouldn't they? – dario Oct 18 '21 at 14:26
  • @dario A linear model assumes normality of the error (approximated by the residuals) and not the input data. – danlooo Oct 18 '21 at 14:33
  • I did try `qqnorm(sqrt(Total_Crime[Total_Crime > 500]))`, but it did not work. I ended up splitting the data frame into two according to the group and then plotting a Q-Q plot for each. But thank you for the help! Appreciate it. – Sibangi Bhowmick Oct 18 '21 at 22:16

1 Answer


Let's assume you want a Q-Q plot for each Species (that is, for each sample group or subset) to assess the normality of the variable Sepal.Length. After standardizing Sepal.Length within each group, you can use ggplot2 and draw one facet per group:

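library(tidyverse)

# Standardize Sepal.Length within each species (mean 0, sd 1 per group).
# scale() returns a one-column matrix, hence the `[,1]` in the printed header.
data <-
  iris %>%
  group_by(Species) %>%
  transmute(Sepal.Length = Sepal.Length %>% scale())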
data
#> # A tibble: 150 x 2
#> # Groups:   Species [3]
#>    Species Sepal.Length[,1]
#>    <fct>              <dbl>
#>  1 setosa            0.267 
#>  2 setosa           -0.301 
#>  3 setosa           -0.868 
#>  4 setosa           -1.15  
#>  5 setosa           -0.0170
#>  6 setosa            1.12  
#>  7 setosa           -1.15  
#>  8 setosa           -0.0170
#>  9 setosa           -1.72  
#> 10 setosa           -0.301 
#> # … with 140 more rows

# Draw one normal Q-Q panel per species, each with its own reference line
data %>%
  ggplot(aes(sample = Sepal.Length)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~Species) +
  coord_fixed()

Created on 2021-10-18 by the reprex package (v2.0.1)
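
If you prefer to stay with base graphics, as in the question, the same per-group idea can be sketched with `split()`. This is only a minimal sketch: the `Total_Crime` vector below is simulated (hypothetical values, 35 observations), and the cutoff of 500 is taken from the question as an assumed group boundary.

```r
# Hypothetical data: 35 crime counts from two groups on very different scales
set.seed(1)
Total_Crime <- c(rpois(15, lambda = 200), rpois(20, lambda = 50000))

# Assumed group boundary (the cutoff of 500 mentioned in the question)
high <- Total_Crime > 500

# Split the square-root-transformed values into one vector per group
groups <- split(sqrt(Total_Crime), ifelse(high, "> 500", "<= 500"))

# One Q-Q plot per group, side by side
op <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  qqnorm(groups[[g]], main = paste("Crime count", g))
  qqline(groups[[g]])
}
par(op)  # restore the previous plotting layout
```

Each element of `groups` is an ordinary numeric vector, so `qqnorm()` and `qqline()` are used exactly as for the full data.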

Please keep in mind that a linear model assumes that the errors (approximated by the residuals) are normally distributed, not the covariates.
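
To check that assumption, the Q-Q plot would be drawn from the residuals of the fitted model and not from the raw variable. A minimal sketch, assuming a purely illustrative model of `Sepal.Length` on `Species` from the `iris` data (the formula is an assumption chosen for demonstration, not the model from the question):

```r
# Illustrative model only: the formula is an assumption for demonstration
fit <- lm(Sepal.Length ~ Species, data = iris)

# Residuals approximate the unobserved errors of the model
res <- residuals(fit)

# Normal Q-Q plot of the residuals with a reference line
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
```

Because the group effect is already part of the model, residuals from a model that includes the grouping variable may look far closer to normal than the pooled raw data do.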

danlooo