2

I have a file with about 17000 rows and I preformed a simple linear regression on

    Gene_id expA expB
    GeneA 5.462109    5.006181
    GeneB 2.667692    4.208152
    GeneC 4.796976    4.122660
    GeneD 3.127125    3.676322
    GeneE 4.500583    4.104575
    GeneF 4.598430    4.853717

And I preformed the regression analysis using

plot(log2(data$expA)~log2(data$expB))
regression <- lm(log2(moved.data$expA)~log2(moved.data$expB))
abline(regression)

I am interested in what Genes which are outliers are from my regression analysis.

I tried using the identify(log2(data$expA)~log2(data$expB), row.names(data)) function but I have a lot of points in my graph so to go click over them one by one doesn't seem feasible to me.

I also looked here: Outliers with robust regression in R

but that doesn't tell me how to figure out the Gene names of the outliers

Community
  • 1
  • 1
user3816990
  • 247
  • 1
  • 2
  • 10
  • 1
    How are you defining "outlier"? There's no universally accepted statistical definition. – MrFlick Feb 04 '15 at 19:09
  • shouldn't it be something that is significantly away from the linear regression line, that doesn't fit the 'linear model' well – user3816990 Feb 04 '15 at 19:34

1 Answers1

2

Aren't you really interested in the residuals (how much the results for each gene differs from the regression)? Residuals tell you how closely your data fits your model. In this case, you have a simple linear model. Having many large residuals would indicate that you need a better model.

I would never remove outliers if at all possible. If you do remove them, please be honest and report that they were removed and the criteria for their removal. To be completely honest and transparent, make your data and R script available.

Anyway, keeping what I just said in mind, here is a way to name the "outliers":

# Create sample data
Gene_id <- c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF")
expA    <- c(5.462109, 2.667692, 4.796976, 3.127125, 4.500583, 4.598430)
expB    <- c(5.006181, 4.208152, 4.122660, 3.676322, 4.104575, 4.853717)

# Calculate log base two for results from each experiment
log2_A <- log2(expA)
log2_B <- log2(expB)

# Plot data
plot(log2_A~log2_B)

# Calculate and display the regression line
regression <- lm(log2_A~log2_B)
abline(regression)

# Show regression formula
print(regression)

# Create data frame for sample data
data <- data.frame(Gene_id, expA, expB)

# Calculate residuals
data$residuals <- residuals(regression)

# Choose a threshhold
outlier_threshold <- 0.3

# Print only names of outliers
outliers <- data[ abs(data$residuals) > outlier_threshold, ]
print(outliers$Gene_id)
Christopher Bottoms
  • 11,218
  • 8
  • 50
  • 99
  • Yes you're right, residuals. I wasn't planning to remove them, I wanted to examine them further. How did you get an idea of which threshold to pick. Is 0.3 conventional? Like how pvalue of 0.05 is often used? – user3816990 Feb 04 '15 at 20:26
  • @user3816990 It was arbitrary and has no relation to p-value. It is a value that causes two of the six sample data points to be considered outliers. – Christopher Bottoms Feb 04 '15 at 20:28
  • I guess you could do a histogram of the residuals and get a feel for what to choose a threshold, even though it's arbitrary right? – user3816990 Feb 04 '15 at 20:37
  • Yes that would be arbitrary. What are you trying to do? Has anyone else done something similar? If so, then read up on the literature for your particular problem. What is the accepted statistical standard for a result? You might be surprised and find out, for example, that there is an R package in [Bioconductor](http://bioconductor.org/) that already does what you are trying to do. But even then, you'll probably want to find out how other people are using that package to do the same type of research that you are doing. – Christopher Bottoms Feb 04 '15 at 20:45