I am a new user of both Python and R, and I have begun using them to try to answer a scientific question. What I am trying to do is the following:

  1. For a matrix, find the mean and standard deviation of each column.
  2. Remove every row that contains any value outside its column's mean ± 3 SD.
  3. After removing these rows, calculate a new mean and standard deviation and repeat this cycle until there are no longer any outliers.

This would be done with a matrix of approximately 1000 rows and 20 columns.

I would appreciate any guidance as I am really just learning. Thanks!

  • The intro manual at http://www.r-project.org will be of good help to the R newcomer. – Rich Scriven Oct 24 '14 at 02:42
  • Hello, and welcome to Stack Overflow! While we're happy to help, your question is broad. Please include the code you've tried and what the specific issue is. Please take a moment to read http://stackoverflow.com/help/how-to-ask and understand the guidelines for asking a well-formatted question. – Parker Oct 24 '14 at 02:52
  • I would dispute that this duplicates your question, but you could probably learn a lot of what you need from this question (http://stackoverflow.com/questions/18397805/how-do-i-delete-a-row-in-a-numpy-array-which-contains-a-zero). Read that and edit your question to show us where you're really stuck. – candied_orange Oct 24 '14 at 03:03

1 Answer

You could try this in R. First, check whether any value lies outside its column's mean ± 3 SD:

  meanSDs <- apply(m1, 2, function(x) c(mean(x)+3*sd(x), mean(x)-3*sd(x)))
  any(m1 > meanSDs[1,][col(m1)] | m1 < meanSDs[2,][col(m1)])
  #[1] TRUE

Create a function that repeats the filtering until no outliers remain:

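  fun1 <- function(mat, n) {
    repeat {
      # Row 1: upper bound (mean + n*SD); row 2: lower bound (mean - n*SD), per column
      meanSDs <- apply(mat, 2, function(x)
                         c(mean(x) + n*sd(x), mean(x) - n*sd(x)))
      # TRUE where a value lies outside the bounds of its own column
      indx <- mat > meanSDs[1,][col(mat)] | mat < meanSDs[2,][col(mat)]
      # Keep only the rows that contain no outlier
      indx1 <- !rowSums(indx)
      # drop = FALSE keeps the result a matrix even if a single row is left
      mat <- mat[indx1, , drop = FALSE]
      if (all(indx1)) break
    }
    return(mat)
  }

  m2 <- fun1(m1, 3)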

Checking the results

  meanSDs <- apply(m2, 2, function(x) c(mean(x)+3*sd(x), mean(x)-3*sd(x)))
  any(m2 > meanSDs[1,][col(m2)] | m2 < meanSDs[2,][col(m2)])
  #[1] FALSE

Data

  set.seed(79)
  m1 <- matrix(rnorm(1000*20, 1, 8), ncol=20)
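
Since the question mentions Python as well, here is a minimal sketch of the same loop using only the standard library. The function name `remove_outlier_rows`, the parameter `n_sd`, and the random test data are illustrative assumptions, not something from the R code above. `statistics.stdev` uses the n - 1 denominator, as R's `sd()` does.

```python
import random
import statistics


def remove_outlier_rows(rows, n_sd=3.0):
    """Repeatedly drop rows holding any value outside column mean +/- n_sd * SD."""
    rows = [list(r) for r in rows]  # work on a copy, leave the input untouched
    while len(rows) > 1:  # a sample SD needs at least two rows
        columns = list(zip(*rows))
        means = [statistics.mean(c) for c in columns]
        sds = [statistics.stdev(c) for c in columns]  # n - 1 denominator
        # Keep a row only if every value is within its column's bounds
        kept = [
            r for r in rows
            if all(abs(v - m) <= n_sd * s for v, m, s in zip(r, means, sds))
        ]
        if len(kept) == len(rows):  # nothing removed in this pass: done
            break
        rows = kept
    return rows


# Hypothetical test data: 1000 rows x 20 columns, normal with mean 1 and SD 8
rng = random.Random(79)
data = [[rng.gauss(1, 8) for _ in range(20)] for _ in range(1000)]
cleaned = remove_outlier_rows(data)
print(len(data), "->", len(cleaned))
```

For a matrix of this size plain lists are fast enough. With larger data, an array library would probably be the more natural choice.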
akrun
  • Thanks for the response everyone. I will get started with this help and post any issues I have. – Samwise327 Oct 24 '14 at 14:26
  • akrun, thanks so much. It took me a week to figure out what you did, but this works. It also helped me learn how to structure these sorts of problems. Thanks so much! – Samwise327 Nov 02 '14 at 03:19
  • @Samwise327 No problem. For these kinds of problems, I would use either `while` or `repeat`. – akrun Nov 02 '14 at 03:55