Questions tagged [outliers]

An outlier is an observation that appears to be unusual or not well described relative to a simple characterization of a dataset.

Overview

Outliers are not necessarily bad or wrong, nor do they need to be removed from data for further analysis. However, outliers (of which there can be more than one in any set of data) indicate that some data at least appear to differ from the bulk of the dataset, suggesting they should be individually examined and understood. Also, some statistical procedures are sensitive to outliers: this means that removal of one or more outliers could substantially change the conclusions of those procedures.
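As a small illustration of that sensitivity, the sketch below compares the mean and the median of a made-up sample before and after adding one extreme value. The numbers are invented for the example and are not taken from any question on this page.

```python
import statistics

# Hypothetical sample; the values are chosen only for illustration.
values = [10.0, 11.0, 12.0, 13.0, 14.0]
with_outlier = values + [500.0]  # one extreme observation added

# The mean is pulled strongly toward the extreme value...
mean_clean = statistics.mean(values)
mean_dirty = statistics.mean(with_outlier)

# ...while the median barely moves.
median_clean = statistics.median(values)
median_dirty = statistics.median(with_outlier)

print(f"mean:   {mean_clean:.2f} -> {mean_dirty:.2f}")
print(f"median: {median_clean:.2f} -> {median_dirty:.2f}")
```

A procedure built on the mean would therefore give quite different answers with and without that single point, whereas one built on the median would not.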

Tag usage

Consider whether the question would be more suitable on Stack Overflow (programming-related) or Cross Validated (statistics-related).

In R (software for statistical computing and graphics), the function boxplot.stats provides a basic method for detecting outliers.
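The boxplot rule flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The following is a minimal Python sketch of that idea, not a port of boxplot.stats: the helper name tukey_outliers and the sample data are hypothetical, and the quartiles come from statistics.quantiles rather than from the hinges that R uses, so results can differ slightly on small samples.

```python
import statistics

def tukey_outliers(data, coef=1.5):
    """Return the values lying more than coef * IQR beyond the quartiles.

    Hypothetical helper for illustration; quartiles use the 'inclusive'
    method of statistics.quantiles, which is not identical to R's hinges.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - coef * iqr
    upper = q3 + coef * iqr
    return [x for x in data if x < lower or x > upper]

# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
flagged = tukey_outliers(sample)
print(flagged)
```

The coef parameter plays the same role as the 1.5 multiplier in the usual boxplot rule; raising it makes the rule flag fewer points.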

1199 questions
4
votes
1 answer

Capping the outliers

I have a data frame with 3 numerical variables for which I am trying to cap the outliers between the 0.01 and 0.99 quantiles, but it's not working. df[['TotalVisits', 'Total Time Spent on Website', 'Page Views Per Visit']].describe(percentiles=[.25, .5,…
4
votes
2 answers

Pandas replace by NaN if the difference with the previous row is above a threshold

I have a half-hourly dataframe df from which I want to remove outliers. date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00'] value_column = [33.24 , 500 , 34.39 , 34.49 ] df = pd.DataFrame({'value…
Peslier53
4
votes
1 answer

What does setting the 'contamination' parameter to 'auto' in Sklearn Outlier Detection methods do?

I have a dataset where I need to be able to control to what extent the Outlier Detection Model (Isolation Forest, Elliptic Envelope, OneClassSVM...) considers a given point an outlier or not (something similar to the Z-score or IQR-score). This…
4
votes
4 answers

How to find outliers in document classification with million documents?

I have a million documents which belong to different classes (100 classes). I want to find outlier documents in each class (which don't belong to that class but were wrongly classified) and filter them. I can do document similarity using cosine…
4
votes
1 answer

How to identify outliers with density plot

I'm trying to identify outliers with my density plot. I am currently using the seaborn library to plot my data. How would I go about identifying outliers? I have been looking at implementing the Z-score with the stats library, is this the only way…
B.Billy
4
votes
3 answers

How to Replace Outliers with Median in Pandas dataframe?

Here's my dataframe: cars_num_df.head(10) mpg cylinders displacement horsepower weight acceleration age 0 18.0 8 307.0 130.0 3504.0 12.0 13 1 15.0 8 350.0 165.0 3693.0 …
morelloking
4
votes
3 answers

ROC curve for Isolation Forest

I am trying to plot the ROC curve to evaluate the accuracy of Isolation Forest for a Breast Cancer dataset. I calculated the True Positive rate (TPR) and False Positive Rate (FPR) from the confusion matrix. However, I do not understand how the TPR…
Nnn
4
votes
2 answers

Remove remains in a letter image with Python

I have a set of images that represent letters extracted from an image of a word. In some images there are remains of the adjacent letters and I want to eliminate them but I do not know how. I'm working with OpenCV and I've tried two…
Udl David
4
votes
2 answers

Drop rows based on one column values

I've a dataframe which looks like this: wave mean median mad 0 4050.32 -0.016182 -0.011940 0.008885 1 4208.98 0.023707 0.007189 0.032585 2 4508.28 3.662293 0.001414 7.193139 3 4531.62 -15.459313…
4
votes
2 answers

Pandas: How to detect the peak points (outliers) in a dataframe?

I have a pandas dataframe with several speed values that change continuously. It's sensor data, so we often get errors in the middle at some points. The moving average doesn't seem to help either, so what methods can I…
id101112
4
votes
1 answer

tsoutliers dependency issue: dependency KFKSDS has non-zero exit status?

While working on outlier detection on time series data, I came across the [tsoutliers][1] package, which implements Chen and Liu's time series outlier detection. But I am unable to install tsoutliers in R: install.packages("tsoutliers") I am…
Anoop Toffy
4
votes
1 answer

Detecting outliers in a Pandas dataframe using a rolling standard deviation

I have a DataFrame for a fast Fourier transformed signal. There is one column for the frequency in Hz and another column for the corresponding amplitude. I have read a post made a couple of years ago, that you can use a simple boolean function to…
Jack
4
votes
1 answer

Remove outliers from pandas dataframe python

I have code that creates a dataframe using pandas: import pandas as pd import numpy as np x = (g[0].time[:111673]) y = (g[0].data.f[:111673]) df = pd.DataFrame({'Time': x, 'Data': y}) #df This prints out: Data Time 0 …
eliza.b
4
votes
1 answer

Removing Multivariate Outliers With mvoutlier

Problem: I have a dataframe that consists of more than 5 variables at any time and am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I've been trying to look for a few hours on how to calculate and remove multivariate…
Jon
4
votes
1 answer

Isolation Forest

I'm currently working on identifying outliers in my data set using the IsolationForest method in Python, but don't completely understand the example on…
bosbraves