Questions tagged [statistics]

Consider whether your question would be better asked at https://stats.stackexchange.com. Statistics is the mathematical study of using probability to infer characteristics of a population from a limited number of samples or observations.

Statistics is the scientific study of the collection, analysis, interpretation, presentation, and organization of data. Numerous programming languages provide support for implementing statistical techniques.

Consider whether your question would be better asked at Cross Validated, a Stack Exchange site for probability, statistics, data analysis, data mining, experimental design, and machine learning. Stack Overflow questions on statistics should be about implementation and programming problems, not about theoretical discussions of statistics or research design. Therefore, this tag should never be used alone but always in combination with a specific programming-language tag.

16319 questions
79 votes, 9 answers

Pandas - Compute z-score for all columns

I have a dataframe containing a single column of IDs and all other columns are numerical values for which I want to compute z-scores. Here's a subsection of it: ID Age BMI Risk Factor PT 6 48 19.3 4 PT 8 43 20.9 NaN PT…
Slavatron
78 votes, 14 answers

Rolling variance algorithm

I'm trying to find an efficient, numerically stable algorithm to calculate a rolling variance (for instance, a variance over a 20-period rolling window). I'm aware of the Welford algorithm that efficiently computes the running variance for a stream…
Abiel
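As a rough illustration of the rolling-variance problem described in the excerpt above, the following is a minimal sketch that adapts the Welford-style running update to a fixed window. It is not taken from the thread's answers; the class name `RollingVariance` and its `push` method are hypothetical, and it returns the sample variance (divisor n - 1), which is an assumption about what the asker wants.

```python
from collections import deque


class RollingVariance:
    """Sample variance over the last `window` values (hypothetical helper)."""

    def __init__(self, window):
        if window < 2:
            raise ValueError("window must be at least 2")
        self.window = window
        self.values = deque()
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def push(self, x):
        if len(self.values) < self.window:
            # Window still filling: ordinary Welford update.
            self.values.append(x)
            delta = x - self.mean
            self.mean += delta / len(self.values)
            self.m2 += delta * (x - self.mean)
        else:
            # Window full: swap the oldest value for the new one in O(1).
            old = self.values.popleft()
            self.values.append(x)
            old_mean = self.mean
            self.mean += (x - old) / self.window
            self.m2 += (x - old) * (x - self.mean + old - old_mean)
        n = len(self.values)
        if n < 2:
            return None  # variance undefined for a single value
        # Clamp tiny negative values caused by floating-point rounding.
        return max(self.m2, 0.0) / (n - 1)


rv = RollingVariance(3)
results = [rv.push(x) for x in [1.0, 2.0, 3.0, 10.0]]
```

Each `push` costs constant time regardless of window size. For very long streams, rounding error can accumulate in `m2`, so periodically recomputing from the stored window may be worthwhile.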
78 votes, 3 answers

T-test in Pandas

If I want to calculate the mean of two categories in Pandas, I can do it like this: data = {'Category': ['cat2','cat1','cat2','cat1','cat2','cat1','cat2','cat1','cat1','cat1','cat2'], 'values': [1,2,3,1,2,3,1,2,3,5,1]} my_data =…
hirolau
77 votes, 14 answers

Select k random elements from a list whose elements have weights

Selecting without any weights (equal probabilities) is beautifully described here. I was wondering if there is a way to convert this approach to a weighted one. I am also interested in other approaches. Update: Sampling without replacement
nimcap
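One known technique for the weighted, without-replacement case raised in the excerpt above is the Efraimidis-Spirakis method: draw a uniform random number u for each item, compute the key u^(1/w), and keep the k items with the largest keys. The sketch below is only an illustration under stated assumptions, not the approach the question links to; the function name and the `rng` parameter are hypothetical, and all weights are assumed to be strictly positive.

```python
import heapq
import random


def weighted_sample_without_replacement(items, weights, k, rng=None):
    """Pick k distinct items, favouring larger weights (illustrative sketch)."""
    rng = rng or random.Random()
    items = list(items)
    weights = list(weights)
    if len(items) != len(weights):
        raise ValueError("items and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    if not 0 <= k <= len(items):
        raise ValueError("k must be between 0 and len(items)")
    # One random key per item; a larger weight pushes the key towards 1.
    keyed = ((rng.random() ** (1.0 / w), i) for i, w in enumerate(weights))
    # Keep the k largest keys; the heap holds at most k entries.
    top = heapq.nlargest(k, keyed)
    return [items[i] for _, i in top]


sample = weighted_sample_without_replacement(
    list("abcde"), [1, 2, 3, 4, 5], 3, rng=random.Random(42)
)
```

Passing a seeded `random.Random` instance makes the draw reproducible, which helps when testing.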
76 votes, 7 answers

Convert Z-score (Z-value, standard score) to p-value for normal distribution in Python

How does one convert a Z-score from the Z-distribution (standard normal distribution, Gaussian distribution) to a p-value? I have yet to find the magical function in Scipy's stats module to do this, but one must be there.
gotgenes
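For the Z-score conversion asked about above, the standard normal tail probability can be written with the complementary error function: P(Z ≥ z) = 0.5 · erfc(z / √2). The sketch below uses only the Python standard library. The function name `z_to_p` and its `two_sided` flag are hypothetical. In SciPy, the survival function `scipy.stats.norm.sf` gives the same upper-tail probability.

```python
import math


def z_to_p(z, two_sided=True):
    """Convert a Z-score to a p-value under the standard normal distribution."""
    if two_sided:
        # Probability of a value at least as extreme as |z| in either tail.
        return math.erfc(abs(z) / math.sqrt(2.0))
    # One-sided (upper tail): P(Z >= z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_two_sided = z_to_p(1.96)
p_upper_tail = z_to_p(1.96, two_sided=False)
```

Whether a one-sided or two-sided p-value is appropriate depends on the hypothesis being tested, so the default here is only a convenience.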
76 votes, 2 answers

Confidence intervals for predictions from logistic regression

In R, predict.lm computes predictions based on the results from linear regression and also offers to compute confidence intervals for these predictions. According to the manual, these intervals are based on the error variance of fitting, but not on…
unique2
75 votes, 1 answer

Statistical performance of purely functional maps and sets

Given a data structure specification such as a purely functional map with known complexity bounds, one has to pick between several implementations. There is some folklore on how to pick the right one, for example Red-Black trees are considered to be…
73 votes, 4 answers

Standard deviation of generic list?

I need to calculate the standard deviation of a generic list. I will try to include my code. It's a generic list with data in it. The data is mostly floats and ints. Here is my code that is relevant to it without getting into too much detail:…
Tom Hangler
71 votes, 5 answers

Simple statistics - Java packages for calculating mean, standard deviation, etc

Could you please suggest any simple Java statistics packages? I don't necessarily need any of the advanced stuff. I was quite surprised that there does not appear to be a function to calculate the Mean in the java.lang.Math package... What are you…
Peter Perháč
69 votes, 4 answers

Constructing a co-occurrence matrix in python pandas

I know how to do this in R. But, is there any function in pandas that transforms a dataframe to an nxn co-occurrence matrix containing the counts of two aspects co-occurring. For example a matrix df: import pandas as pd df = pd.DataFrame({'TFD' :…
user3084006
66 votes, 4 answers

Warning: non-integer #successes in a binomial glm! (survey packages)

I am using the twang package to create propensity scores, which are used as weights in a binomial glm using survey::svyglm. The code looks something like this: pscore <- ps(ppci ~ var1+var2+.........., data=dt....) dt$w <- get.weights(pscore,…
Robert Long
64 votes, 5 answers

Screening (multi)collinearity in a regression model

I hope that this one is not going to be "ask-and-answer" question... here goes: (multi)collinearity refers to extremely high correlations between predictors in the regression model. How to cure them... well, sometimes you don't need to "cure"…
aL3xa
64 votes, 5 answers

Pythonic way of detecting outliers in one dimensional observation data

For the given data, I want to set the outlier values (defined by a 95% confidence level or 95% quantile function or anything that is required) as NaN values. Following is my data and the code that I am using right now. I would be glad if someone could…
user3410943
63 votes, 8 answers

Sorting algorithms for data of known statistical distribution?

It just occurred to me, if you know something about the distribution (in the statistical sense) of the data to sort, the performance of a sorting algorithm might benefit if you take that information into account. So my question is, are there any…
static_rtti
62 votes, 9 answers

Variance Inflation Factor in Python

I'm trying to calculate the variance inflation factor (VIF) for each column in a simple dataset in Python: a b c d 1 2 4 4 1 2 6 3 2 3 7 4 3 2 8 5 4 1 9 4 I have already done this in R using the vif function from the usdm library which gives the…
Nizag