Questions tagged [statistics]

Consider whether your question would be better asked at https://stats.stackexchange.com. Statistics is the mathematical study of using probability to infer characteristics of a population from a limited number of samples or observations.

Statistics is the scientific study of the collection, analysis, interpretation, presentation, and organization of data. Numerous programming languages provide support for implementing statistical techniques.

Consider whether your question would be better asked at Cross Validated, a Stack Exchange site for probability, statistics, data analysis, data mining, experimental design, and machine learning. Stack Overflow questions on statistics should be about implementation and programming problems, not about theoretical discussions of statistics or research design. Therefore, this tag should never be used alone but always in combination with a specific programming language tag.

16319 questions
4 votes, 2 answers

How can I weight features for better clustering with a very small data set?

I'm working on a program that takes in several (<50) high-dimensional points in feature space (1000+ dimensions) and performs hierarchical clustering on them by recursively applying standard k-clustering. My problem is that in any one k-clustering…
4 votes, 1 answer

Finding the elbow point of a curve in a stable way?

I am aware of the existence of this, and this on this topic. However, I would like to finalize on an actual implementation in Python this time. My only problem is that the elbow point seems to be changing from different instantiations of my code.…
Legend
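For the elbow question above, one common heuristic (not necessarily the one the asker used) is to pick the point with the largest perpendicular distance from the straight line joining the first and last points of the curve, after scaling both axes to [0, 1]. The sketch below is deterministic for a given input, so any run-to-run variation would have to come from the data rather than from the detection step. The function name `elbow_index` is hypothetical.

```python
import numpy as np

def elbow_index(x, y):
    """Return the index of the point farthest from the first-to-last chord."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Scale both axes to [0, 1] so neither dominates the distance.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())

    # Unit vector along the chord from the first point to the last point.
    p0 = np.array([xn[0], yn[0]])
    direction = np.array([xn[-1], yn[-1]]) - p0
    direction /= np.linalg.norm(direction)

    # Perpendicular distance is the magnitude of the 2-D cross product.
    rel = np.column_stack([xn, yn]) - p0
    dist = np.abs(rel[:, 0] * direction[1] - rel[:, 1] * direction[0])
    return int(np.argmax(dist))
```

Smoothing or averaging the curve over several runs before calling such a function is one way to reduce sensitivity to noise; whether that is appropriate depends on how the curve was produced.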
4 votes, 0 answers

How to find feature Interactions between all columns in a dataframe, Python?

Friedman’s H-statistic: The interpretable ML book by Christoph Molnar actually gives us a workable approach, by using Friedman’s H-statistic based on the decomposition of the partial dependence values to calculate the feature interactions. In Python,…
Ailurophile
4 votes, 1 answer

What does SAGA stand for in optimization solvers?

If SAG stands for Stochastic Average Gradient what does SAGA stand for? From Sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) From lightning (http://contrib.scikit-learn.org/lightning/)
Student
4 votes, 7 answers

best language or program for finding patterns and statistical analysis?

I have a program that downloads basic historical stock data from yahoo and puts it into an SQLite database. I'd like to be able to perform queries such as finding the moving average, and determining the longest period where a stock has either…
Jared
4 votes, 4 answers

What software package can you suggest for a programmer who rarely works with statistics?

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier. As an example to put…
flodin
4 votes, 1 answer

API: Top 100 twitter users in a country (rank by followers)

I would like to get a list of the top 100 Twitter users by country/location (ranked by the number of followers they have). I can't see how I can achieve this using the Twitter API, but I know that it can be done because sites like…
JR.
4 votes, 0 answers

Obtaining the least squares means from MixedLM model

I'm not the best statistician by a long shot, but I was trying to obtain the least squares means of my fit after fitting a mixed linear effect model using statsmodels.api.MixedLM. By printing the summary of the fit, I only see that it returns the…
Joker
4 votes, 1 answer

How can I generate data which will show inverted bell curve for normal distribution

I have generated random data which follows normal distribution using the below code: import numpy as np import matplotlib.pyplot as plt import seaborn as sns rng = np.random.default_rng() number_of_rows = 10000 mu = 0 sigma = 1 data =…
4 votes, 3 answers

SQL Server - How to add a column of percentile values of another column?

I'd like to have a calculated field that gives me the percentile of a column's value in a table. What is the best way to do so? I have a table with only one column containing values ranging from 0 to 10000, randomly distributed. I want to add…
user776676
4 votes, 3 answers

Fast and accurate computation of studentized external residuals in R

I want to compute the external studentized residuals of a dataset {x,y} of size n in R given the following constraints: (very) high precision; high performance (avoiding loops where possible); R language (including Rcpp). The R code should be fast…
Grasshoper
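The question above asks for R, so the following NumPy sketch only illustrates the loop-free identity, assuming a simple linear regression of y on x with an intercept. It uses the leverages h_ii from a QR decomposition and the closed-form leave-one-out variance s_(i)^2 = (SSE - e_i^2 / (1 - h_ii)) / (n - p - 1), so no model is refitted. The function name is hypothetical, and the asker's precision requirements are not addressed beyond using QR instead of forming X'X.

```python
import numpy as np

def external_studentized_residuals(x, y):
    """Externally studentized residuals for y ~ 1 + x, without refitting."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Design matrix with an intercept column (an assumption of this sketch).
    X = np.column_stack([np.ones_like(x), x])
    n, p = X.shape

    # Thin QR: the hat matrix is Q Q^T, so leverages are row sums of Q**2.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)

    # Ordinary residuals and the residual sum of squares.
    e = y - Q @ (Q.T @ y)
    sse = e @ e

    # Leave-one-out variance estimate for each observation, in closed form.
    s2_loo = (sse - e**2 / (1.0 - h)) / (n - p - 1)

    return e / np.sqrt(s2_loo * (1.0 - h))
```

In R, the built-in `rstudent()` applied to an `lm` fit computes the same quantity and is a natural reference to check a custom implementation against.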
4 votes, 1 answer

Geometric mean functions returning Inf

Trying to solve a homework problem: I have two functions to get the geometric mean from 1000 observations from the exponential distribution with a rate of .01. The following keeps returning Inf. gmean <- function(n) { prod(n)^(1/length(n)) …
aaaa
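The likely cause of the Inf in the question above is overflow: with a rate of .01 the observations average about 100, and the product of 1000 such values far exceeds the largest double-precision number. A standard workaround is to average the logarithms and exponentiate the result. The sketch below shows this in Python (the question uses R, where the equivalent expression would be `exp(mean(log(n)))`); the function name is hypothetical.

```python
import math

def geometric_mean(values):
    """Geometric mean computed in log space to avoid overflow."""
    values = list(values)
    if not values:
        raise ValueError("geometric mean of an empty sequence is undefined")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be strictly positive")

    # Sum of logs stays small even when the raw product would overflow.
    log_sum = math.fsum(math.log(v) for v in values)
    return math.exp(log_sum / len(values))
```

For example, `geometric_mean([100.0] * 1000)` stays finite, whereas multiplying the same 1000 values together overflows.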
4 votes, 2 answers

How to verify if two text datasets are from different distribution?

I have two text datasets. Each dataset consists of multiple sequences and each sequence can have more than one sentence. How do I measure if both datasets are from same distribution? The purpose is to verify transfer learning from one distribution…
4 votes, 1 answer

How to calculate updated and deleted values with Welford's online algorithm

Use case: Streaming large amounts of event source data that may have inserts, updates, and deletes and has guaranteed order. Assuming Welford's Algorithm in this form in an event stream for insert: private double _count = 0; private double _mean =…
ChaseAucoin
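One way to handle deletes in the question above is to run the Welford insert step in reverse, and to treat an update as a delete of the old value followed by an insert of the new one. The Python sketch below illustrates this; the class and method names are hypothetical and are not taken from the asker's C# code. Removal subtracts nearly equal quantities, so it can lose precision when the variance is small relative to the mean.

```python
class RunningStats:
    """Welford-style running mean and variance with insert, delete, update."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def remove(self, x):
        # Assumes x was previously added; this is not checked.
        if self.count <= 1:
            self.count, self.mean, self.m2 = 0, 0.0, 0.0
            return
        self.count -= 1
        delta = x - self.mean              # deviation from the old mean
        self.mean -= delta / self.count    # mean without x
        self.m2 -= delta * (x - self.mean)

    def update(self, old, new):
        self.remove(old)
        self.add(new)

    def variance(self):
        # Sample variance; undefined for fewer than two values.
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")
```

Keeping `count`, `mean`, and `m2` as the only state means each event is processed in constant time, which suits an ordered event stream.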
4 votes, 1 answer

emmeans function won't run or takes too long to run

I am new. I want to use the emmeans function to calculate estimated marginal means based on a model. This model is done by lmer function. The problem is I have lots (20ish) of fixed effect variables and one random effect variable. I can run lmer…
pink99