Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. The pattern has three steps:

- Splitting data by the value of one or more variables
- Applying a function to each chunk of data independently
- Combining the data back into one piece
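
The three steps above can be sketched in plain Python. This is only a minimal illustration, not how any particular library implements the pattern; the helper name `split_apply_combine` and the sample records are made up for the example.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group `records` by `key`, apply `func` to each group's `value` list,
    and return one result per group as a dict."""
    # Split: collect the values of each chunk under its key.
    chunks = defaultdict(list)
    for record in records:
        chunks[record[key]].append(record[value])
    # Apply: compute the function on each chunk independently.
    # Combine: gather the per-chunk results into a single dict.
    return {group: func(values) for group, values in chunks.items()}


# Hypothetical sample data.
scores = [
    {"class": "A", "score": 71},
    {"class": "B", "score": 88},
    {"class": "A", "score": 93},
    {"class": "B", "score": 64},
]

highest = split_apply_combine(scores, key="class", value="score", func=max)
print(highest)  # {'A': 93, 'B': 88}
```

Passing a different `func` (for example `len` or `statistics.median`) reuses the same split and combine logic with another per-chunk computation.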

Examples of split-apply-combine operations would be:

Computing median income by country from individual-level data (possibly appending the result to the same data)
Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

In the R statistical environment there are dedicated packages for this purpose:
- data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things.
- dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations.
In Python, the pandas library introduces data objects that include a group-by method for this type of operation.
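
As a rough sketch of the pandas approach, the snippet below computes median income by country and also appends the group statistic back onto the original rows. The data frame and its column names are hypothetical; `groupby`, `median` and `transform` are the standard pandas calls.

```python
import pandas as pd

# Hypothetical individual-level data.
people = pd.DataFrame({
    "country": ["FR", "FR", "FR", "JP", "JP"],
    "income": [30, 50, 40, 20, 70],
})

# Split by country, apply the median, combine into one value per group.
median_by_country = people.groupby("country")["income"].median()

# Same statistic, but broadcast back so every original row carries
# its own group's median (the "appending to the same data" variant).
people["country_median"] = people.groupby("country")["income"].transform("median")

print(median_by_country)
print(people)
```

The first form returns one row per group, while `transform` keeps the original shape, which is convenient when the group statistic is needed alongside each observation.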

151 questions

2 answers

How to use split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with date and experiment conditions as well as additional non-numerical conditions such as filenames. I…

asked Jul 10 '17 at 13:57

volkerH

3 answers

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe: df <- data_frame(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>% group_by(A) %>% nest() which looks like this: > df # A tibble: 2 × 2 A data 1 1 2 …

r dplyr split-apply-combine

asked May 04 '17 at 15:54

crf

1 answer

Pandas timestamp difference in groupby transform

I have a dataframe with an integer index, session_id, event, and time_stamp that looks like this: In [41]: df = pd.DataFrame(data={'session_id': np.sort(np.random.choice(np.arange(3), 11)), 'event': np.random.choice(['A', 'B', 'C', 'D'], 11),…

python pandas numpy timestamp split-apply-combine

asked Feb 15 '17 at 21:52

lenderson

2 answers

Use lapply() to find percentages of factor variables

I have a data frame that consists of 4 columns that represent questions, and each column has 4 levels that represent responses. Q1 Q2 1 A A 2 A B 3 B B 4 C C 5 D D And I'd like to derive a data.frame such as this: question response…

r lapply reshape2 split-apply-combine

asked Jul 10 '15 at 20:53

cangers

2 answers

Use dplyr's group_by to perform split-apply-combine

I am trying to use dplyr to do the following: tapply(iris$Petal.Length, iris$Species, shapiro.test) I want to split the Petal.Lengths by Species, and apply a function, in this case shapiro.test. I read this SO question and quite a number of other…

r group-by dplyr split-apply-combine

asked Oct 30 '14 at 22:43

Ram Narasimhan

3 answers

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but often takes quite some time. I have…

c++ r performance rcpp split-apply-combine

asked Jul 28 '14 at 14:28

coffeinjunky

2 answers

Efficient way to paste multiple column pairs in R data.table

I'm looking for an efficient way to paste/combine multiple pairs of adjacent columns at once using data.table. My feeble attempt is slow and not so elegant: library(data.table) dt <- data.table(ids = 1:3, x1 = c("A","B","C"), …

r data.table paste split-apply-combine

asked May 22 '19 at 17:33

Bryan

1 answer

Groupby cumulative mean with wide/long pivoting

I have a DataFrame that looks like this (see bottom here for code to reproduce it): date id_ val 0 2017-01-08 a; b 9.3 1 2017-01-07 a; b; c 7.9 2 2017-01-07 a 7.3 3 2017-01-06 b 9.0 4 2017-01-06 c …

python pandas pandas-groupby split-apply-combine

asked Feb 03 '18 at 19:43

Brad Solomon

1 answer

data.table: aggregate, join, and assign by reference

Let's call dta the table I want to assign to, and dts the source of the data I want to join and aggregate to dta. dta = data.table(i=1:4, x=rnorm(4)) dts = data.table(i=rep(1:3, each=3), z=runif(9)) I would think I should be able to join on 'i' and…

r data.table split-apply-combine

asked Nov 07 '17 at 05:42

James

2 answers

Quantile threshold/filter within pandas groupby

I have one categorical variable and two numeric cols: np.random.seed(123) df = pd.DataFrame({'group' : ['a']*10+['b']*10, 'var1' : np.random.randn(20), 'var2' : np.random.randint(10,size=20)}) I want to…

python pandas pandas-groupby split-apply-combine

asked Sep 28 '17 at 16:47

Brad Solomon

2 answers

Building complex subsets in Pandas DataFrame

I'm making my way around GroupBy, but I still need some help. Let's say that I've a DataFrame with columns Group, giving objects group number, some parameter R and spherical coordinates RA and Dec. Here is a mock DataFrame: df = pd.DataFrame({ …

python pandas dataframe pandas-groupby split-apply-combine

asked Sep 19 '17 at 14:07

Matt

1 answer

split apply combine w/ function or purrr package pmap?

This is a big problem for me to solve. If I had enough reputation to award a bounty I would! Looking to balance territories of accounts of sales reps. I have the process broken up, and I don't really know how to do it across each region. In this…

r for-loop purrr split-apply-combine pmap

asked Feb 07 '17 at 16:04

Matt W.

4 answers

Group by columns, then compute mean and sd of every other column in R

How do I group by columns, then compute the mean and standard deviation of every other column in R? As an example, consider the famous Iris data set. I want to do something similar to grouping by species, then compute the mean and sd of the…

r split-apply-combine

asked May 26 '16 at 09:59

I Like to Code

1 answer

How to separate factor interactions in R

I recently had to graph some data based on an interaction of factors and I found it more difficult than I felt something this common should be in R. I suspect I'm missing something. Let's say I have a vector of 30 numbers along with a pair of…

r tapply split-apply-combine

asked Mar 20 '16 at 22:13

pglezen

2 answers

Fastest Way to Split Data Frame by Group, shuffle single vector in R

I am familiar with some of the split-apply-combine functions in R, like ddply, but I am unsure how to split a data frame, modify a single variable within each subset, and then recombine the subsets. I can do this manually, but there is surely a…

r split-apply-combine

asked Dec 10 '15 at 18:50

Michael Davidson
