Questions tagged [split-apply-combine]

Split-apply-combine operations refer to a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. The operation has three steps:

- Splitting data by the value of one or more variables
- Applying a function to each chunk of data independently
- Combining the results back into one piece

Examples of split-apply-combine operations would be:

- Computing median income by country from individual-level data (possibly appending the result to the same data)
- Finding the highest score for each class from student scores
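
As a rough illustration of the three steps, the median-income example can be sketched in plain Python using only the standard library. The records and the names `records`, `chunks`, `median_by_country`, and `combined` are made up for this sketch and do not come from any real dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records: (country, income).
records = [
    ("A", 30_000),
    ("A", 50_000),
    ("A", 40_000),
    ("B", 20_000),
    ("B", 70_000),
]

# Split: collect incomes into one chunk per country.
chunks = defaultdict(list)
for country, income in records:
    chunks[country].append(income)

# Apply: compute the statistic on each chunk independently.
median_by_country = {
    country: median(incomes) for country, incomes in chunks.items()
}

# Combine: append each group's result back onto the original rows.
combined = [
    (country, income, median_by_country[country])
    for country, income in records
]
```

Here `median_by_country` holds one value per group, while `combined` shows the variant in which the group-level result is appended to the individual-level data.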

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

In the R statistical environment, there are dedicated packages for this purpose:
- data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
- dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations

In Python, the pandas library provides data objects (DataFrame and Series) with a groupby method for this type of operation.
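
For instance, the highest-score-per-class example might look like the following in pandas. This is a minimal sketch that assumes pandas is installed; the DataFrame contents and the column names `class_id`, `student`, `score`, and `class_max` are invented for illustration.

```python
import pandas as pd

# Hypothetical student scores.
scores = pd.DataFrame({
    "class_id": ["x", "x", "y", "y", "y"],
    "student": ["s1", "s2", "s3", "s4", "s5"],
    "score": [71, 88, 64, 93, 80],
})

# Split by class_id, apply max to each group, combine into one Series
# indexed by class_id (one value per class).
highest = scores.groupby("class_id")["score"].max()

# Same statistic, but broadcast back so every original row
# carries the maximum of its own class.
scores["class_max"] = scores.groupby("class_id")["score"].transform("max")
```

Aggregating with `max()` returns one row per group, whereas `transform("max")` keeps the original shape, which is the form to use when the result should be appended to the same data.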

151 questions

votes

4 answers

R: subsetting and ordering large data.frame without forloop

I have a long table with 97M rows. Each row contains the information of an action taken by a person and the timestamp for that action, in the form: actions <- c("walk","sleep", "run","eat") people <- c("John","Paul","Ringo","George") timespan <-…

asked May 13 '15 at 16:08

CptNemo

6,455
16
58
107

votes

3 answers

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric. I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I…

r split-apply-combine

asked Mar 18 '15 at 18:49

user3585829

votes

1 answer

R apply function on groups of data frame

I need to run ANOVA on each subject individually. I have a dataframe consisting of data from 37 subjects, and I don't want to loop 37 times to run ANOVA separately on each subject. These are the first 4 rows of my data: latency…

r split-apply-combine

asked Dec 15 '14 at 09:30

user4045430

votes

1 answer

Pairwise correlation

I have a dataframe that looks similar to this: In [45]: df Out[45]: Item_Id Location_Id date price 0 A 5372 1 0.5 1 A 5372 2 NaN 2 A 5372 3 1.0 3 A 6065 1 …

python pandas split-apply-combine

asked Dec 13 '14 at 20:11

svenkatesh

1,152
2
10
25

votes

2 answers

Applying multiple functions to each column in a data frame using aggregate

When I need to apply multiple functions to multiple columns sequentially and aggregate by multiple columns and want the results to be bound into a data frame I usually use aggregate() in the following manner: # bogus functions foo1 <-…

r aggregate split-apply-combine

asked Oct 29 '14 at 07:12

lord.garbage

5,884
5
36
55

votes

4 answers

Create binary variable based on number of unique / distinct values by group

I have data as follows: userID <- c(1,1,1,2,2,2,3,3,3) product <- c("a","a","a","b","b","c","a","b","c") df <- data.frame(userID, product) For each 'userID', I want to create a binary indicator variable which is 1 if there are more than one unique…

r dataframe data-manipulation split-apply-combine

asked Oct 15 '14 at 10:13

Daryl

votes

2 answers

Simple moving average on an unbalanced panel in R

I am working with an unbalanced, irregularly spaced cross-sectional time series. My goal is to obtain a lagged moving average vector for the "Quantity" vector, segmented by "Subject". In other words, say the following Quantities have been…

r data.table plyr panel-data split-apply-combine

asked Nov 10 '13 at 20:02

user27636

1,070
1
18
26

votes

2 answers

improvement on tapply (shifting groups of vectors)

The order of the return object from tapply() is ambiguous, so I've started to worry about this bit of code: #d <- data.frame(value = c(1,2,3,5), # source = c("a","a","b","b")) d$value <- unlist(tapply(d$value, d$source, function(v)…

r group-by tapply split-apply-combine

asked Jun 06 '23 at 20:01

Taylor

1,797
4
26
51

votes

1 answer

Combine grouped DF in Julia with Floats and Strings

I have a bunch of Grouped DataFrames gdf that I want to combine. I want to combine the GDF with the mean var1 which is a Float and the first element of var2 which is a String. I tried combine(gdf, :var1 .=> mean, :var2 .=> first(:var2)) But…

julia split-apply-combine julia-dataframe

asked Jul 12 '22 at 03:25

Moshi

votes

1 answer

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column: julia> df 5×3 DataFrame Row │ v1 v2 cl │ Int64? Int64 Int64 ─────┼─────────────────────── 1 │ 10 1 2 2 │ 20 2 2 3 │ …

dataframe julia split-apply-combine

asked Jul 01 '22 at 09:50

Antonello

6,092
3
31
56

votes

1 answer

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function. But this returns an undesired DataFrame with the grouped column existing twice: in a MultiIndex and in the columns. The following is a simplified…

pandas dataframe pandas-groupby split-apply-combine

asked Nov 22 '20 at 20:47

Tomas Pazur

votes

1 answer

Applying group-specific function that returns a single series

I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below: import datetime import pandas as pd demoA = pd.DataFrame({'date':[datetime.date(2010,1,1),…

python pandas pandas-groupby split-apply-combine

asked Dec 27 '19 at 06:19

bigO6377

1,256
3
14
28

votes

1 answer

Using split-apply-combine to remove some values with a customized function and combine what's left

So this isn't the dataset I need to work with but it's a template for a huge one I'm working with (~1.8 million data points) for a cancer research project, so I figured if I could get this to work with a smaller one, then I can adapt it for my large…

python pandas split-apply-combine

asked Oct 08 '19 at 02:40

Brenton

votes

1 answer

pandas apply with parameter list

I have a simple DataFrame Object: df = pd.DataFrame(np.random.random_sample((5,5))) df["col"] = ["A", "B", "C", "A" ,"B"] #simple function def func_apply(df,param=1): pd.Series(np.random(3)*param,name=str(param)) Now applying the function…

python pandas dataframe pandas-groupby split-apply-combine

asked Jun 28 '19 at 22:05

MichaelRazum

votes

1 answer

How to rowwise-sort a matrix containing subgrouped data

In matrix A, every column represents an output variable and every row represents a reading (6 rows in total). Every output has a certain subgroup size (groups of 3 rows). I need A's elements to be sorted in the vertical direction within every…

matlab sorting matrix grouping split-apply-combine

asked May 23 '18 at 07:58

user9003011
