Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, they are composed of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
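The three steps above can be sketched in plain Python. This is only an illustration: the `scores` data and the `split_apply_combine` helper are hypothetical names invented for this example, not part of any library.

```python
from collections import defaultdict

# Hypothetical sample data: (class name, student score) pairs.
scores = [("math", 71), ("art", 88), ("math", 93), ("art", 64), ("math", 85)]


def split_apply_combine(rows, func):
    """Group (key, value) rows by key and apply func to each group."""
    # 1. Split: collect the values that belong to each key.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # 2. Apply: run func on each chunk independently.
    # 3. Combine: gather the per-chunk results into a single dict.
    return {key: func(values) for key, values in groups.items()}


highest = split_apply_combine(scores, max)
print(highest)  # {'math': 93, 'art': 88}
```

Passing a different function (for example `len` or `statistics.median`) changes only the apply step; the split and combine steps stay the same.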

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations among other things
    • dplyr and the original package plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library introduces data objects that include a group-by method for this type of operation.
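As a rough sketch of the pandas approach, the median-income example could look like the following. The `people` frame and its values are made up for illustration; `groupby`, `median`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Hypothetical individual-level data; the names and values are made up.
people = pd.DataFrame({
    "country": ["FR", "FR", "FR", "JP", "JP"],
    "income": [30000, 42000, 36000, 51000, 39000],
})

# Split by country, apply the median, combine into one value per group.
median_by_country = people.groupby("country")["income"].median()

# Same statistic, but aligned with the original rows so it can be
# appended to the same data as a new column.
people["median_income"] = (
    people.groupby("country")["income"].transform("median")
)

print(median_by_country)
print(people)
```

The first form returns one row per group, while `transform` returns a result with the same length as the input, which is what "appending the result to the same data" requires.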

151 questions
3
votes
4 answers

R: subsetting and ordering large data.frame without forloop

I have a long table with 97M rows. Each row contains the information of an action taken by a person and the timestamp for that action, in the form: actions <- c("walk","sleep", "run","eat") people <- c("John","Paul","Ringo","George") timespan <-…
CptNemo
  • 6,455
  • 16
  • 58
  • 107
3
votes
3 answers

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and then 3 other columns that are numeric. I want to combine the rows of repeated names into one column and sum 2 of the columns while leaving the other alone. Is there a simple way to do this? I…
user3585829
  • 945
  • 11
  • 24
3
votes
1 answer

R apply function on groups of data frame

I need to run ANOVA on each subject individually. I have a dataframe consisting of data from 37 subjects, and I don't want to loop 37 times to run ANOVA separately on each subject. These are the first 4 rows of my data: latency…
user4045430
  • 207
  • 1
  • 6
  • 13
3
votes
1 answer

Pairwise correlation

I have a dataframe that looks similar to this: In [45]: df Out[45]: Item_Id Location_Id date price 0 A 5372 1 0.5 1 A 5372 2 NaN 2 A 5372 3 1.0 3 A 6065 1 …
svenkatesh
  • 1,152
  • 2
  • 10
  • 25
3
votes
2 answers

Applying multiple functions to each column in a data frame using aggregate

When I need to apply multiple functions to multiple columns sequentially and aggregate by multiple columns and want the results to be bound into a data frame I usually use aggregate() in the following manner: # bogus functions foo1 <-…
lord.garbage
  • 5,884
  • 5
  • 36
  • 55
3
votes
4 answers

Create binary variable based on number of unique / distinct values by group

I have data as follows: userID <- c(1,1,1,2,2,2,3,3,3) product <- c("a","a","a","b","b","c","a","b","c") df <- data.frame(userID, product) For each 'userID', I want to create a binary indicator variable which is 1 if there are more than one unique…
Daryl
  • 37
  • 1
  • 4
3
votes
2 answers

Simple moving average on an unbalanced panel in R

I am working with an unbalanced, irregularly spaced cross-sectional time series. My goal is to obtain a lagged moving average vector for the "Quantity" vector, segmented by "Subject". In other words, say the following Quantities have been…
user27636
  • 1,070
  • 1
  • 18
  • 26
2
votes
2 answers

improvement on tapply (shifting groups of vectors)

The order of the return object from tapply() is ambiguous, so I've started to worry about this bit of code: #d <- data.frame(value = c(1,2,3,5), # source = c("a","a","b","b")) d$value <- unlist(tapply(d$value, d$source, function(v)…
Taylor
  • 1,797
  • 4
  • 26
  • 51
2
votes
1 answer

Combine grouped DF in Julia with Floats and Strings

I have a bunch of Grouped DataFrames gdf that I want to combine. I want to combine the GDF with the mean var1 which is a Float and the first element of var2 which is a String. I tried combine(gdf, :var1 .=> mean, :var2 .=> first(:var2)) But…
Moshi
  • 193
  • 6
2
votes
1 answer

How to produce grouped summary statistics without explicitly naming the variables

Given a Julia dataframe with many variables and a final class column: julia> df 5×3 DataFrame Row │ v1 v2 cl │ Int64? Int64 Int64 ─────┼─────────────────────── 1 │ 10 1 2 2 │ 20 2 2 3 │ …
Antonello
  • 6,092
  • 3
  • 31
  • 56
2
votes
1 answer

pandas split-apply-combine creates undesired MultiIndex

I am using the split-apply-combine pattern in pandas to group my df by a custom aggregation function. But this returns an undesired DataFrame with the grouped column existing twice: in a MultiIndex and in the columns. The following is a simplified…
Tomas Pazur
  • 121
  • 1
  • 2
  • 6
2
votes
1 answer

Applying group-specific function that returns a single series

I'm trying to figure out an efficient split/apply/combine scheme for the following scenario. Consider the pandas dataframe demoAll defined below: import datetime import pandas as pd demoA = pd.DataFrame({'date':[datetime.date(2010,1,1),…
bigO6377
  • 1,256
  • 3
  • 14
  • 28
2
votes
1 answer

Using split-apply-combine to remove some values with a customized function and combine what's left

So this isn't the dataset I need to work with but it's a template for a huge one I'm working with (~1.8 million data points) for a cancer research project, so I figured if I could get this to work with a smaller one, then I can adapt it for my large…
Brenton
  • 435
  • 2
  • 5
  • 14
2
votes
1 answer

pandas apply with parameter list

I have a simple DataFrame Object: df = pd.DataFrame(np.random.random_sample((5,5))) df["col"] = ["A", "B", "C", "A" ,"B"] #simple function def func_apply(df,param=1): pd.Series(np.random(3)*param,name=str(param)) Now applying the function…
2
votes
1 answer

How to rowwise-sort a matrix containing subgrouped data

In matrix A, every column represents an output variable and every row represents a reading (6 rows in total). Every output has a certain subgroup size (groups of 3 rows). I need A's elements to be sorted in the vertical direction within every…
user9003011
  • 306
  • 1
  • 10