Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, these operations consist of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
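The three steps above can be sketched in plain Python without any library. This is only a minimal illustration: the helper name `split_apply_combine` and the sample `scores` records are made up for this example and do not come from any particular package.

```python
from collections import defaultdict


def split_apply_combine(records, key, func):
    """Group records by key(record), apply func to each group,
    and return the results as a {group: result} dictionary."""
    # 1. Split: collect the records of each group in its own list.
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    # 2. Apply func to each chunk independently, and
    # 3. Combine the per-group results into one dictionary.
    return {group: func(chunk) for group, chunk in groups.items()}


# Hypothetical sample data: one record per student.
scores = [
    {"class": "A", "student": "s1", "score": 71},
    {"class": "A", "student": "s2", "score": 88},
    {"class": "B", "student": "s3", "score": 93},
    {"class": "B", "student": "s4", "score": 64},
]

# Highest score for each class.
highest = split_apply_combine(
    scores,
    key=lambda r: r["class"],
    func=lambda chunk: max(r["score"] for r in chunk),
)
print(highest)  # {'A': 88, 'B': 93}
```

Dedicated tools do the same thing with less code and usually much faster, but the structure (group, compute per group, reassemble) is the same.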

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library provides data objects with a groupby method for this type of operation.
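As a rough sketch of the pandas approach, the example below computes median income by country, first as one value per group and then appended to the original rows. The `people` DataFrame and its column names are invented for illustration; `groupby`, `median`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Hypothetical individual-level data.
people = pd.DataFrame({
    "country": ["DE", "DE", "DE", "FR", "FR"],
    "income": [30000, 42000, 51000, 28000, 36000],
})

# Split by country, apply the median, combine into one Series
# with one entry per country.
median_by_country = people.groupby("country")["income"].median()

# Same computation, but the result is broadcast back to every row
# so it can be stored as a new column of the original data.
people["median_income"] = people.groupby("country")["income"].transform("median")

print(median_by_country)
print(people)
```

Aggregating with `median()` returns one row per group, while `transform` keeps the original shape, which is what "appending the result to the same data" requires.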

151 questions
1
vote
1 answer

Pandas: How to combine several rows with the same column value and create a new Dataframe which covers all possibilities?

There exists a DataFrame like this:

    id   name   age
    0x0  Hans   32
    0x0  Peter  21
    0x1  Jan    42
    0x1  Simon  25
    0x1  Klaus  51
    0x1  Franz  72

I'm aiming to create a DataFrame that covers any possible combination within the same ID. The only…
1
vote
1 answer

Compute and broadcast a count in pandas (with groupby transform)

How can I compute and broadcast a count in pandas? To compute a count: df.groupby('field').size() To broadcast an aggregation to the original dataframe: df.groupby('field')['field_to_aggregate'].transform(aggregation) The latter works if I specify…
Michele Piccolini
  • 2,634
  • 16
  • 29
1
vote
1 answer

pandas groupby shift is not respecting the groups

I have the following DataFrame and an arbitrary function df = pd.DataFrame( {'grp': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3], 'val': [0.80485036, 0.30698609, 0.33518013, 0.12214516, 0.66355629, 0.71277808,…
Jonathan
  • 1,287
  • 14
  • 17
1
vote
1 answer

Filtering pandas groupby using value of column (string datatype)

I've been working on a large genomics data set that contains multiple reads of every sample to make sure we got the data, but when analyzing it we need to drop it down to one row so we don't skew the data (count the gene as present 6 times when it…
Andrew T
  • 186
  • 2
  • 11
1
vote
2 answers

Swift Combine - prefix publisher on array

I'm playing around with publishers in Swift/Combine, I have a function that fetches 100 records and returns them as an array. As a test I want to return just the first two items, but it's not working as I expected it to, it always returns 100, my…
Chris
  • 2,739
  • 4
  • 29
  • 57
1
vote
1 answer

Matlab: Use Splitapply to write multiple files

I have grouped tables by a variable and I am trying to write multiple files based on the grouping variable. But it does not work. I used findgroups and splitapply, but the splitapply is where I am having problems. Here is one version of the commands…
Hobbycoder
  • 11
  • 1
1
vote
1 answer

Un-nest output of d3.group or d3.rollup?

I am using d3-array rollup to do group-by like counting operation, in preparation to generate an html table. I have a variable number of grouping keys, which I am passing like this: var rollup_keys = new Map([ ['count', v => v.length], …
deargle
  • 487
  • 4
  • 8
1
vote
0 answers

Matlab best practice for choosing and using splitapply, rowfun, and varfun?

Matlab seems to have a number of different code patterns for realizing SQL's GROUP BY aggregation of data. To me it seems that this makes it hard for best practice and code idioms to coalesce. Are there guidelines for which are best for which…
user36800
  • 2,019
  • 2
  • 19
  • 34
1
vote
1 answer

Combine 3 columns to one column pandas

I have the following code: input= pd.DataFrame({'Police District Name': ['WHEATON', 'SILVER SPRING', 'BETHESDA','GERMANTOWN','WHEATON','MONTGOMERY VILLAGE'], 'cn1': ['Crime Against Person', 'Crime Against Person', 'Crime Against…
mango90001
  • 43
  • 7
1
vote
3 answers

Create a variable whose values are with data type array and those values came from multiple columns

I would like to know how I could come up with the new variable "test_array" which is of data type array and created by combining columns "test_1" to "test_4" because I wanted to use it for further calculations. id test_1 test_2 test_3 …
Ashtasora
  • 35
  • 5
1
vote
2 answers

How to add totals as well as group_by statistics in R

When computing any statistic using summarise and group_by we only get the summary statistic per-category, and not the value for all the population (Total). How to get both? I am looking for something clean and short. Until now I can only think…
1
vote
0 answers

Matlab `splitapply` speed trend?

My organization is usually a few years behind the most recent Matlab version. I am finding that splitapply is extremely slow when there are many groups (two numerical grouping variables), in sharp contrast to my experience with SQL. I suspect that…
user36800
  • 2,019
  • 2
  • 19
  • 34
1
vote
3 answers

Alternative to splitapply in Matlab

I am trying to run someone else's Matlab code that uses the splitapply function, which is only available in R2018a. I am currently using R2015a; is there a simple (albeit less efficient) alternative implementation which achieves the same purpose…
p-value
  • 608
  • 8
  • 22
1
vote
3 answers

Time Lag based on another variable

Given: test <- data.frame(Participant= c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3), Day = c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9), Value= c(1:30)) I want to arrive…
D500
  • 442
  • 5
  • 17
1
vote
2 answers

Python Pandas Aggregate Series Data Within a DataFrame

Within a dataframe, I am trying split-apply-combine to a column which contains series data element-wise. (I've searched SO but haven't found anything pertaining to series within data frames.) The data frame: import pandas as pd from pandas import…
Mark Pedigo
  • 87
  • 1
  • 10