Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, they are composed of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
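
The three steps above can be sketched in plain Python, without any data-analysis library. This is only a minimal illustration: the helper name `split_apply_combine`, its parameters, and the sample records are hypothetical and chosen for the example.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group `records` by `key`, apply `func` to each group's `value`s,
    and return one result per group (hypothetical helper for illustration)."""
    # 1. Split: collect the values of each chunk under its group key.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])

    # 2. Apply: run the function on each chunk independently.
    # 3. Combine: gather the per-chunk results into a single dict.
    return {group: func(values) for group, values in groups.items()}


# Made-up sample data: student scores tagged with a class label.
scores = [
    {"class": "A", "score": 71},
    {"class": "B", "score": 88},
    {"class": "A", "score": 93},
    {"class": "B", "score": 64},
]

best = split_apply_combine(scores, "class", "score", max)
print(best)  # {'A': 93, 'B': 88}
```

Passing a different function (for example `statistics.median`) changes only the apply step; the split and combine steps stay the same.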

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:

    • data.table is an extension of data.frame that is optimized for split-apply-combine operations among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library introduces data objects that include a group-by method for this type of operation.
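
As a rough sketch of the pandas group-by approach, the median-income example could look like the following. The column names and the figures are invented for illustration; `groupby`, `median`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Made-up individual-level data (assumed column names).
people = pd.DataFrame({
    "country": ["FR", "FR", "FR", "DE", "DE"],
    "income": [30000, 42000, 36000, 50000, 40000],
})

# Split by country, apply the median, combine into one value per country.
median_by_country = people.groupby("country")["income"].median()

# Same computation, but the result is broadcast back to every original row,
# i.e. appended to the same data.
people["country_median"] = people.groupby("country")["income"].transform("median")

print(median_by_country)
print(people)
```

The first form returns one row per group; `transform` keeps the original shape, which is the variant to use when the group statistic should sit next to each individual observation.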

151 questions
4
votes
2 answers

How to use split-apply-combine pattern of pandas groupby() to normalize multiple columns simultaneously

I am trying to normalize experimental data in a pandas data table that contains multiple columns with numerical observables (features), columns with date and experiment conditions as well as additional non-numerical conditions such as filenames. I…
volkerH
  • 115
  • 2
  • 7
4
votes
3 answers

How to count rows in nested data_frames with dplyr

Here's a dumb example dataframe: df <- data_frame(A = c(rep(1, 5), rep(2, 4)), B = 1:9) %>% group_by(A) %>% nest() which looks like this: > df # A tibble: 2 × 2 A data 1 1 2 …
crf
  • 1,810
  • 3
  • 15
  • 23
4
votes
1 answer

Pandas timestamp difference in groupby transform

I have a dataframe with an integer index, session_id, event, and time_stamp that looks like this: In [41]: df = pd.DataFrame(data={'session_id': np.sort(np.random.choice(np.arange(3), 11)), 'event': np.random.choice(['A', 'B', 'C', 'D'], 11),…
lenderson
  • 135
  • 1
  • 5
4
votes
2 answers

Use lapply() to find percentages of factor variables

I have a data frame that consists of 4 columns that represent questions, and each column has 4 levels that represent responses. Q1 Q2 1 A A 2 A B 3 B B 4 C C 5 D D And I'd like to derive a data.frame such as this: question response…
cangers
  • 390
  • 2
  • 9
4
votes
2 answers

Use dplyr's group_by to perform split-apply-combine

I am trying to use dplyr to do the following: tapply(iris$Petal.Length, iris$Species, shapiro.test) I want to split the Petal.Lengths by Species, and apply a function, in this case shapiro.test. I read this SO question and quite a number of other…
Ram Narasimhan
  • 22,341
  • 5
  • 49
  • 55
4
votes
3 answers

How to speed up this Rcpp function?

I wish to implement a simple split-apply-combine routine in Rcpp where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R, but often takes quite some time. I have…
coffeinjunky
  • 11,254
  • 39
  • 57
3
votes
2 answers

Efficient way to paste multiple column pairs in R data.table

I'm looking for an efficient way to paste/combine multiple pairs of adjacent columns at once using data.table. My feeble attempt is slow and not so elegant: library(data.table) dt <- data.table(ids = 1:3, x1 = c("A","B","C"), …
Bryan
  • 933
  • 1
  • 7
  • 21
3
votes
1 answer

Groupby cumulative mean with wide/long pivoting

I have a DataFrame that looks like this (see bottom here for code to reproduce it): date id_ val 0 2017-01-08 a; b 9.3 1 2017-01-07 a; b; c 7.9 2 2017-01-07 a 7.3 3 2017-01-06 b 9.0 4 2017-01-06 c …
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
3
votes
1 answer

data.table: aggregate, join, and assign by reference

let's call dta the table I want to assign to, and dts the source of the data I want to join and aggregate to dta. dta = data.table(i=1:4, x=rnorm(4)) dts = data.table(i=rep(1:3, each=3), z=runif(9)) I would think I should be able to join on 'i' and…
James
  • 630
  • 1
  • 6
  • 15
3
votes
2 answers

Quantile threshold/filter within pandas groupby

I have one categorical variable and two numeric cols: np.random.seed(123) df = pd.DataFrame({'group' : ['a']*10+['b']*10, 'var1' : np.random.randn(20), 'var2' : np.random.randint(10,size=20)}) I want to…
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
3
votes
2 answers

Building complex subsets in Pandas DataFrame

I'm making my way around GroupBy, but I still need some help. Let's say that I've a DataFrame with columns Group, giving objects group number, some parameter R and spherical coordinates RA and Dec. Here is a mock DataFrame: df = pd.DataFrame({ …
Matt
  • 763
  • 1
  • 7
  • 25
3
votes
1 answer

split apply combine w/ function or purrr package pmap?

This is a big problem for me to solve. If I had enough reputation to award a bounty I would! Looking to balance territories of accounts of sales reps. I have the process broken up, and I don't really know how to do it across each region. In this…
Matt W.
  • 3,692
  • 2
  • 23
  • 46
3
votes
4 answers

Group by columns, then compute mean and sd of every other column in R

How do I group by columns, then compute the mean and standard deviation of every other column in R? As an example, consider the famous Iris data set. I want to do something similar to grouping by species, then compute the mean and sd of the…
I Like to Code
  • 7,101
  • 13
  • 38
  • 48
3
votes
1 answer

How to separate factor interactions in R

I recently had to graph some data based on an interaction of factors and I found it more difficult than I felt something this common should be in R. I suspect I'm missing something. Let's say I have a vector of 30 numbers along with a pair of…
pglezen
  • 961
  • 8
  • 18
3
votes
2 answers

Fastest Way to Split Data Frame by Group, shuffle single vector in R

I am familiar with some of the split-apply-combine functions in R, like ddply, but I am unsure how to split a data frame, modify a single variable within each subset, and then recombine the subsets. I can do this manually, but there is surely a…
Michael Davidson
  • 1,391
  • 1
  • 14
  • 31