Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, these operations are composed of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
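The three steps above can be sketched in plain Python using only the standard library. This is a minimal illustration, not a reference implementation; the `records` data and the names `chunks` and `means` are hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (group label, measurement) pairs.
records = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("a", 5.0)]

# 1. Split: collect the measurements into one chunk per group label.
chunks = defaultdict(list)
for label, value in records:
    chunks[label].append(value)

# 2. Apply: compute the statistic on each chunk independently.
# 3. Combine: gather the per-chunk results into a single mapping.
means = {label: mean(values) for label, values in chunks.items()}

print(means)
```

Dedicated tools perform the same three steps, usually with a shorter syntax and better performance on large data.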

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library provides data objects (such as DataFrame and Series) with a groupby method for this type of operation.
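As a rough sketch of the pandas approach, the median-income example could be written as follows. The data frame and its column names (`country`, `income`, `country_median`) are made up for illustration; `groupby`, `median`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Hypothetical individual-level data.
df = pd.DataFrame({
    "country": ["NO", "NO", "NO", "BR", "BR"],
    "income": [52000, 61000, 58000, 9000, 15000],
})

# One value per group: median income by country.
median_by_country = df.groupby("country")["income"].median()

# The same statistic appended to the original rows,
# so each person carries the median of their own country.
df["country_median"] = df.groupby("country")["income"].transform("median")

print(median_by_country)
print(df)
```

The first form returns one row per group; `transform` returns a result aligned with the original rows, which is useful when the statistic should be appended to the same data.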

151 questions
0
votes
2 answers

Calculating age per animal by subtracting years in R

I am looking to calculate relative age of animals. I need to subtract sequentially each year from the next for each animal in my dataset. Because an animal can have multiple reproductive events in a year, I need the age for the remaining events in…
Constantin
0
votes
3 answers

Calculating an average for unique value combinations

I have a data set with the following columns: locID = the location of ID of the observer yr = the year of the observation in categorical format: P_year maxFlock = a number counted by the observer lat = latitude of the…
Heliornis
0
votes
1 answer

Pandas Running Subtotal with Filtering - Apply and Lambda?

I'm trying to build something, that for each record in a pandas database, will show the total for a given column and also show the total for certain records in a given column that occur prior to the date of that record. Note that the comparison…
0
votes
2 answers

Operate on columns based on a variable

I have the following data df <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2016L,…
Morpheu5
0
votes
1 answer

Assign column value (split apply combine) based on another column

I have data of format set.seed(40) subject <- sample(c("mike", "john", "steve"), 20, replace = TRUE) test1 <- sample(c("pos", "neg", "pos", "neg", "NA"), 20, replace = TRUE) testdate <- Sys.Date() + sample(-1000:1000, 20, replace = FALSE) mydf <-…
marcel
0
votes
0 answers

Applying a function for all subsets of a dataframe

I am trying to apply a function to each and every subset of a data frame based on values in two columns. The following example is a simplified representation of my problem. Say I have the following data…
rm167
0
votes
2 answers

R Data transform - Columns to Rows and aggregate

I'm struggling with a data transformation in R. The data I receive is of this type: input <- data.frame(AF = sample(0:1, 100, replace=TRUE), CAD = sample(0:1, 100, replace=TRUE), CHF = sample(0:1, 100,…
0
votes
2 answers

How can I sort DF and subtotal based on Profit and NumberDays

I have data in a CSV that looks like this.. CUSIP BuyDate SellDate BuyAmount SellAmount Profit DaysHolding Over365Days 037833100 12/1/2015 3/1/2017 45 27 -18 456 1 17275R102 1/28/2016 2/21/2017 28 25 -3 390 1 38259P508 …
ASH
0
votes
0 answers

split apply combine with dplyr do, and function

I have a dataframe I'm splitting by grouping and then running a function on each of the grouped portions with do(). The problem I'm having is that there is a variable inside the function that needs to change based on each different group. How can I…
Matt W.
0
votes
1 answer

Processing lists of lists by group

I would like to process a list of lists. Specifically I want to extract the dataframe that is the third member of each list by a grouping variable (the first member of each list) and then use several functions like mean(), median(), sd(), length()…
TBP
0
votes
3 answers

Cut a variable differently based on another grouping variable

Example: I have a dataset of heights by gender. I'd like to split the heights into low and high where the cut points are defined as the mean - 2sd within each gender. example dataset: set.seed(8) df = data.frame(sex = c(rep("M",100),…
Brian D
0
votes
1 answer

R - Conditional IF Minus Each Row Matching Condition

My data set contains a column for product type and for purchase quantity. I would like to be able to subtract the average purchase quantity for each product type from the actual purchase on each line. I have a data set that looks roughly like…
Nick Criswell
0
votes
0 answers

Matlab2016 splitapply with a function that has a nonscalar output

I have a table in Matlab2016 and I'd like to apply a function on groupings of a column. I know the splitapply function can do this but I'd like to use a function such as tiedrank where the output is nonscalar but still specific to the entries in the…
I.S
0
votes
0 answers

Compute z-score by two groups

I have a repeated measures data-set that I am working on. The data look like this: ID=c('X1', 'X1', 'X1', 'X1', 'X2', 'X2', 'X2', 'X3', 'X3', 'X3', 'X3', 'X4', 'X4', 'X4', 'X4', 'X5', 'X5', 'X5', 'X6', 'X6', 'X6', 'X6') Diag=c('Con', 'Con', 'Con',…
Sid0311
0
votes
0 answers

Calling objects as column name in R Combine Function

I have an array (pstype) whose elements are column names for another data frame. I would like to call a column name [Array Element] one by one from that array by attaching ".x" and ".y" to it and put those columns in a new data…
Abhijeet Arora