Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, an operation has three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece
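The three steps above can be sketched in plain Python. This is only an illustration, not a reference implementation: the helper name `split_apply_combine` and the toy `data` are assumptions made for the example, and real workloads would normally use one of the dedicated tools.

```python
from collections import defaultdict


def split_apply_combine(rows, key, func):
    """Group rows by key(row), apply func to each group, return {group: result}."""
    # 1. Split: bucket the rows by the value of the grouping key.
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    # 2. Apply: run the function on each chunk independently.
    # 3. Combine: collect the per-chunk results into a single dict.
    return {group: func(chunk) for group, chunk in groups.items()}


# Hypothetical toy data: (group, value) pairs.
data = [("a", 1), ("b", 10), ("a", 3), ("b", 20), ("a", 5)]

totals = split_apply_combine(
    data,
    key=lambda row: row[0],
    func=lambda chunk: sum(value for _, value in chunk),
)
print(totals)
```

Passing a different `func` (a mean, a maximum, a count) changes the statistic without touching the split or combine steps.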

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score in each class from student scores
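Both examples can be sketched with the Python standard library. The records below (`people`, `scores`) and the field name `country_median` are made up for illustration; the second half of the first example shows the "appending the result to the same data" variant.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records.
people = [
    {"name": "Ana", "country": "BR", "income": 30},
    {"name": "Bo", "country": "SE", "income": 50},
    {"name": "Caio", "country": "BR", "income": 40},
    {"name": "Dag", "country": "SE", "income": 70},
    {"name": "Eva", "country": "BR", "income": 80},
]

# Split incomes by country, then apply the median to each chunk.
incomes = defaultdict(list)
for person in people:
    incomes[person["country"]].append(person["income"])
median_income = {country: median(values) for country, values in incomes.items()}

# Combine by appending each group's median back onto the original rows.
people_with_median = [
    {**person, "country_median": median_income[person["country"]]}
    for person in people
]

# Hypothetical (class, score) pairs: keep the highest score per class.
scores = [("math", 71), ("art", 88), ("math", 93), ("art", 64)]
best = {}
for cls, score in scores:
    best[cls] = max(score, best.get(cls, score))

print(median_income)
print(best)
```

The first example returns one row per input row (the group statistic is broadcast back), while the second returns one value per group.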

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library introduces data objects that include a group-by method for this type of operation.

151 questions
0
votes
2 answers

How to summarise a table by 2 columns in R

I would like to summarise this data set by grouping 1st by period, and 2nd by Payer id so that results are shown as subtotal for any given user by month as follows: data.frame: Payer Period 1 10 1-1015 2 15 2-1015 3 14 3-1015 1 1 …
Chelo F
  • 43
  • 5
0
votes
1 answer

Complicated subtraction in R

I am working on a data-set that requires me to subtract information from columns. It is a repeated measure data-set where one person is tested up to a max of six times and a minimum of two times. The data are in long format. Here's a sample…
Sid0311
  • 89
  • 7
0
votes
2 answers

Adding aggregated counts as extra dataframe rows

I have a data frame with the letters of the English alphabet and their frequency. Now it would be nice to also know the frequency of the vowels and the consonants and the total number of occurrences - and since I want to plot all of this…
not_a_number
  • 305
  • 1
  • 6
  • 18
0
votes
2 answers

Sum certain values from changing dataframe in R

I have a data frame that I would like to aggregate by adding certain values. Say I have six clusters. I then feed data from each cluster into some function that generates a value x which is then put into the output data frame. cluster year …
adaml768
  • 29
  • 1
  • 7
0
votes
1 answer

pandas - Perform computation against a reference record within groups

For each row of data in a DataFrame I would like to compute the number of unique values in columns A and B for that particular row and a reference row within the group identified by another column ID. Here is a toy dataset: d = {'ID' :…
sriramn
  • 2,338
  • 4
  • 35
  • 45
0
votes
2 answers

Collapse a character vector by value in another column r

I have a dataframe with a set of character strings in one column, and a grouping variable (a string, but could be a factor) in another. I'd like to collapse the dataframe such that the strings are collapsed into elements by grouping-variable. For…
sjgknight
  • 393
  • 1
  • 5
  • 19
0
votes
2 answers

R loop over levels of a factor to create a sequence of numbers for each level

I'm working on a dataframe with GPS data from beavers. The dataframe includes one column with the animal's id (see $id below), which is a factor with 26 levels. For each beaver, we have several GPS values - the number differs from animal to animal. I…
Pat
  • 217
  • 1
  • 6
0
votes
1 answer

Group androgynous names and sum amount for each year in a data frame in R

I have a data frame with 4 columns titled 'year' 'name' 'sex' 'amount'. Here is a sample data set set.seed(1) data = data.frame(year=sample(1950:2000, 50, replace=TRUE),name=sample(LETTERS, 50, replace=TRUE), …
beck8
  • 35
  • 4
0
votes
1 answer

Java ArrayList adding current item to Previous item; remove current item

Purpose of the code is to iterate thru each item in ArrayList> listOfLists and combine previous list to current list, sort the current list and remove the next list (since already combined). This needs to happen until there is only one list left.…
shivster
  • 107
  • 1
  • 7
0
votes
2 answers

Efficient conditional summing by multiple conditions in R

I'm struggling with finding an efficient solution for the following problem: I have a large manipulated data frame with around 8 columns and 80000 rows that generally includes multiple data types. I want to create a new data frame that includes the…
-1
votes
1 answer

r split-apply-combine problems

I'm new to r and have a large data.frame (906 rows), and I want to (row?) split the data.frame by the first column (entries associated with the same name are together) before I apply multiple descriptive statistics (mean, standard deviation,…
Paige
  • 3
  • 1
-1
votes
1 answer

Combining rows by index in R

EDIT: I am aware there is a similar question that has been answered, but it does not work for me on the dataset I have provided below. The above dataframe is the result of me using the spread function. I am still not sure how to consolidate…
melbez
  • 960
  • 1
  • 13
  • 36
-1
votes
1 answer

A column that's omitted during split-apply-combine in pandas

I'm doing a split-apply-combine to find a total quantity for each member. The dataframe I need should have 14 columns: MemberID, DSFS_0_1, DSFS_1_2, DSFS_2_3, DSFS_3_4, DSFS_4_5, DSFS_5_6, DSFS_6_7, DSFS_7_8, DSFS_8_9, DSFS_9_10, DSFS_10_11,…
-1
votes
2 answers

Remove NAs from each variable (column) and combine cases

I have a dataset that I am cleaning up and have certain rows (observations) which I would like to combine. The best way to explain what I am trying to do is with the following…
rjss
  • 935
  • 10
  • 23
-3
votes
1 answer

Calculate mean and add in new row in R but to reflect in all the entries of a particular column

I have a dataset like the one below, and I read it as a csv file and load the dataframe as df Name Value1 Value1 A 2 5 A 1 5 B 3 4 B 1 4 C 0 3 C 5 …
Joe
  • 35
  • 1
  • 6