Questions tagged [split-apply-combine]

Split-apply-combine operations are a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. As the name implies, they are composed of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the results back into one piece
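The three steps above can be sketched in plain Python. This is only a minimal illustration, assuming a small made-up list of (group, value) records and the mean as the applied function; the names `records`, `chunks`, `means`, and `combined` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: (group, value) records
records = [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)]

# 1. Split: collect the values belonging to each group
chunks = defaultdict(list)
for group, value in records:
    chunks[group].append(value)

# 2. Apply: compute the statistic on each chunk independently
means = {group: mean(values) for group, values in chunks.items()}

# 3. Combine: attach each group's result back onto the original rows
combined = [(group, value, means[group]) for group, value in records]
```

Here the combine step appends the group statistic to every original row; returning `means` alone (one value per group) is the other common way to combine.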

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Finding the highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment, there are dedicated packages for this purpose:
    • data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
    • dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library provides data objects with a group-by method (groupby) for this type of operation.
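As a rough sketch of the pandas approach, the two examples mentioned earlier (median income by country, highest score per class) might look like the following. The data frames and column names are invented for illustration; only `groupby`, `median`, `max`, and `transform` are actual pandas methods.

```python
import pandas as pd

# Hypothetical individual-level data
people = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "income": [10, 30, 20, 5, 15],
})

# Aggregate: one median income per country
median_by_country = people.groupby("country")["income"].median()

# Transform: broadcast each country's median back onto the original rows
people["median_income"] = people.groupby("country")["income"].transform("median")

# Hypothetical student scores
scores = pd.DataFrame({
    "class": ["x", "x", "y", "y"],
    "score": [70, 90, 85, 60],
})

# Highest score for each class
top_score = scores.groupby("class")["score"].max()
```

Aggregation returns one row per group, while `transform` returns a result aligned with the original rows, which covers the "appending the result to the same data" case.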

151 questions
2 votes, 2 answers

Avoiding the use of for loop for cumsum

First generating some sample data: doy <- rep(1:365,times=2) year <- rep(2000:2001,each=365) set.seed(1) value <-runif(min=0,max=10,365*2) doy.range <- c(40,50,60,80) thres <- 200 df <- data.frame(cbind(doy,year,value)) What I want to do is…
89_Simple
2 votes, 1 answer

Replacing dataframe values by median value of group

Apologies if this is a repeat, I didn't find a similar answer. Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median…
user7748226
2 votes, 1 answer

tidyr or dplyr equivalent of JMP split table

JMP has a "split table" platform: http://www.jmp.com/support/help/Split_Columns.shtml Here is the image for it: The "split by" becomes part of the column headers. The "split columns" are the columns spread out. The "group" are retained columns.…
EngrStudent
2 votes, 3 answers

Get most common column for each column value

I want the most common letter for each number. I've tried a variety of things; not sure what's the right way. import pandas as pd from pandas import DataFrame, Series original = DataFrame({ 'letter': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}, …
Hatshepsut
2 votes, 1 answer

How to write a construction of code as a function

I am new to programming and python and would like to write the following piece of code as a function using the 'def' 'return' construction: df.loc[df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1, 'CONSECUTIVE_DAY'] =…
Justin
2 votes, 2 answers

Generate All ID Pairs, by group with data.table in R

I have a data.table with many individuals (with ids) in many groups. Within each group, I would like to find every combination of ids (every pair of individuals). I know how to do this with a split-apply-combine approach, but I am hoping that a…
Michael Davidson
2 votes, 3 answers

Generate random numbers in an R dataframe which are constant across similar-rows

I have a dataframe containing X rows per 'user', where X is not constant between users. What I would like to do is to be able to generate random numbers to fill a new column, but for each 'user' the random number is the same across all of the rows…
anthr
2 votes, 2 answers

Generate pairings within World Cup tournament groups

I put some data together for the 2015 FIFA Women's World Cup: import pandas as pd df = pd.DataFrame({ 'team':['Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain', 'China','New…
selwyth
2 votes, 3 answers

split, apply and combine on 2 columns of data

I've got a dataframe consisting of a group and 2 value columns, as such: group val1 val2 A 5 3 A 2 4 A 3 1 B 3 6 B 2 1 B 0 2 I want to work out the number of rows where val1…
Stuart Lacy
1 vote, 0 answers

How to combine and make an array into a cell of DataFrames.jl?

Let's say df = DataFrame(a=[1]) Row │ a │ Int64 ─────┼─────── 1 │ 1 We tried to combine the data into a new column holding arrays: combine(df, :a => x->[1,2]) Row │ a_function │ Int64 ─────┼──────────── 1 │ 1 …
vochicong
1 vote, 1 answer

How do I split-apply-combine on a dataframe with multi-index columns?

df = pd.DataFrame([[100,90,80,70,36,45], [101,78,65,88,55,78], [92,77,42,79,43,32], [103,98,76,54,45,65]], index = pd.date_range(start='2022-01-01' ,periods=4)) df.columns = pd.MultiIndex.from_tuples((("mkf", "Open"), ("mkf", "Close"), ("tdf",…
jgg
1 vote, 1 answer

Simple Split Apply Combine, custom function

I am using the split-apply-combine pattern in pandas to create a new column, which measures the difference between two time stamps. The following is a simplified example of my problem. Say, I have this df df =…
1 vote, 1 answer

How to write custom aggregate function in pandas that transforms a series?

So I have a dataframe like this df = pd.DataFrame({'item_id':[1,2,3,4,5,6,7,8,9,10], 'category':['A', 'B', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'C'], 'sales': [100, 150, 300, 1000, 300, 50, 1000, 600, 700, 100]}) item_id category sales 0 1…
1 vote, 1 answer

Pandas combine dataframes by stacking columns with values on matching condition

I want to combine dataframes 1 and 2 in the following way: The date column is one key The second key is the header of dataframe 1 with the Project variable of dataframe 2 In the new dataframe V1 corresponds to the value in dataframe 1 where these…
beerzy
1 vote, 1 answer

Pandas: Swapping specific column values within one Dataframe and calculate its weighted averages

There exists the following dataframe: year pop0 pop1 city0 city1 2019 20 40 Malibu NYC 2018 8 60 Sydney Dublin 2018 36 23 NYC Malibu 2020 17 44 Malibu NYC 2019 5 55 Sydney Dublin I would like to calculate the weighted average for…