Questions tagged [split-apply-combine]

Split-apply-combine operations refer to a common type of data manipulation in which a function or statistic is computed independently on several chunks of data. The chunks are defined by the values of one or more variables. The operation consists of three steps:

- Splitting the data by the value of one or more variables
- Applying a function to each chunk of data independently
- Combining the results back into one piece

Examples of split-apply-combine operations include:

- Computing median income by country from individual-level data (possibly appending the result to the same data)
- Finding the highest score for each class from student scores
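The three steps can be sketched in plain Python for the median-income example. This is only an illustration using the standard library: the `records` data, the country labels, and the helper name `split_apply_combine` are hypothetical and not taken from any of the tools listed here.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records: (country, income)
records = [
    ("A", 30_000),
    ("A", 50_000),
    ("A", 40_000),
    ("B", 20_000),
    ("B", 70_000),
]


def split_apply_combine(rows, func):
    """Group (key, value) rows by key and apply func to each group."""
    # Split: collect the values belonging to each key
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Apply: compute the statistic on each chunk independently
    # Combine: gather the per-group results into a single dict
    return {key: func(values) for key, values in groups.items()}


median_income = split_apply_combine(records, median)

# Optionally append the group-level result to the original rows
with_median = [(country, income, median_income[country])
               for country, income in records]
```

Dedicated libraries perform the same grouping far more efficiently, but the structure of the computation is the same.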

Tools for streamlining split-apply-combine operations are available in popular statistical computing environments (non-exhaustive list):

- In the R statistical environment, there are dedicated packages for this purpose:
  - data.table is an extension of data.frame that is optimized for split-apply-combine operations, among other things
  - dplyr and its predecessor plyr provide convenient syntax and fast processing for such manipulations
- In Python, the pandas library provides data objects with a groupby method for this type of operation.
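As a rough sketch of the pandas approach, the "highest score for each class" example could look like the following. The `scores` frame and its column names are made up for illustration; `groupby`, `max`, and `transform` are standard pandas methods.

```python
import pandas as pd

# Hypothetical student scores
scores = pd.DataFrame({
    "class": ["x", "x", "y", "y", "y"],
    "student": ["s1", "s2", "s3", "s4", "s5"],
    "score": [71, 88, 64, 93, 80],
})

# Split by class, apply max, combine into one Series indexed by class
highest = scores.groupby("class")["score"].max()

# transform returns a result aligned with the original rows, which is
# convenient for appending the group statistic to the same data
scores["class_max"] = scores.groupby("class")["score"].transform("max")
```

Using `transform` rather than an aggregation keeps one row per student, which matches the "appending the result to the same data" variant of the pattern.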

151 questions

2 answers

Avoiding the use of for loop for cumsum

First generating some sample data:

    doy <- rep(1:365,times=2)
    year <- rep(2000:2001,each=365)
    set.seed(1)
    value <- runif(min=0,max=10,365*2)
    doy.range <- c(40,50,60,80)
    thres <- 200
    df <- data.frame(cbind(doy,year,value))

What I want to do is…

asked Dec 12 '17 at 12:17

89_Simple

1 answer

Replacing dataframe values by median value of group

Apologies if this is a repeat, I didn't find a similar answer. Big picture: I have a df with NaN values which I would like to replace with an imputed median value for that column. However, the built-in imputers in sklearn that I found use the median…

python pandas split-apply-combine

asked Jul 26 '17 at 20:07

user7748226

1 answer

tidyr or dplyr equivalent of JMP split table

JMP has a "split table" platform: http://www.jmp.com/support/help/Split_Columns.shtml Here is the image for it: The "split by" becomes part of the column headers. The "split columns" are the columns spread out. The "group" are retained columns.…

r dplyr split-apply-combine

asked Jun 12 '17 at 17:32

EngrStudent

3 answers

Get most common column for each column value

I want the most common letter for each number. I've tried a variety of things; not sure what's the right way.

    import pandas as pd
    from pandas import DataFrame, Series

    original = DataFrame({ 'letter': {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B'}, …

python pandas split-apply-combine

asked Jun 07 '17 at 05:06

Hatshepsut

1 answer

How to write a construction of code as a function

I am new to programming and python and would like to write the following piece of code as a function using the 'def' 'return' construction: df.loc[df['DATE_INT'].shift(-1) - df['DATE_INT'] == 1, 'CONSECUTIVE_DAY'] =…

python python-2.7 function pandas split-apply-combine

asked Aug 04 '16 at 19:26

Justin

2 answers

Generate All ID Pairs, by group with data.table in R

I have a data.table with many individuals (with ids) in many groups. Within each group, I would like to find every combination of ids (every pair of individuals). I know how to do this with a split-apply-combine approach, but I am hoping that a…

r data.table split-apply-combine

asked May 19 '16 at 21:07

Michael Davidson

3 answers

Generate random numbers in an R dataframe which are constant across similar-rows

I have a dataframe containing X rows per 'user', where X is not constant between users. What I would like to do is to be able to generate random numbers to fill a new column, but for each 'user' the random number is the same across all of the rows…

r dataframe split-apply-combine

asked Nov 25 '15 at 19:51

anthr

2 answers

Generate pairings within World Cup tournament groups

I put some data together for the 2015 FIFA Women's World Cup:

    import pandas as pd
    df = pd.DataFrame({ 'team':['Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain', 'China','New…

python r pandas plyr split-apply-combine

asked Jun 02 '15 at 21:04

selwyth

3 answers

split, apply and combine on 2 columns of data

I've got a dataframe consisting of a group and 2 value columns, as such:

    group val1 val2
    A     5    3
    A     2    4
    A     3    1
    B     3    6
    B     2    1
    B     0    2

I want to work out the number of rows where val1…

r split-apply-combine

asked Dec 04 '14 at 10:56

Stuart Lacy

0 answers

How to combine and make an array into a cell of DataFrames.jl?

Let's say

    df = DataFrame(a=[1])

     Row │ a
         │ Int64
    ─────┼───────
       1 │     1

We have tried to combine the data and make a new column holding arrays:

    combine(df, :a => x->[1,2])

     Row │ a_function
         │ Int64
    ─────┼────────────
       1 │          1
    …

arrays tuples split-apply-combine dataframes.jl

asked Nov 29 '22 at 08:25

vochicong

1 answer

How do I split-apply-combine on a dataframe with multi-index columns?

    df = pd.DataFrame([[100,90,80,70,36,45], [101,78,65,88,55,78], [92,77,42,79,43,32], [103,98,76,54,45,65]], index = pd.date_range(start='2022-01-01' ,periods=4))
    df.columns = pd.MultiIndex.from_tuples((("mkf", "Open"), ("mkf", "Close"), ("tdf",…

python pandas split-apply-combine

asked Jan 28 '22 at 13:20

jgg

1 answer

Simple Split Apply Combine, custom function

I am using the split-apply-combine pattern in pandas to create a new column, which measures the difference between two time stamps. The following is a simplified example of my problem. Say, I have this df df =…

python pandas dataframe group-by split-apply-combine

asked Jul 29 '21 at 07:13

yash agarwal

1 answer

How to write custom aggregate function in pandas that transforms a series?

So I have a dataframe like this:

    df = pd.DataFrame({'item_id':[1,2,3,4,5,6,7,8,9,10], 'category':['A', 'B', 'A', 'C', 'B', 'B', 'C', 'A', 'A', 'C'], 'sales': [100, 150, 300, 1000, 300, 50, 1000, 600, 700, 100]})

       item_id category sales
    0        1…

python pandas group-by aggregation split-apply-combine

asked Jun 23 '21 at 18:39

André Luiz Tiago Soares

1 answer

Pandas combine dataframes by stacking columns with values on matching condition

I want to combine dataframes 1 and 2 in the following way: The date column is one key. The second key is the header of dataframe 1 with the Project variable of dataframe 2. In the new dataframe V1 corresponds to the value in dataframe 1 where these…

python pandas dataframe split-apply-combine

asked Mar 24 '21 at 23:16

beerzy

1 answer

Pandas: Swapping specific column values within one Dataframe and calculate its weighted averages

There exists the following dataframe:

    year pop0 pop1 city0  city1
    2019 20   40   Malibu NYC
    2018 8    60   Sydney Dublin
    2018 36   23   NYC    Malibu
    2020 17   44   Malibu NYC
    2019 5    55   Sydney Dublin

I would like to calculate the weighted average for…

python pandas dataframe sorting split-apply-combine

asked Jan 07 '21 at 01:29

Royal.Flush
