Questions tagged [split-apply-combine]

Split-apply-combine operations refer to a common type data manipulation where a function/statistic is computed on several chunks of data independently. The chunks are defined by the value of one variable.

Split-apply-combine operations refer to a common type data manipulation where a function/statistic is computed on several chunks of data independently. The chunks are defined by the value of one variable. As the name implies, they are composed of three parts:

  1. Splitting data by the value of one or more variables
  2. Applying a function to each chunk of data independently
  3. Combining the data back into one piece

Examples of split-apply-combine operations would be:

  • Computing median income by country from individual-level data (possibly appending the result to the same data)
  • Generating highest score for each class from student scores

Tools for streamlining split-apply-combine operations are available for popular statistical computation environments (non-exhaustive list):

  • In the R statistical environment there are dedicated packages for this purpose

    • data.table is an extension of data.frame that is optimized for split-apply-combine operations among other things
    • dplyr and the original package plyr provide convenient syntax and fast processing for such manipulations
  • In Python, the pandas library introduces data objects that include a group-by method for this type of operation.

151 questions
0
votes
1 answer

How do I efficiently process a large dataframe row-by-row?

I have a large dataframe (10,000,000+ rows) that I would like to process. I'm also fairly new to R, and want to better understand how to work with large datasets like this. I have a formula that I want to apply to each row in the dataframe. But I've…
Andrew
  • 85
  • 2
  • 6
0
votes
1 answer

Pandas: efficient way to combine dataframes

I'm looking for a more efficient way than pd.concat to combine two pandas DataFrames. I have a large DataFrame (~7GB in size) with the following columns - "A", "B", "C", "D". I want to groupby the frame by "A", then for each group: groupby by "B",…
Guy
  • 63
  • 1
  • 7
0
votes
0 answers

R: Date Parsing error when using Lubridate ymd() in R function, same code works outside the function I have written

I am new to stack overflow and probably would not have joined if I could have found an answer to this on my own. I am dealing with a fairly simple problem its a classic split-apply-combine task. I have a time series tibble containing environmental…
0
votes
1 answer

h2o.ai summary functions incompatible with group_by

I am trying to (substantially) accelerate some R code by moving to R+h2o.ai. I am grouping by a single factor variable but I get error when I try to compute windowed quantiles, skewness, or kurtosis. Is there a list of summary functions in h2o that…
EngrStudent
  • 1,924
  • 31
  • 46
0
votes
1 answer

How to use splitapply/findgroups on clustered/non-sequential graphs?

I need to implement splitapply function to non-sequential node index in graph. I implemented the splitapply function on a graph that has non-sequential clusters. The index numbers of the returned clusters were sequentially numbered but the graph…
0
votes
0 answers

combine the values of a specific column of a dataframe in one row or unit

I want to combine the different values/rows of a certain column. these values are texts and I want to combine them together to perform word count and find the most common words. the dataframe is called df and is made of 30 columns. I want to…
Talal Ghannam
  • 189
  • 2
  • 17
0
votes
1 answer

Split - Apply - Combine with dist_google() function (stplanr package)

I have the following data frame of long/lat points (points): GPSLatitude GPSLongitude 1 40.66126 22.89565 2 40.66127 22.89565 3 40.66128 22.89565 4 40.66130 22.89566 5 40.66131 22.89567 6 40.66132 …
Anas.S
  • 193
  • 1
  • 11
0
votes
3 answers

Create logical list with strsplit on combined words to subset data frame

I have tried to subset my data frame according a condition on specific column. For this purpose I need to create TRUE or FALSE info for each line on this column. But some line on this column has combine words and my code can not detect them. p <-…
eabanoz
  • 251
  • 3
  • 17
0
votes
1 answer

How to add a condition in split apply combine and repeat solution on each row?

I have the following pandas dataframe df: cluster tag amount name 1 0 200 Michael 2 1 1200 John 2 1 900 Daniel 2 0 3000 David 2 0 600…
callmeGuy
  • 944
  • 2
  • 11
  • 28
0
votes
2 answers

Combine column in data.frame by the same symbols

I want to combine a list of data.frame by the same symbols in text. Here is my data: d1 <- data.frame(Name = c("aaa", "bbb", "ccc","ddd","ggg", "eee"), ID = c("123", "456", "789", "101112", "131415", "161718"), stringsAsFactors = FALSE) d2 <-…
Denis
  • 159
  • 2
  • 4
  • 17
0
votes
0 answers

Convert Pandas DF to Nested JSON

I have a dataframe in pandas of the following form: # df name: cust_sim_data_product_agg: yearmo region products revenue 0 201711 CN ['Auto', 'Flood', 'Home', 'Liability', 'Life',... 690 1 201711 CN ['Auto', 'Flood', 'Home',…
Sledge
  • 1,245
  • 1
  • 23
  • 47
0
votes
1 answer

Pandas - directly add moving average columns from group by to dataframe

I have a dataframe with the following columns: name, date, day_index, value I want to add a 4th column to the same dataframe which is the exponentially weighted moving average of the 3rd column (value) for each name, sorted by first date and then…
oli5679
  • 1,709
  • 1
  • 22
  • 34
0
votes
2 answers

Combining several pairs of columns (containing numbers and NA's) simultaneously in R

I'm trying to determine how to combine columns efficiently. I've started with a dataframe that looks somewhat like the following. The variable names do not follow any specific pattern, and the columns I am trying to combine are not necessarily next…
melbez
  • 960
  • 1
  • 13
  • 36
0
votes
1 answer

A loop to create multiple data frames from a population data frame

Suppose I have a data frame called pop, and I wish to split this data frame by a categorical variable called replicate. This replicate consists out of 110 categories, and I wish to perform analyses on each data frame then the output of each must be…
wild west
  • 15
  • 3
0
votes
3 answers

Merging two columns with NAs and duplicated values

Subset of data frame: country1 country2 Japan Japan Netherlands Brazil …
Nautica
  • 2,004
  • 1
  • 12
  • 35