
I currently do a lot of descriptive analysis in R. I always work with a data.table such as df below:

net    <- seq(1, 20, by = 2)                                 # 10 values
gross  <- seq(2, 20, by = 2)                                 # 10 values
color  <- rep(c("green", "blue", "white"), length.out = 10)  # recycled to 10 values
height <- rep(c(170, 172, 180, 188), length.out = 10)        # recycled to 10 values

library(data.table)
df <- data.table(net, gross, color, height)

To obtain results, I apply a lot of filters. Sometimes I use a single filter, sometimes a combination of several, e.g.:

df_green_high <- df[color == "green" & height > 175]

In my real data.table, I have 7 columns and all kinds of filter combinations. Since I always address the same data.table, I'd like to find the most efficient way to filter the data.

So far, my files are organized like this (bottom-up):

  • execution level: multiple R scripts, each with a very specific job (no interaction between them), that calculate and write the results to an Excel file using XLConnect
  • source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is needed so that I can add/remove files on the execution level without touching anything else.
  • filter files: read the data.table and apply one or more filters, as shown above with df_green_high. Each filter file creates a new, filtered data.table and then sources the "source file" with this filtered table (a sketch of one such filter file follows this list).
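
For illustration only, here is a minimal sketch of what one such filter file might look like; the file name "source_file.R" is a hypothetical placeholder, not the real project layout:

filter_green_high.R:

library(data.table)

# df is assumed to already exist (created or loaded as shown above)
df_green_high <- df[color == "green" & height > 175]

# hand the filtered table to the execution level;
# "source_file.R" is a placeholder for the real source file
source("source_file.R")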

I am currently struggling because I have too many filter files. With 7 variables, there are so many possible filter combinations that I will get lost sooner or later.

  • How can I do my analysis more efficiently (i.e. reduce the number of "filter files")?

  • How can I conveniently name the exported files according to the filters used?

I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On Stack Overflow, I also read about creating a package, but I am not sure whether this is reasonable.

rmuc8
  • Why do you load `dplyr`? What's wrong with just `df[color=="green" & height>175]`? `data.table` uses automatic indexing under the hood (optimizing operations such as `==`, `%in%`, etc.), so those operations should be very efficient. You can also set the keys yourself and perform a binary join, for example `setkey(df, color) ; df[J("green")]`. Though this will have an advantage over plain `==` only on the first run, before `data.table` sets a secondary key; after that your code will be fully optimized. – David Arenburg Mar 13 '15 at 13:36
  • I changed the code for the filter as you suggested. However, this is not the focus of my problem ;-( – rmuc8 Mar 13 '15 at 13:43

1 Answer


I usually do it like this:

  • create a list, called say "my_case_list"
  • filter the data and do the computation on the filtered data
  • add a column called "case" to each filtered dataset and fill it with a descriptive string, e.g. "case 1: color=="green" & height>175"
  • put this data into my_case_list
  • convert the list to a data.frame-like object
  • export the results to SQL Server
  • import the results from SQL Server into an Excel pivot table
  • make sense of the results

Automate the process as much as possible.
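
A minimal sketch of this workflow, using the df from the question (the particular filters, case labels, and summary statistics are illustrative assumptions; the SQL Server and Excel steps are left out):

library(data.table)

# named list of filters; the names double as case labels
# (these particular cases and summaries are just examples)
cases <- list(
  "case 1: color==green & height>175" = function(d) d[color == "green" & height > 175],
  "case 2: color==blue"               = function(d) d[color == "blue"]
)

my_case_list <- lapply(names(cases), function(case_name) {
  filtered <- cases[[case_name]](df)              # filter the data
  res <- filtered[, .(mean_net   = mean(net),     # compute on the filtered data
                      mean_gross = mean(gross))]
  res[, case := case_name]                        # tag the result with its case label
  res
})

# combine all cases into one data.table
results <- rbindlist(my_case_list)

Because the case label travels with the data, the combined results can be named and sliced by the filters that produced them instead of being spread over many files.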

Miha Trošt