
I currently do a lot of descriptive analysis in R. I always work with a data.table such as df below:

net    <- seq(1, 20, by = 2)                                 # 10 values
gross  <- seq(2, 20, by = 2)                                 # 10 values
color  <- rep(c("green", "blue", "white"), length.out = 10)  # recycled to 10 values
height <- rep(c(170, 172, 180, 188), length.out = 10)        # recycled to 10 values

library(data.table)
df <- data.table(net, gross, color, height)

To obtain results, I apply a lot of filters. Sometimes I use a single filter, sometimes a combination of several, e.g.:

df_green_high <- df[color == "green" & height > 175]

In my real data.table, I have 7 columns and all kinds of filter combinations. Since I always address the same data.table, I'd like to find the most efficient way to filter the data.

So far, my files are organized like this (bottom-up):

  • execution level: multiple R scripts, each with a very specific job (no interaction between them), that calculate and write the results to an Excel file using XLConnect
  • source file: this file receives a pre-filtered data.table and sources all files from the execution level. It is needed so that I can add/remove files on the execution level without touching anything else.
  • filter files: read the data.table and apply one or more filters, as shown above with df_green_high. Each filter file creates a new, filtered data.table and then sources the "source file" with this filtered table (a sketch of one such filter file follows this list).
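
For illustration only, here is a minimal sketch of what one such filter file might look like; the file name "source_file.R" is a hypothetical placeholder, not the real project layout:

filter_green_high.R:

library(data.table)

# df is assumed to already exist (created or loaded as shown above)
df_green_high <- df[color == "green" & height > 175]

# hand the filtered table to the execution level;
# "source_file.R" is a placeholder for the real source file
source("source_file.R")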

I am currently struggling because I have too many filter files. With 7 variables, there are so many possible filter combinations that I will get lost sooner or later.

  • How can I do my analysis more efficiently (i.e. reduce the number of "filter files")?

  • How can I conveniently name the exported files according to the filters used?

I have read Workflow for statistical analysis and report writing and some other similar questions. However, in this case I always refer to the same basic table, so there should be a more efficient way. I do not have a CS background, so any help is highly appreciated. On Stack Overflow, I also read about creating a package, but I am not sure whether this is reasonable.

rmuc8
  • Why do you load `dplyr`? What's wrong with just `df[color=="green" & height>175]`? `data.table` uses automatic indexing under the hood (optimizing operations such as `==`, `%in%`, etc.), so those operations should be very efficient. You can also set the keys yourself and perform a binary join, for example `setkey(df, color) ; df[J("green")]`. Though this will have an advantage over plain `==` only on the first run, before `data.table` sets a secondary key; after that your code will be fully optimized. – David Arenburg Mar 13 '15 at 13:36
  • I changed the code for the filter as you suggested. However, this is not the focus of my problem ;-( – rmuc8 Mar 13 '15 at 13:43

1 Answer


I usually do it like this:

  • create a list, called say "my_case_list"
  • filter the data and do the computation on the filtered data
  • add a column called "case" to each filtered dataset and fill it with a descriptive string, e.g. "case 1: color=="green" & height>175"
  • put this data into my_case_list
  • convert the list to a data.frame-like object
  • export the results to SQL Server
  • import the results from SQL Server into an Excel pivot table
  • make sense of the results

Automate the process as much as possible.
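
A minimal sketch of this workflow, using the df from the question (the particular filters, case labels, and summary statistics are illustrative assumptions; the SQL Server and Excel steps are left out):

library(data.table)

# named list of filters; the names double as case labels
# (these particular cases and summaries are just examples)
cases <- list(
  "case 1: color==green & height>175" = function(d) d[color == "green" & height > 175],
  "case 2: color==blue"               = function(d) d[color == "blue"]
)

my_case_list <- lapply(names(cases), function(case_name) {
  filtered <- cases[[case_name]](df)              # filter the data
  res <- filtered[, .(mean_net   = mean(net),     # compute on the filtered data
                      mean_gross = mean(gross))]
  res[, case := case_name]                        # tag the result with its case label
  res
})

# combine all cases into one data.table
results <- rbindlist(my_case_list)

Because the case label travels with the data, the combined results can be named and sliced by the filters that produced them instead of being spread over many files.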

Miha Trošt