I have a table stored in a dataframe in R.

I want to calculate the first derivative along each column. Columns are measured variables, rows are time.

Can I vectorize this function?

df$C <- df$A + df$B

In principle, I'd like something like:

df$DiffA <- diff(df$A)

The problem is that I don't know how to vectorize functions that need A(n) and A(n+1), where n is the row index within the dataframe (pseudocode).

Doc
  • Please, can you elaborate your pseudo code ? Write your code using for loops if easier, then we will see if it's possible to vectorize it... – digEmAll Sep 28 '12 at 12:22
  • Well, as rows are time and the time intervals are equally spaced, the interval can be ignored for the purpose of differentiation. Therefore, I'm searching for `diffA(n) = A(n+1) - A(n)`. – Doc Sep 28 '12 at 12:30
  • `A[-1]-A[-length(A)]`, which is essentially how `diff` works – James Sep 28 '12 at 12:37
  • I do not understand what is actually being asked here. It seems like `diff` is the answer. However, if it is not, the question needs to be rephrased. – Roland Sep 28 '12 at 12:42
  • What does `df$C <- df$A + df$B` mean? What do you want with it? – JACKY88 Sep 28 '12 at 12:42
  • `diff` doesn't work on dataframes (as far as I have tried)... – Doc Sep 28 '12 at 12:50
  • @Patrick Li: just to clarify, I want to stay within the same dataframe and not loop through its rows or something similar. I'm talking about some 100,000 rows here, not a 3x3 table... – Doc Sep 28 '12 at 12:52
  • @Doc: do you want to apply the diff function to each column of the data.frame? If so, use `apply(df,MARGIN=2,FUN=diff)` – digEmAll Sep 28 '12 at 12:55
  • @Roland: I'm running out of ideas for how to rephrase the problem here... I need the first derivative of one column within a dataframe. `df <- data.frame(c(1:100)); colnames(df) <- c("n"); df$sqrt <- df$n^0.5; df$diff <- diff(df$sqrt, lag=1)` ... obviously doesn't work... – Doc Sep 28 '12 at 12:57
  • @Doc: that code doesn't work because diff returns one element fewer than the original vector. Add a 0 (or whatever) at the beginning or at the end and it will work, e.g. `df$diff <- c(0 , diff(df$sqrt,lag=1))` – digEmAll Sep 28 '12 at 13:02
  • @digEmAll: well, not every column. Just one specific column. And as I said, I can't get diff to work. I guess it's because I end up with one row fewer than the initial dataframe. Any solutions? – Doc Sep 28 '12 at 13:03
  • @Doc: I answered you in my previous comment, and Roland's answer also seems to be the code you're looking for... – digEmAll Sep 28 '12 at 13:08
  • @digEmAll: sorry! Our comments just overlapped while I was writing. Thanks for your help! – Doc Sep 28 '12 at 15:47

2 Answers

Based on the comments:

df <- data.frame(n=1:100) 
df$sqrt <- sqrt(df$n)
df$diff <- c(NA,diff(df$sqrt,lag=1))

diff returns one value fewer than the input vector contains, because each result is the difference between two neighbouring elements. You can make the lengths match by prepending or appending an NA value.
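
As a minimal sketch of the two padding options (the data and the column names `diff_back` and `diff_fwd` are made up for illustration): appending the NA gives the forward difference `A(n+1) - A(n)` asked for in the comments, while prepending it gives the backward difference `A(n) - A(n-1)`.

```r
# Hypothetical example data: equally spaced samples of sqrt(n)
A <- sqrt(1:10)

# diff() is equivalent to subtracting the vector from its shifted self
d <- diff(A)
same_as_shift <- isTRUE(all.equal(d, A[-1] - A[-length(A)]))

df <- data.frame(n = 1:10, A = A)

# Backward difference: row n holds A(n) - A(n-1); the first row has no predecessor
df$diff_back <- c(NA, d)

# Forward difference: row n holds A(n+1) - A(n); the last row has no successor
df$diff_fwd <- c(d, NA)
```

Either way the new column has as many rows as the data frame, so the assignment no longer fails on a length mismatch.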

Some timings:

#create a big data.frame
vec <- 1:1e6
df <- data.frame(a=vec,b=vec,c=vec,d=vec,e=vec,sqroot=sqrt(vec))

#for big datasets data.table is usually more efficient:
library(data.table)
dt <- data.table(df)

#benchmarks
library(microbenchmark)

microbenchmark(df$diff <- c(NA,diff(df$sqroot,lag=1)),
               dt[,diff:=c(NA,diff(sqroot,lag=1))])
Unit: milliseconds
                                            expr      min        lq    median        uq      max
1     df$diff <- c(NA, diff(df$sqroot, lag = 1)) 75.42700 116.62366 140.98300 151.11432 174.5697
2 dt[, `:=`(diff, c(NA, diff(sqroot, lag = 1)))] 37.39592  45.91857  52.21005  62.89996 119.7345

diff is fast, but for big datasets using a data.frame is not efficient. Use data.table instead: in the benchmark above its median time was about 52 ms versus about 141 ms for the data.frame. The speed gain becomes more pronounced the bigger the dataset is.

Roland
  • Thanks Roland. That obviously solves the coding problem. Just one minor question: is this still vectorized, or is it a function looping through the dataframe? As I wrote earlier, I'm facing a dataframe with close to 500 MB of continuous data (some 100 thousand measurements). Will this perform adequately? – Doc Sep 28 '12 at 15:50
  • `diff` is pretty fast. If that is not sufficient, you should ask for a more efficient alternative in a new question. Try to make it clearer what you are actually asking next time. If you had provided the example code in your comment from the beginning, this question could have been answered much faster. – Roland Sep 28 '12 at 18:18
  • That is really cool code! Thanks! I didn't know about the data.table option you used. – Doc Oct 01 '12 at 11:58

You might try the lag() or diff() functions. They would seem to do what you want.
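
A rough sketch of the diff() route applied to several chosen columns at once, assuming a small made-up data frame (the columns `t`, `A`, `B` and the `Diff` prefix are illustrative only). Note that base R's lag() shifts the time attribute of a time series rather than the values of a plain vector, so diff() is probably the more direct choice here.

```r
# Hypothetical data frame: rows are equally spaced time steps
df <- data.frame(t = 1:5,
                 A = c(1, 4, 9, 16, 25),
                 B = c(2, 4, 8, 16, 32))

# Columns to differentiate (names are assumptions for this example)
cols <- c("A", "B")

# diff() runs once per column; the leading NA keeps the row count unchanged
df[paste0("Diff", cols)] <- lapply(df[cols], function(x) c(NA, diff(x)))
```

This stays within the same data frame and does not loop over rows; the only iteration is over the selected columns.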

Chris