1

sometimes ago I asked the following question:

I have a list of deals with trading day and market value. Every (Trading)day new positions come into the list but the old one never disappear (when positions expire the value stays just constant). The list looks like as follows:

Deal Trade_Date MktValue Desired_Col
Deal1 31.08.2012 10 +10
Deal2 31.08.2012 21 +21
Deal1 03.09.2012 12 +2
Deal2 03.09.2012 19 -2
Deal3 03.09.2012 2  +2

I would like for each deal to get the difference to the previous trade date (Desidered_Col in the above example).

And the following solution was provided to me by Roland:

df <- read.table(text="Deal Trade_Date MktValue Desidered_Col Deal1 31.08.2012 10 +10 Deal2 31.08.2012 21 +21 Deal1 03.09.2012 12 +2 Deal2 03.09.2012 19 -2 Deal3 03.09.2012 2 +2",header=TRUE)

library(data.table) dt <- as.data.table(df)

diff.padded <- function(x) c(x[1],diff(x)) dt[,Desidered_Col2:=diff.padded(MktValue),by=Deal]

    Deal Trade_Date MktValue Desired_Col Desired_Col2
1: Deal1 31.08.2012       10            10             10
2: Deal2 31.08.2012       21            21             21
3: Deal1 03.09.2012       12             2              2
4: Deal2 03.09.2012       19            -2             -2
5: Deal3 03.09.2012        2             2              2

The solution works perfectly with data.table. However given the size of my table I decided to try to work with an ffdf object. Hence I have now my data in a ffdf file and I am trying to reproduce the same solution unfortunately without success. Do you have any advise how I can reproduce that in a ffdf? Thanks for your help.

here is the full code I am running:

# Load needed packages
library(RODBC)
library(data.table)
library(ETLUtils)
library(RSQLite)
library(ffbase)

calendar <- read.csv("Trading_Calendar.csv",sep=";",stringsAsFactors=FALSE)
calendar$STICHTAG <- as.Date(calendar$STICHTAG,"%d.%m.%Y")

ST_a=Sys.Date()-2
rd_a=as.Date("13.11.2012","%d.%m.%Y")
ST=paste("'",as.character(format(ST_a,"%d.%m.%Y")),"'",sep="")
rd=paste("'",as.character(format(rd_a,"%d.%m.%Y")),"'",sep="")

gc(TRUE)

st.strom <- calendar[calendar$STICHTAG>=rd_a & calendar$STICHTAG<=ST_a &   calendar$BR_Strom==1,"STICHTAG"]
st.strom <- format(st.strom,"%d.%m.%Y")
st.strom.s <- paste("('",do.call(paste, c(as.list(as.character(st.strom)), sep="','")),"')",sep="")


started.at=proc.time()
Sys.sleep(1)

memory.limit(size=4095)


query <- paste("select * from is_bewertung_data where commodity in ('CASH','COAL','CO2','ELEC','GCERT') 
               and stichtag in ",st.strom.s,sep="")

deals.strom <- read.odbc.ffdf(query = query,odbcConnect.args=list(dsn="dsn",uid="id",pwd="pwd"),
                       first.rows = 100000, next.rows = 500000, VERBOSE=TRUE)

result <- ffdfdply(deals.strom, deals.strom$DEALID, FUN=function(x){ 
  x <- split(x, x$DEALID)
  x <- lapply(x, FUN=function(onlyonedeal){
    onlyonedeal$Desidered_Col2 <- c(NA, -diff(onlyonedeal$STICHTAG))
    onlyonedeal
  })
  x <- do.call(rbind, x)      
  x
})
cat("Finished in",timetaken(started.at),"\n")

here the result of str(deals.strom[1:5,]):

'data.frame':   5 obs. of  39 variables:
 $ ABBREVIATION   : Factor w/ 33553 levels " C 251"," TÜV EE Donaustrom",..: 1893 1892 1894 1895 1896
 $ TRADEDATE      : POSIXct, format: "2007-06-19" "2007-06-19" "2007-06-19" ...
 $ BOOK           : Factor w/ 30 levels "CR_RIR_RISKRED",..: 10 10 10 10 10
 $ CONTRACT       : Factor w/ 20 levels "Base","DNULL",..: 1 5 5 1 1
 $ BUYSELL        : Factor w/ 2 levels "BUY","SELL": 2 1 2 1 1
 $ RATE           : num  54.2 57.2 57.3 54.2 55.1
 $ AMOUNT         : num  474792 501072 501773 474792 964476
 $ CUR            : Factor w/ 2 levels "EUR","USD": 1 1 1 1 1
 $ VOLUME         : num  8760 8760 8760 8760 17520
 $ UNIT           : Factor w/ 2 levels "MWH","t": 1 1 1 1 1
 $ STARTDATE      : POSIXct, format: "2010-01-01" "2010-01-01" "2010-01-01" ...
 $ ENDDATE        : POSIXct, format: "2011-01-01" "2011-01-01" "2011-01-01" ...
 $ BROKERAGE      : num  0 0 0 0 175
 $ DV             : num  85078 -98218 98919 -85078 -185048
 $ REALIZED       : num  85078 -98218 98919 -85078 -185048
 $ PV             : num  0 0 0 0 0
 $ DV_DAY         : num  0 0 0 0 0
 $ DV_MONTH       : num  0 0 0 0 0
 $ DV_YEAR        : num  0 0 0 0 0
 $ TRADER         : Factor w/ 16 levels "Adolf Plentz",..: 7 7 7 7 12
 $ ACTIVE         : Factor w/ 2 levels "LONGTERM","SHORTTERM": 2 2 2 2 2
 $ STATUS         : Factor w/ 2 levels "GCPTY","INT": 1 1 2 2 1
 $ PV_MIN         : num  0 0 0 0 0
 $ PV_PLUS        : num  0 0 0 0 0
 $ VERTRAGSPARTY  : Factor w/ 21 levels "EDL_G059","EDL_G097",..: 10 10 3 3 10
 $ GESELLSCHAFT   : Factor w/ 1 level "24/7 Trading": 1 1 1 1 1
 $ COMMODITY      : Factor w/ 5 levels "CASH","CO2","COAL",..: 4 4 4 4 4
 $ TO_BE_DELIVERED: num  0 0 0 0 0
 $ ACCOUNT        : Factor w/ 8 levels "CR_RISKRED","HO_COAL",..: 5 5 5 5 5
 $ VERW_PREIS     : num  0 0 0 0 0
 $ PV_ND          : num  0 0 0 0 0
 $ BILANZIERUNG   : Factor w/ 2 levels "JA","NEIN": 1 1 1 1 1
 $ MOTIV          : Factor w/ 8 levels "Emissionszertifikate",..: 4 4 4 4 4
 $ STICHTAG       : POSIXct, format: "2012-11-13" "2012-11-13" "2012-11-13" ...
 $ DEALID         : Factor w/ 59704 levels "FUX.E.EEX.K.20090622.002",..: 7175 7103 12584 12500 17985
 $ COUNTERPARTY   : Factor w/ 174 levels "24sieben GmbH",..: 171 171 53 53 141
 $ COMMODITY2     : Factor w/ 8 levels "CASH","CER","COAL",..: 4 4 4 4 4
 $ MARKTGEBIET    : Factor w/ 3 levels "Kohle","Strom",..: 2 2 2 2 2
 $ INSTRUMENT     : Factor w/ 88 levels "-","Elektrizität FUX EEX Base Apr11 EEXFUT",..: 1 1 1 1 1

my solution after Jan hint, not working:

test <- as.ffdf(deals.strom[,c("DEALID","STICHTAG","PV")])
test <- transform(test,chg=c(NA,diff(PV)),chg2=c(NA,-diff(PV)))
fdd <- as.ff(!duplicated(test$DEALID))
test[fdd,c("chg","chg2")] <- test[fdd,"PV"]

I get the following error msg:error: is.null(rownames(x)) is not TRUE. Somehow I cannot manage to subset the ffdf.

ddg
  • 2,493
  • 2
  • 20
  • 23

2 Answers2

1

Hi I found the following solution. It is working but I would appreciate if you have a more elegant solution. I´m still forced to use objects in RAM and I am worried that if data size increases I´ll have to process the data in pieces (which even less elegant as solution). The data is stored in a ffdf file. I have circa 21Mio. rows and 39 columns.

deals # ffdf with 21Mio. rows and 39 columns
deals <- ffdfsort(deals)

deals <- transform(deals, delta_MktValue=0)
diff.padded <- function(x) c(x[1],diff(x))
delta <- data.table(deals[,c("Deal","Trade_Date","MktValue")])

diff <- delta[,diff.padded(MktValue),by=Deal]

deals[,"delta_MktValue"] <- diff[,V1]

rm(diff)
rm(delta)
rm(delta_PV)
gc()

It is actually working but I would appreciate if someone can suggest a more elegant solution. In particular I would like to perform the calculation directly in the ffdf. Thanks!

ddg
  • 2,493
  • 2
  • 20
  • 23
1

Have you tried out ffdfdply in package ffbase? See e.g. here an example on how to use it. R language: problems computing "group by" or split with ff package.

So in your case do something like (I'm freewheeling here based on your example script, but you should understand the point of split-apply-combine in an ffdf setting)

require(ffbase)
result <- ffdfdply(deals[c("Deal","Trade_Date")], deals$Deal, FUN=function(x){ 
  x$Deal <- as.character(x$Deal)
  x <- split(x, x$Deal)
  x <- lapply(x, FUN=function(onlyonedeal){
    onlyonedeal$Desidered_Col2 <- c(NA, -diff(onlyonedeal$Trade_Date))
    onlyonedeal
  })
  x <- do.call(rbind, x)      
  x
})

Another solution would be. This doesn't use split-apply-rbind explicitely inside FUN.

require(ffbase)
require(doBy)
result <- ffdfdply(deals[c("DEALID","STICHTAG")], deals$DEALID, FUN=function(x){ 
  x <- orderBy(~ DEALID + STICHTAG, data = x)
  x$Desidered_Col2 <- c(NA, -diff(as.Date(x$STICHTAG)))
  firstdealdate <- !duplicated(x$DEALID)
  x$Desidered_Col2[firstdealdate] <- NA
  x
})
Community
  • 1
  • 1
  • Hi, thanks a lot for you answer. I tried the solution you provided, however I get pretty fast into the usual memory problem (cannot allocate etc...). Any suggestion? here what I see:*Garbage collection 4576 = 1839+573+2164 (level 2) ... 236.3 Mbytes of cons cells used (68%) 1401.6 Mbytes of vectors used (67%) Garbage collection 4577 = 1839+573+2165 (level 2) ... 236.3 Mbytes of cons cells used (68%) 1401.7 Mbytes of vectors used (67%) Fehler: kann Vektor der Größe 723 KB nicht allozieren* – ddg Nov 16 '12 at 15:23
  • We are talking about one year data, so each deal has about 365 data points. – ddg Nov 16 '12 at 16:14
  • Then the usage of ffdfdply is probably not the cause of your memory overflow. There are 2 options, or you did something else, or you have a lot of factors in your deals ffdf. So can you provide the exact code you are running? –  Nov 16 '12 at 16:19
  • Mark that in ff, your character columns are always factors and all your factor levels are always in RAM. So if you have a lot of columns with a lot of factor levels, this might be the cause if your memory issue? –  Nov 16 '12 at 16:40
  • This is true, I have 20 columns with factors. How can I work aroud that? what do you mean with "Mark that in ff"? Thanks. – ddg Nov 16 '12 at 16:49
  • Do you need the columns which have an enormous amount of factor levels? If not, don't select them from your DB. Also have a look at the key function in package ffbase as it allows you to create integer keys (unique combinations of elements) which do not put the factor levels in RAM. This key can replace your factors with too many levels. For a further discussion, you can contact me by email if you like. –  Nov 16 '12 at 16:56
  • Quick update. I tried to limit the df to only 3 column (DealID, date, value) and I still get into trouble with the memory. – ddg Nov 16 '12 at 16:57
  • can you provide the result of str(deals.strom[1:5,]) so that we can see how many factor levels each column has? –  Nov 17 '12 at 09:03
  • Hi, I posted the results. I tried to run it with a smaller set of Deals (about 3500) and it is working without problems even on a large time window (I tested it over 400 days). It looks like it depends on the number of DEALIDs in the data.frame – ddg Nov 19 '12 at 11:50
  • Thanks for the str output. In the above proposed solution, inside FUN a split is done which will make a list of data.frames, each with the 59704 factor levels for DEALID which is probably the issue here causing your RAM overflow. Avoid it by doing as.character. I have updated the answer to show this. In fact, you didn't need to split the data for your case - I also added an extra possibility to get to your result with using split in FUN. –  Nov 19 '12 at 13:24
  • Hi, many thanks for your solution. It is working also I do not get into memory trouble even with the deals at once. Given the size of the data it is also fairly fast. Levereging on your hint I tried also an alternative solution using transform (see above). It would have the advantage to add more than one column at once but I get in trouble at the end when I try to subset using a boolen operator. It is not clear to me what I am doing wrong. – ddg Nov 21 '12 at 14:40
  • Hi, good to know that it worked. I think your usage of transform is another question, it would be better if you created a new stackoverflow item for it. –  Nov 21 '12 at 16:18