2

So here's the situation. I've got an 85 Million row table with 18 columns. Three of these columns have values in Metric Prefix / SI notation (See Metric Prefix on Wikipedia).

This means I have number like :

  • .1M instead of 100000 or 1e+5, or
  • 1K instead of 1000 or 1e+3

Sample data.table is

          V1     V2   V3  V4  V5  V6 V7 V8 V9  V10 V11 V12 V13 V14 V15 V16 V17 V18
 1: 2014-03-25 12:15:12 58300 3010 44.0  4.5 0.0   0   0  0.8  50 0.8 10K 303 21K   0     a   56
 2: 2014-03-25 12:15:12 56328 3010 28.0 12.0 0.0   0   0  0.3  60 0.0  59  62 .1M   0     a   66
 3: 2014-03-25 12:15:12 21082 3010 10.0  1.7 0.0   0   0 14.0  72 0.3  4K 208  8K   1     a   80
 4: 2014-03-25 12:15:12 59423 3010 12.0  0.0 0.2   0   0 88.0   0 0.0  20  16  71   0     a   26
 5: 2014-03-25 12:15:12 59423 3010  9.6  1.4 0.0   0   0 60.0  29 0.2  2K 251  6K   0     a   56
 6: 2014-03-25 12:15:12 24193 3010  8.3  1.9 0.0   0   0  9.9  80 0.3  3K 264  8K   1     a   71
 7: 2014-03-25 12:15:12 21082 3010  7.1  1.7 0.4   0   0  6.3  83 0.3  3K 197  7K   0     a   71
 8: 2014-03-25 12:15:12 59423 3010  4.6  1.2 0.0   0   0 57.0  37 0.1 998  81  7K   0     a  118

I modified a function written by Hans-Jörg Bibiko who used it to modify ggplot2 scales. See website here if you are iterested. The function I ended up using is :

sitor <- function(x)
{
  conv <- paste("E", c(seq(-24 ,-3, by=3), -2, -1, 0, seq(3, 24, by=3)), sep="")
  names(conv) <- c("y","z","a","f","p","n","µ","m","c","d","","K","M","G","T","P","E","Z","Y")
  x <- as.character(x)
  num <- function(x) as.numeric(
      paste(
        strsplit(x,"[A-z|µ]")[[1]][3],
        ifelse(substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1) == "",
               "",
               conv[substr(paste(strsplit(x,"[0-9|\\.]")[[1]], sep="", collapse=""), 1, 1)]
        ),
        sep=""
      )
    )
  return(lapply(x,num))
}

I apply it to by data table to update 3 columns like

temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]

I have applied a data.table key vector to the temp table with

setkeyv(temp,c("V1","V2","V3","V18"))

Any 61 minutes later I am still here waiting for a result... Some tips on how to speed up this conversion would be really handy given that my data size is about to grow 4 to 5 times.

neurozen
  • 1,087
  • 13
  • 19
  • Output from top : `PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND` `4878 neurozen 20 0 18.7g 18g 11m R 100.1 62.8 63:38.95 rsession` – neurozen Mar 30 '14 at 11:48
  • What is running 61 minutes? `setkeyv(temp,c("V1","V2","V3","V18"))` or `temp[ ,`:=`(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ]`? Why are you trying to sort temp? – David Arenburg Mar 30 '14 at 12:00
  • `sitor` returns a list... are your column in dt of type list? – Michele Mar 30 '14 at 12:04
  • @Dave Applying the `sitor` function to `temp` with temp[ ,:=(V13=sitor(V13),V14=sitor(V14),V15=sitor(V15)) ] is taking 61 minutes (it is actually still going). I'm not intending to sort temp, it just will be after the key is applied. `setkeyv` on`temp` takes ~90 seconds. – neurozen Mar 30 '14 at 12:24
  • @Michele Good point. The class of the columns I am applying `sitor` are `character` class. Are you suggesting that I shouldn't be doing the `lapply` in teh return value for the function? Tell me more... – neurozen Mar 30 '14 at 12:27
  • you can do either change it to `sapply` or leave it and wrap with `unlist` (the latter is faster) – Michele Mar 30 '14 at 12:32
  • @Michele Excellent point. I'll give that a go and see how it runs. – neurozen Mar 30 '14 at 12:42
  • I tried a subset of 1 million rows and timed it with `system.time` and got `user = 186.559 system = 0.067 elapsed = 186.804` So at this rate (assuming linear) it should take ~ 4hours 24 mins. yikes! – neurozen Mar 30 '14 at 13:07
  • As a note, you could move `strsplit`s and `substr`s outside of your `num` function, since the first "strsplits" a vector iteratively (returning a list), and the second "substrs" a list iteratively. Then, you'd only have to `lapply` a modified `num` function which -basically- contains a `paste`. Depending on your output, you could `unlist` and use `ifelse` outside of `num`, too. – alexis_laz Mar 30 '14 at 13:44
  • @Michele Applying your `si2f` across 3 columns of the data table for 83.7 million rows took 268.888 seconds! – neurozen Mar 31 '14 at 01:52

2 Answers2

2

Why don't you try sitools library?

library(data.table)
dt<-data.table(var = sample(x=1:1e5, size=1e6, replace=T))
library(sitools)
> system.time(dt[, var2 := f2si(var)])
   user  system elapsed 
  10.08    0.09   10.89

EDIT: this is a data.table based function that reverse f2si from sitools package:

si2f<-function(x){
  if(is.numeric(x)) return(x)
  require(data.table)
  dt<-data.table(lab=c("y","z","a","f","p","n","µ","m","c","d","", "da", "h", "k","M","G","T","P","E","Z","Y"),
                 mul=c(1e-24, 1e-21, 1e-18, 1e-15, 1e-12, 1e-9, 1e-6, 1e-3, 1e-2, 1e-1, 1L, 10L, 1e2, 1e3, 1e6, 1e9, 1e12, 1e15, 1e18, 1e21, 1e24),
                 key="lab")
  res<-as.numeric(gsub("[^0-9|\\.]","", x))
  x<-gsub("[0-9]|\\s+|\\.","", x)
  .subset2(dt[.(x)], "mul")*res
}

> system.time(dt[, var3 := si2f(var2)])
   user  system elapsed 
  13.18    0.03   13.31 

> dt[, all.equal(var,var3)]
[1] TRUE
Michele
  • 8,563
  • 6
  • 45
  • 72
  • Wow that's awesome. I'm going to give it a go! – neurozen Mar 30 '14 at 13:36
  • I couldn't install the scitools package for my R version :` ‘scitools’ is not available (for R version 3.0.3)` – neurozen Mar 30 '14 at 13:40
  • @neurozen it's `sitools` – Michele Mar 30 '14 at 13:41
  • @Michelle Thanks heaps. I've been up too long. – neurozen Mar 30 '14 at 13:47
  • The `f2sci` function is great but I actually want to do the reverse. As my sample data above, it has character units like K and M for kilo (1e3) and M for (1e6) in it and I want it as a number (with exponent or otherwise) – neurozen Mar 30 '14 at 13:57
  • @neurozen oh I see. I saw that you applied your function to V13, V14 and V15 which seem to be just numbers so I thought you wanted this feature – Michele Mar 30 '14 at 14:06
  • @Michelle I like it! I am going to give it a go now and I will let you know the result. – neurozen Mar 30 '14 at 23:11
  • @neurozen please mind I didn't use the capitol `K` but the small – Michele Mar 30 '14 at 23:13
  • Applying your si2f across 3 columns of the data table for 83.7 million rows took 268.888 seconds! – neurozen Mar 31 '14 at 01:58
  • Michele I'll have to give you the tick for the answer as you come up with the key lookup concept first. @Data Munger I wish I could give give you a tick as well! – neurozen Mar 31 '14 at 02:05
1

Here is an approach that takes about 10 seconds on my computer to covert a vector with 10M values. You can extent it to cover more than "K", "M" & "G"

> f_conv <- function(val){
+     # create matrix indexed by name for exponent
+     key <- c(Zero = ""
+          , K = "E3"
+          , M = "E6"
+          , G = "E9"
+          )
+     # extract where the original exponent is 
+     indx <- regexpr("[KMG]", val)
+     # extract the exponent
+     exp <- substring(val, indx)
+     # if there was none, the use "Zero"
+     exp[indx == -1L] <- "Zero"
+     # put fake length
+     indx[indx == -1L] <- 20L
+     # do the conversion
+     as.numeric(paste0(substring(val, 1L, indx - 1L)
+                  , key[exp]
+                  )
+              )
+ }
> 
> # test data
> n <- 10000000
> result <- paste0(sample(1:999, n, TRUE)
+             , sample(c("K", "M", "G", ""), n, TRUE)
+             )
> 
> system.time(x <- f_conv(result))
   user  system elapsed 
   8.48    0.13    8.63 
> cbind(result[1:50], x[1:50])
      [,1]   [,2]          
 [1,] "562K" "562000"      
 [2,] "946"  "946"         
 [3,] "313G" "313000000000"
 [4,] "538M" "538000000"   
 [5,] "697K" "697000"      
 [6,] "486G" "486000000000"
 [7,] "814G" "814000000000"
 [8,] "842"  "842"         
 [9,] "993M" "993000000"   
[10,] "440K" "440000"      
[11,] "435G" "435000000000"
[12,] "407M" "407000000"   
[13,] "919K" "919000"      
[14,] "840"  "840"         
[15,] "766G" "766000000000"
[16,] "977"  "977"         
[17,] "139"  "139"         
[18,] "195G" "195000000000"
[19,] "609M" "609000000"   
[20,] "69"   "69"          
[21,] "147M" "147000000"   
[22,] "104M" "104000000"   
[23,] "509K" "509000"      
[24,] "951M" "951000000"   
[25,] "278"  "278"         
[26,] "797G" "797000000000"
[27,] "106K" "106000"      
[28,] "667K" "667000"      
[29,] "521K" "521000"      
[30,] "9"    "9"           
[31,] "17K"  "17000"       
[32,] "673M" "673000000"   
  • I will give it a go. @Michelle @Data Munger FYI I also had a thought to use `grep` to select the rows which had metric units in them only temp[grep("[KMG]$",V1), V1:=sitor(V1)]). The catch was you have to return character then afterwards run as.numeric across the column again. – neurozen Mar 30 '14 at 23:08
  • I gave it a go and it works just fine. It took 237.6 seconds to run against 3 columns of a data table with 83.7 million rows. Thank you! – neurozen Mar 31 '14 at 02:01