
I have a csv/log file of 35=S (Quote) messages ("Tag=Value") and I need to extract the rates into a proper CSV file for data mining. This is not strictly FIX related; it's more of an R question on how to clean a dataset.

The raw messages look something like this:

190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,

First I need to get to an intermediate data set that looks like this, where the same tags are aligned:

190=1.1204 ,191=-0.000029,193=20141008,537=0,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,            ,537=0,631=7.2034485 ,            ,10=140
190=1.26237,191=0        ,            ,537=1,              ,            ,10=068

which in turn will need to be converted to this:

190    ,191      ,193     ,537,631       ,642     ,10
1.1204 ,-0.000029,20141008,0  ,1.12029575,0.000145,56
7.20425,0.000141 ,        ,0  ,7.2034485 ,        ,140
1.26237,0        ,        ,1  ,          ,        ,068

I'm in the midst of developing a bash script with awk, but I wonder if I can do this in R. At present, my greatest challenge is arriving at the intermediate table. From the intermediate to the final table, I thought of using R with the tidyr package, specifically the separate function. If anybody can suggest a better approach, I'll greatly appreciate it!

Henrik
fabiog1901
  • Can you clarify how extensive these records are? How many columns and rows of CSV data? – vpipkt Oct 28 '14 at 14:06
  • 10000 rows, and columns might be over 20, depending on the type of deal. The above is just a snippet to get the point across – fabiog1901 Oct 28 '14 at 14:08

2 Answers


EDITED. Full solution using base R functions only:

dat <- scan(sep=",", what="character", text="190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,")

dat <- gsub(" ", "", dat)
dat <- dat[dat != ""]

x <- as.data.frame(
  matrix(
    unlist(
      sapply(dat, strsplit, split = "=", USE.NAMES=FALSE)
    ),
    ncol=2, byrow=TRUE
  )
)

z <- unstack(x, V2 ~ V1)

The resulting object is a named list that is close to what you wanted. You will have to do some additional work to convert this into a matrix, if required.

$`10`
[1] "56"  "140" "068"

$`190`
[1] "1.1204"  "7.20425" "1.26237"

$`191`
[1] "-0.000029" "0.000141"  "0"   

....
etc.     

From here, you simply have to pad the list with the appropriate number of NA values:

maxLength <- max(sapply(z, length))
sapply(z, function(x)c(as.numeric(x), rep(NA, maxLength - length(x))))

Gives:

      10     190       191      193 537      631      642
[1,]  56 1.12040 -0.000029 20141008   0 1.120296 0.000145
[2,] 140 7.20425  0.000141       NA   0 7.203449       NA
[3,]  68 1.26237  0.000000       NA   1       NA       NA
Andrie
  • thanks. I'm not sure this works yet, I'd expect to see 3 values for each tag, say, $`642` [1] "0.000145", "", "" – fabiog1901 Oct 28 '14 at 14:51
  • Correct, that's why I say you'll have to do some additional work to get it into a matrix, mainly by adding NAs as appropriate. – Andrie Oct 28 '14 at 14:57
  • sadly, that crashes the system. There are 10000 rows and about 20 fields, and R crashes. I might have to go the bash/awk route. – fabiog1901 Oct 28 '14 at 19:20
  • @fabiog might as well use quickfix if you're open to using other software. – GSee Oct 28 '14 at 19:31
  • @GSee yeah all that log is created by QF/J, but I don't have access to the app itself, only to the logs. I wish I had, that would be problem solved! – fabiog1901 Oct 28 '14 at 19:55
  • @Andrie, Nice `base` answer (+1)! I was just wondering if you need `sapply`? `identical(sapply(dat, strsplit, split = "=", USE.NAMES=FALSE), strsplit(dat, split = "="))` – Henrik Oct 29 '14 at 08:27
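As Henrik's last comment points out, `strsplit` is already vectorized over its first argument, so the `sapply` wrapper can be dropped. A minimal sketch of the simplified split step, on a few values from the sample data:

```r
# strsplit is vectorized, so no sapply is needed
dat <- c("190=1.1204", "191=-0.000029", "537=0")

x <- as.data.frame(
  matrix(unlist(strsplit(dat, split = "=")), ncol = 2, byrow = TRUE)
)
x
#    V1        V2
# 1 190    1.1204
# 2 191 -0.000029
# 3 537         0
```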

Another possibility. Start with the same scan as @Andrie, but also use the arguments strip.white and na.strings:

x <- scan(text = "190=1.1204 ,191=-0.000029,193=20141008,537=0        ,631=1.12029575,642=0.000145,10=56
190=7.20425,191=0.000141 ,537=0       ,631=7.2034485,10=140        ,            ,
190=1.26237,191=0        ,537=1       ,10=068       ,              ,            ,",
           sep = ",",
           what = "character", 
           strip.white = TRUE,
           na.strings = "")

# remove NA
x <- x[!is.na(x)]

Then use colsplit and dcast from the reshape2 package:

library(reshape2)

# split 'x' into two columns
d1 <- colsplit(string = x, pattern = "=", names = c("x", "y")) 

# create an id variable, needed in dcast
d1$id <- ave(d1$x, d1$x, FUN = seq_along)   

# reshape from long to wide
d2 <- dcast(data = d1, id ~ x, value.var = "y")

#   id  10     190       191      193 537      631      642
# 1  1  56 1.12040 -0.000029 20141008   0 1.120296 0.000145
# 2  2 140 7.20425  0.000141       NA   0 7.203449       NA
# 3  3  68 1.26237  0.000000       NA   1       NA       NA

Because you mentioned tidyr:

library(tidyr)
d1 <- separate(data = data.frame(x), col = x, into = c("x", "y"), sep = "=")
d1$id <- ave(d1$x, d1$x, FUN = seq_along)
spread(data = d1, key = x, value = y)
#   id  10     190       191      193 537        631      642
# 1  1  56  1.1204 -0.000029 20141008   0 1.12029575 0.000145
# 2  2 140 7.20425  0.000141     <NA>   0  7.2034485     <NA>
# 3  3 068 1.26237         0     <NA>   1       <NA>     <NA>

This retains the values as character. If you want numeric, you can set convert = TRUE in spread.
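A minimal sketch of the convert = TRUE behaviour, on a reduced version of the data above (only tags 190 and 537, to keep it short):

```r
library(tidyr)

# two rows of two tags each, in long "tag/value" form
d1 <- data.frame(x = c("190", "537", "190", "537"),
                 y = c("1.1204", "0", "7.20425", "0"),
                 stringsAsFactors = FALSE)
d1$id <- ave(d1$x, d1$x, FUN = seq_along)

# convert = TRUE runs type.convert on the value columns
d2 <- spread(data = d1, key = x, value = y, convert = TRUE)
str(d2)  # the '190' and '537' columns are now numeric, not character
```

Note that numeric conversion would turn the checksum "068" into 68, so if the leading zero matters, keep that column as character.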

Henrik
  • Wow, that looks interesting, will test and revert! I knew R guys had seen this question before, thanks Henrik! – fabiog1901 Oct 28 '14 at 14:55
  • @Henrik Could I ask a favour? Would you explain how `ave(d1$x, d1$x, FUN = seq_along)` creates the grouping variable? I would like to know how this line works. – jazzurro Oct 28 '14 at 15:20
  • @jazzurro, From `?ave`: "Subsets of x[] are averaged, where each subset consist of those observations with the same factor levels.", with the default `FUN`, which is `mean`. In this case it could be described as "apply a function `FUN`, here `FUN = seq_along`, on subsets of `d1$x`, where each subset consists of those observations with the same factor levels, as defined by the grouping variable `d1$x`". `ave` indexes the result according to the order of the original vector. – Henrik Oct 28 '14 at 16:06
  • @Henrik Thank you very much for the explanation. I shuffled `d1` and checked the behaviour of the `ave` line. Now I see your point. Since the order of id sequence was perfect (seven 1s, five 2s, and four 3s) I could not see what was going on; I thought the ave line was assigning something like a group variable. Silly of me. Thank you for taking your time. – jazzurro Oct 28 '14 at 16:24
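Henrik's explanation of the `ave` call in the comments above can be seen on a tiny example; a minimal sketch:

```r
x <- c("190", "191", "190", "537", "190")

# seq_along is applied within each group of equal values,
# and the results are put back in the original positions,
# giving a per-tag occurrence counter
ave(x, x, FUN = seq_along)
# [1] "1" "1" "2" "1" "3"
```

The result is character because `x` is character; `ave` coerces the per-group integers back into the type of the input vector.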