1

Taking my first question as starting point: Split data frame into multiple data frames based on information in a xts object

Now I have a new problem.

Suppose you have double entries in your data.frames d1 & d2 - i.e. you have in d1 the letter "h" in grp B & A and therefore also two time series for "a" in d2. How can we solve this?

d1 <- data.frame(grp=sample(LETTERS[1:4], 11, replace=TRUE),
                 name=letters[c(8,1:10)])
> d1
grp name
1    B    h
2    D    a
3    B    b
4    D    c
5    B    d
6    C    e
7    A    f
8    A    g
9    A    h
10   B    i
11   C    j

d2 <- matrix(round(runif(55), 2), ncol=11)
colnames(d2) <- letters[c(8,1:10)]
library(xts)
d2 <- xts(d2, seq.Date(as.Date('01-01-2011', '%d-%m-%Y'), 
                       as.Date('5-01-2011', '%d-%m-%Y'), 1))

> d2
              h    a    b    c    d    e    f    g    h    i    j
2011-01-01 0.04 0.77 0.49 0.87 0.23 0.95 0.69 0.35 0.14 0.47 0.25
2011-01-02 0.73 0.46 0.28 0.86 0.75 0.08 0.00 0.89 0.50 0.12 0.54
2011-01-03 0.36 0.61 0.92 0.80 0.12 0.25 0.18 0.44 0.73 0.19 0.30
2011-01-04 0.18 0.65 0.68 0.44 0.54 0.84 0.13 0.64 0.54 0.81 0.73
2011-01-05 0.58 0.55 0.10 0.33 0.55 0.23 0.82 0.21 0.58 0.24 0.04

This does not work:

out <- setNames(sapply(unique(d1$grp), function(x) {
  d2[, which(d1$grp[match(colnames(d2), d1$name)] == x)]
}), unique(d1$grp))

out

$B
              h    b    d    h    i
2011-01-01 0.04 0.49 0.23 0.14 0.47
2011-01-02 0.73 0.28 0.75 0.50 0.12
2011-01-03 0.36 0.92 0.12 0.73 0.19
2011-01-04 0.18 0.68 0.54 0.54 0.81
2011-01-05 0.58 0.10 0.55 0.58 0.24

$D
              a    c
2011-01-01 0.77 0.87
2011-01-02 0.46 0.86
2011-01-03 0.61 0.80
2011-01-04 0.65 0.44
2011-01-05 0.55 0.33

$C
              e    j
2011-01-01 0.95 0.25
2011-01-02 0.08 0.54
2011-01-03 0.25 0.30
2011-01-04 0.84 0.73
2011-01-05 0.23 0.04

$A
              f    g
2011-01-01 0.69 0.35
2011-01-02 0.00 0.89
2011-01-03 0.18 0.44
2011-01-04 0.13 0.64
2011-01-05 0.82 0.21

Expected output should be:

out

$B
              h    b    d    i
2011-01-01 0.04 0.49 0.23 0.47
2011-01-02 0.73 0.28 0.75 0.12
2011-01-03 0.36 0.92 0.12 0.19
2011-01-04 0.18 0.68 0.54 0.81
2011-01-05 0.58 0.10 0.55 0.24

$D
              a    c
2011-01-01 0.77 0.87
2011-01-02 0.46 0.86
2011-01-03 0.61 0.80
2011-01-04 0.65 0.44
2011-01-05 0.55 0.33

$C
              e    j
2011-01-01 0.95 0.25
2011-01-02 0.08 0.54
2011-01-03 0.25 0.30
2011-01-04 0.84 0.73
2011-01-05 0.23 0.04

$A
              f    g    h
2011-01-01 0.69 0.35 0.14
2011-01-02 0.00 0.89 0.50
2011-01-03 0.18 0.44 0.73
2011-01-04 0.13 0.64 0.54
2011-01-05 0.82 0.21 0.58

Help is much appreciated!

Thank you in advance...

Community
  • 1
  • 1
Daniel_D
  • 698
  • 1
  • 7
  • 13
  • Hi akrun, I have updated my question showing the expected output. Look at group $A. Here you can see the second time series h. Thanks for your time! – Daniel_D Nov 12 '14 at 14:50
  • Does it have to be in a particular order ie. the list elements? – akrun Nov 12 '14 at 14:51
  • Hi akrun, unfortunateyl the order is of importance because suppose you have different weights for h. Let's say in grp B you have a weight of 0.3 and in grp A 0.5. Thanks – Daniel_D Nov 12 '14 at 14:54
  • If I understand correctly, the first `h` column should go into `B` based on the order in which they appear in `d1` and the second to `A` – akrun Nov 12 '14 at 14:59
  • Please check if the update works. i.e. `colnames(x1) <- gsub("\\d+$", '', colnames(x1))` – akrun Nov 12 '14 at 16:07
  • See my comment below. Thank you very much! – Daniel_D Nov 12 '14 at 16:15
  • I created a delimiter during the `paste` step. Now, it will be easier to remove after that delimiter as shown in the update. – akrun Nov 12 '14 at 16:32

1 Answers1

0

You could do:

  name1 <- with(d1,paste0(name,"#", ave(name, name,
                                    FUN=seq_along)))
  colnames(d2) <- name1 #assuming that it is in the same order as in the example

  res <- lapply(split(name1, d1$grp), function(x) {
                 x1 <- d2[, as.character(x), drop=FALSE]
                 colnames(x1) <- gsub("\\#.*$", '', colnames(x1))
                 x1})

  res1 <- res[match(unique(d1$grp), names(res))]

  res1
  # $B
  #            h    b    d    i
  #2011-01-01 0.04 0.49 0.23 0.47
  #2011-01-02 0.73 0.28 0.75 0.12
  #2011-01-03 0.36 0.92 0.12 0.19
  #2011-01-04 0.18 0.68 0.54 0.81
  #2011-01-05 0.58 0.10 0.55 0.24

  # $D
  #            a    c
  #2011-01-01 0.77 0.87
  #2011-01-02 0.46 0.86
  #2011-01-03 0.61 0.80
  #2011-01-04 0.65 0.44
  #2011-01-05 0.55 0.33

  # $C
  #            e    j
  #2011-01-01 0.95 0.25
  #2011-01-02 0.08 0.54
  #2011-01-03 0.25 0.30
  #2011-01-04 0.84 0.73
  #2011-01-05 0.23 0.04

  # $A
  #            f    g    h
  #2011-01-01 0.69 0.35 0.14
  #2011-01-02 0.00 0.89 0.50
  #2011-01-03 0.18 0.44 0.73
  #2011-01-04 0.13 0.64 0.54
  #2011-01-05 0.82 0.21 0.58

data

  d1  <- structure(list(grp = c("B", "D", "B", "D", "B", "C", "A", "A", 
  "A", "B", "C"), name = c("h", "a", "b", "c", "d", "e", "f", "g", 
  "h", "i", "j")), .Names = c("grp", "name"), class = "data.frame", row.names = 
  c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"))


  d2 <- structure(c(0.04, 0.73, 0.36, 0.18, 0.58, 0.77, 0.46, 0.61, 0.65, 
  0.55, 0.49, 0.28, 0.92, 0.68, 0.1, 0.87, 0.86, 0.8, 0.44, 0.33, 
  0.23, 0.75, 0.12, 0.54, 0.55, 0.95, 0.08, 0.25, 0.84, 0.23, 0.69, 
  0, 0.18, 0.13, 0.82, 0.35, 0.89, 0.44, 0.64, 0.21, 0.14, 0.5, 
  0.73, 0.54, 0.58, 0.47, 0.12, 0.19, 0.81, 0.24, 0.25, 0.54, 0.3, 
  0.73, 0.04), .Dim = c(5L, 11L), .Dimnames = list(NULL, c("h", 
  "a", "b", "c", "d", "e", "f", "g", "h", "i", "j")), index = 
  structure(c(1293840000, 1293926400, 1294012800, 1294099200, 1294185600), 
  tzone = "UTC", tclass = "Date"), .indexCLASS = "Date", tclass = "Date",
   .indexTZ = "UTC", tzone = "UTC", class = c("xts", "zoo"))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi akrun, in your example you can see that the time series for h in $A and $B is the same. A closer look into d2 gives you different time series for h. Thanks – Daniel_D Nov 12 '14 at 14:56
  • @Daniel_D Thanks, I didn't check it. – akrun Nov 12 '14 at 14:58
  • Thank you very much!!! It works fine! Just one drawback. Your function deletes all numbers in the colnames. My real data contains security names i.e bonds with maturity in the name. Is it possible to delete only the last number? The idea is great of the sequence at the end. Thank you! – Daniel_D Nov 12 '14 at 16:02
  • We are close. Suppose i have the security name: 3MO EURO EURIBOR Dec16. With the sequence you get 3MO EURO EURIBOR Dec161. After the function the name is 3MO EURO EURIBOR Dec – Daniel_D Nov 12 '14 at 16:13
  • If its to complicated for you. I am fine with that. Your are a genius anywhay!! – Daniel_D Nov 12 '14 at 16:14
  • @Daniel_D Do you have any other patterns in the column names so that we can sort it out all at once? Suppose, you have `Jan6`, is it `Jan06` or `Jan6`? – akrun Nov 12 '14 at 16:19
  • It is 3MO EURO EURIBOR Dec16 or SPABOL 2 3/4 02/01/19. Thanks for your patience – Daniel_D Nov 12 '14 at 16:21
  • @Daniel_D Do you know how many groups we are adding? For example in the current example, at the max it is 2 for `h`. Will it exceed `9`? – akrun Nov 12 '14 at 16:22
  • @Daniel_D Suppose `v1 <- c('3MO EURO EURIBOR Dec161', '3MO EURO EURIBOR Jan061'); gsub('\\d$','', v1) #[1] "3MO EURO EURIBOR Dec16" "3MO EURO EURIBOR Jan06"` – akrun Nov 12 '14 at 16:26
  • This is possible. Just to let you know. My database is a portfolio with securities. And I need to divide my portfolio in different strategies (Cluster). The number of strategies is not limited. It can be 3, 5 or 20 and so on. And you can have the same security in several Clusters. Hope this is comprehensible – Daniel_D Nov 12 '14 at 16:27
  • @Daniel_D We can do another thing which could be more robust. Please check my update soon. – akrun Nov 12 '14 at 16:30
  • It works fine now. Thank you very much! Do you see any problem if it exceeds 9? For the different groups I mean. – Daniel_D Nov 12 '14 at 16:31
  • @Daniel_D I was just asking because if it is something within `1-9`, we could use `gsub('.$','', colnames(x1))`. Anyway, I created a new delimiter, which would resolve the problem even if you have more groups. – akrun Nov 12 '14 at 16:33
  • You are my hero! Thank you very much! Everything works fine and the names are ok too! Much appreciated! Have a nice day... – Daniel_D Nov 12 '14 at 16:36