24

I have a data frame with a numerical ID variable that identifies the Primary, Secondary and Ultimate Sampling Units from a multistage sampling scheme. I want to split the original ID variable into three new variables, each identifying one of the sampling units:

Example:

>df[1:2,]
ID Var   var1   var2   var3   var4    var5
501901      9   SP.1      1      W   12.10
501901      9   SP.1      2      W   17.68

What I want:

>df[1:2,]
ID1    ID2     ID3   var1   var2  var3     var4    var5  
5      01      901    9    SP.1    1        W     12.10    
5      01      901    9    SP.1    2        W     17.68  

I know there are functions available in R to split character strings, but I could not find similar facilities for numbers.

Thank you,

Juan

jrs-x
  • Why don't you try converting your ID to a string with `as.character()`, then use `strsplit()`, and then convert back to numbers with `as.numeric()`? – user974514 Mar 19 '13 at 11:35
  • Yes, It could be that way, but I was looking for a function which could be directly used on numeric. Thank you anyway. – jrs-x Mar 19 '13 at 14:20

7 Answers

21

You could, for example, use substring:

df <- data.frame(ID = c(501901, 501902))

splitted <- t(sapply(df$ID, function(x) substring(x, first=c(1,2,4), last=c(1,3,6))))
cbind(df, splitted)
#      ID 1  2   3
#1 501901 5 01 901
#2 501902 5 01 902
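
Since substr() is vectorized, the same split can probably be written without sapply(). The following is only a sketch: the column names ID1, ID2 and ID3 are borrowed from the question, and it assumes every ID is a whole number of at most six digits. sprintf() is used instead of as.character() because as.character() can switch to scientific notation for round values such as 500000.

```r
df <- data.frame(ID = c(501901, 501902))

# Fixed-width text version of the ID, zero-padded to six characters
# (assumption: IDs are whole numbers with at most six digits)
id <- sprintf("%06.0f", df$ID)

df$ID1 <- substr(id, 1, 1)  # primary sampling unit
df$ID2 <- substr(id, 2, 3)  # secondary sampling unit, leading zero kept
df$ID3 <- substr(id, 4, 6)  # ultimate sampling unit

df
```

The new columns are character vectors, so "01" keeps its leading zero; wrap them in as.numeric() if numbers are needed.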
EDi
13

Yet another alternative is to re-read the first column using read.fwf and specify the widths:

cbind(read.fwf(file = textConnection(as.character(df[, 1])), 
               widths = c(1, 2, 3), colClasses = "character", 
               col.names = c("ID1", "ID2", "ID3")), 
      df[-1])
#   ID1 ID2 ID3 var1 var2 var3 var4  var5
# 1   5  01 901    9 SP.1    1    W 12.10
# 2   5  01 901    9 SP.1    2    W 17.68

One advantage here is being able to set the resulting column names in a convenient manner, and ensure that the columns are characters, thus retaining any leading zeroes that might be present.

A5C1D2H2I1M1N2O1R2T1
5

This should work:

df <- cbind(
  do.call(rbind, strsplit(gsub('(.)(..)(...)', '\\1 \\2 \\3', paste(df[, 1])), ' ')),
  df[, -1]
)
# paste() converts the numeric ID to text before gsub() inserts the separators.

Or with substr()

df <- cbind(ID1=substr(df[, 1],1,1), ID2=substr(df[, 1],2,3), ID3=substr(df[, 1],4,6), df[, -1])
Rcoster
4

Since they are numbers, you will have to do some math to extract the digits you want. A number represented in radix-10 can be written as:

d0*10^0 + d1*10^1 + d2*10^2 ... etc. where d0..dn are the digits of the number.

Suppose you want to extract the most significant digit from a 6-digit number, which is represented mathematically as:

number = d5*10^5 + d4*10^4 + d3*10^3 + d2*10^2 + d1*10^1 + d0*10^0

As you can see, dividing this number by 10^5 will get you:

number / 10^5 = d5*10^0 + d4*10^(-1) + d3*10^(-2) + d2*10^(-3) + d1*10^(-4) + d0*10^(-5)

Voila! If you take the integer part of the result, you have extracted the most significant digit, because all the other digits now have a weight smaller than 1 (a negative power of 10) and together contribute less than 1. You can do similar things to extract the other digits. For digits in the least significant positions you can use the modulo operation instead of division.

Examples:

501901 / 10^5 = 5 // first digit (/ is integer division here)
501901 % 10^1 = 1 // last digit
(501901 / 10^4) % 10^1 = 0 // second digit
(501901 / 10^2) % 10^2 = 19 // third and fourth digit
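
In R, integer division is written %/% and modulo is written %%, so the examples above might be translated as follows. This is a minimal sketch; the variable names are made up for illustration.

```r
id <- 501901

first_digit  <- id %/% 10^5             # integer division drops the lower five digits
last_digit   <- id %% 10^1              # remainder after dividing by 10
second_digit <- (id %/% 10^4) %% 10^1   # keep the top two digits, then take the last of them
third_fourth <- (id %/% 10^2) %% 10^2   # keep the top four digits, then take the last two

c(first_digit, second_digit, third_fourth, last_digit)
# 5 0 19 1
```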
fvannee
4

Several neat answers were posted years ago, but a solution I find useful, based on the outer function, has not been mentioned. In this age of search engines, I am adding it here in case others find it handy.

I was faced with a slightly simpler problem: turning a column of 6-digit numbers into 6 columns representing each digit. This can be solved using a combination of outer, integer division (%/%) and modulo (%%).

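 # floor() makes the random IDs whole numbers, so the printed ID matches its digits
 DF <- data.frame("ID" = floor(runif(3)*10^6), "a" = sample(letters, 3, replace = TRUE))
 DF <- cbind(DF, "ID" = outer(DF$ID, 10^c(5:0), function(a, b) a %/% b %% 10))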
 DF
#       ID a ID.1 ID.2 ID.3 ID.4 ID.5 ID.6
# 1 814895 z    8    1    4    8    9    5
# 2 417209 q    4    1    7    2    0    9
# 3 545797 c    5    4    5    7    9    7

The question asked here is slightly more complex, requiring different values for both integer division and modulo.

 # floor() makes the random IDs whole numbers, so the printed ID matches its parts
 DF <- data.frame("ID" = floor(runif(3)*10^6), "a" = sample(letters, 3, replace = TRUE))
 DF <- cbind(DF, "ID" = outer(DF$ID, c(1:3), function(a,b) a %/% 10^c(5,3,0)[b] %% 10^b))
 DF
#       ID a ID.1 ID.2 ID.3
# 1 809372 q    8    9  372
# 2 954790 g    9   54  790
# 3 166970 l    1   66  970
3

If you don't want to convert to character for some reason, the following is one way to achieve what you want:

DF <- data.frame(ID = c(501901, 501902), var1 = c("a", "b"), var2 = c("c", "d"))

result <- t(sapply(DF$ID, function(y) {
    c(y%/%1e+05, (y - y%/%1e+05 * 1e+05)%/%1000, y - y%/%1000 * 1000)
}))


DF <- cbind(result, DF[, -1])

names(DF)[1:3] <- c("ID1", "ID2", "ID3")

DF
##   ID1 ID2 ID3 var1 var2
## 1   5   1 901    a    c
## 2   5   1 902    b    d
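
Because %/% and %% are vectorized, the same arithmetic can likely be applied to the whole column at once, without sapply(). This is a sketch only; the ids data frame and the ID2_chr column are names invented here. It also shows that the numeric result loses the leading zero of the secondary unit (1 rather than 01), which sprintf() can restore as text if needed.

```r
DF <- data.frame(ID = c(501901, 501902), var1 = c("a", "b"), var2 = c("c", "d"))

ids <- data.frame(
  ID1 = DF$ID %/% 1e5,           # first digit
  ID2 = (DF$ID %/% 1e3) %% 100,  # digits 2-3; the leading zero is lost
  ID3 = DF$ID %% 1e3             # last three digits
)

# Optional: zero-padded text version of ID2 (hypothetical extra column)
ids$ID2_chr <- sprintf("%02.0f", ids$ID2)

cbind(ids, DF[, -1])
```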
CHP
  • ok, Got it. Today I have to say that I learnt all what I need in the present and the next future on how to split a number!! ;-) – jrs-x Mar 19 '13 at 14:52
2

With so many answers it felt like I needed to come up with something :)

library(qdap)
x <- colSplit(dat$ID_Var, col.sep="")
data.frame(ID1=x[, 1], ID2=paste2(x[, 2:3], sep=""), 
    ID3=paste2(x[, 4:6],sep=""), dat[, -1])

##   ID1 ID2 ID3 var1 var2 var3 var4  var5
## 1   5  01 901    9 SP.1    1    W 12.10
## 2   5  01 901    9 SP.1    2    W 17.68
Tyler Rinker