Split intervals (genomic regions) in individual numbers (nucleotides)

Question

I would like to transform my data frame df, which is based on regions, into point-by-point (number-by-number, or nucleotide-by-nucleotide) information.

My input df:

start  end  state  freq
 100   103   1nT    22
 100   103   3nT    34
 104   106   1nT    12
 104   106   3nT    16

My expected output:

position state freq
  100     1nT   22
  101     1nT   22
  102     1nT   22
  103     1nT   22
  100     3nT   34
  101     3nT   34
  102     3nT   34
  103     3nT   34
  104     1nT   12
  105     1nT   12
  106     1nT   12
  104     3nT   16
  105     3nT   16
  106     3nT   16

Any ideas? Thank you very much.

Not sure I understand the pattern here. Where's position `102`? — David Arenburg, Aug 20 '14 at 16:29

score 2 · Accepted Answer · answered Aug 20 '14 at 16:54

Here is a vectorized approach:

# load your data
df <- read.table(textConnection("start  end  state  freq
 100   103   1nT    22
 100   103   3nT    34
 104   106   1nT    12
 104   106   3nT    16"), header=TRUE)

# extract number of needed replications
n <- df$end - df$start + 1

# calculate position and replicate state/freq
res <- data.frame(position = rep(df$start - 1, n) + sequence(n),
                  state = rep(df$state, n),
                  freq = rep(df$freq, n))
res
#    position state freq
# 1       100   1nT   22
# 2       101   1nT   22
# 3       102   1nT   22
# 4       103   1nT   22
# 5       100   3nT   34
# 6       101   3nT   34
# 7       102   3nT   34
# 8       103   3nT   34
# 9       104   1nT   12
# 10      105   1nT   12
# 11      106   1nT   12
# 12      104   3nT   16
# 13      105   3nT   16
# 14      106   3nT   16
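
To make the `rep`/`sequence` arithmetic easier to follow, here is a minimal sketch that prints the intermediate vectors for two of the intervals from the question. The names `offsets` and `base` are only illustrative, and the sketch assumes every `end` is at least as large as its `start`.

```r
# Two toy intervals taken from the question: 100-103 and 104-106
start <- c(100, 104)
end   <- c(103, 106)

n <- end - start + 1            # interval lengths: 4 3

offsets <- sequence(n)          # 1 2 3 4 1 2 3  (restarts at 1 for each interval)
base    <- rep(start - 1, n)    # 99 99 99 99 103 103 103

position <- base + offsets      # 100 101 102 103 104 105 106
print(position)
```

Adding the restarting counter to the repeated start offset is what yields one position per nucleotide without an explicit loop.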

`sequence` is just an `lapply` wrapper. – David Arenburg, Aug 20 '14 at 16:56

Mike.Gahan · Answer 2 · 2014-08-20T16:45:36.060

Here is one approach.

Build your data

require(data.table)
fakedata <- data.table(start=c(100,100,104,104),
                       end=c(103,103,106,106),
                       state=c("1nT","3nT","1nT","3nT"),
                       freq=c(22,34,12,16))

Perform calculation

fakedata[ , dur := (end-start+1)]
outdata <- fakedata[ , lapply(.SD,function(x) rep(x,dur))]
outdata[ , position := (start-1)+1:.N, by=list(start,end,state)]

And the output

    start end state freq dur position
 1:   100 103   1nT   22   4      100
 2:   100 103   1nT   22   4      101
 3:   100 103   1nT   22   4      102
 4:   100 103   1nT   22   4      103
 5:   100 103   3nT   34   4      100
 6:   100 103   3nT   34   4      101
 7:   100 103   3nT   34   4      102
 8:   100 103   3nT   34   4      103
 9:   104 106   1nT   12   3      104
10:   104 106   1nT   12   3      105
11:   104 106   1nT   12   3      106
12:   104 106   3nT   16   3      104
13:   104 106   3nT   16   3      105
14:   104 106   3nT   16   3      106

score 1 · Answer 3 · answered Aug 20 '14 at 16:50

This can be accomplished with a simple apply command.

Let's build this in sequence:

You want to perform an operation on every row, so applying a function by row (or a for loop) should be your first thought. So we know we want to use apply(data, 1, row.function), where data is your data frame.
Think about what you would want to do for a single row. You want to repeat state and freq for every number between start and end. To get the range of numbers between start and end we can use the colon operator, start:end. R will automatically recycle the shorter vectors to match the longest one when creating a data.frame. Note that apply converts the data frame to a matrix first, so each row arrives as a character vector; the colon operator coerces the strings back to numbers, but state and freq stay as text. So we can create the piece for a single row like this:
```
data.frame(position=(row['start']:row['end']), state=row['state'], freq=row['freq'])
```
Then we want to bind all the pieces together, so we use `do.call('rbind', result)`.

Putting this all together now, we have:

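# data is the input data frame (df in the question)
do.call('rbind',
  apply(data, 1, function(row) {
    # apply() passes each row as a character vector:
    # [[ ]] drops the element names, as.numeric() restores freq as a number
    data.frame(position=(row[['start']]:row[['end']]),
      state=row[['state']], freq=as.numeric(row[['freq']]))
  }))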

Which will give you what you want. Hopefully this helps teach you how to approach problems like this in the future too!
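
Because `apply` goes through a character matrix, one possible variation is to iterate over the columns in parallel with `Map`, which keeps each column's original type. This is only a sketch of that alternative, not the answer's method; `expand_row` is a hypothetical helper name, and the data frame is rebuilt from the question's values.

```r
# Input data from the question
df <- data.frame(start = c(100, 100, 104, 104),
                 end   = c(103, 103, 106, 106),
                 state = c("1nT", "3nT", "1nT", "3nT"),
                 freq  = c(22, 34, 12, 16),
                 stringsAsFactors = FALSE)

# A mixed-type data frame becomes a character matrix, which is what apply() sees
print(is.character(as.matrix(df)))

# expand_row (hypothetical helper): one output row per position in start:end;
# state and freq are recycled to the length of position
expand_row <- function(start, end, state, freq) {
  data.frame(position = start:end,
             state    = state,
             freq     = freq,
             stringsAsFactors = FALSE)
}

# Map() walks the four columns in parallel, so freq stays numeric
res <- do.call(rbind, Map(expand_row, df$start, df$end, df$state, df$freq))
str(res)
```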

score 0 · Answer 4 · answered Aug 20 '14 at 16:35

Here's a rough implementation using a for loop.

    a = t(matrix(c(100, 103,  "1nT" ,   22,
    100,   103 ,  "3nT" ,   34,
    104,   106 ,  "1nT" ,   12,
    104,   106 ,  "3nT" ,   16), nrow = 4))
    a = data.frame(a, stringsAsFactors = FALSE)

    colnames(a) = c("start",  "end" , "state",  "freq")
    a$start = as.numeric(as.character(a$start))
    a$end = as.numeric(as.character(a$end))
    a$freq = as.numeric(as.character(a$freq))

    n = dim(a)[1]
    res = NULL

    for (i in 1:n) {
      position = a$start[i]:a$end[i]
      state = rep(a$state[i], length(position))
      freq = rep(a$freq[i], length(position))
      temp = cbind.data.frame(position, state, freq)
      res = rbind(res, temp)
    }
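
Calling `rbind` inside the loop copies the growing result on every iteration, which can become slow when there are many intervals. A minimal sketch of the same loop that collects the pieces in a preallocated list and binds them once at the end might look like this; the name `pieces` is illustrative, and the input is built directly as a data frame rather than from a matrix.

```r
a <- data.frame(start = c(100, 100, 104, 104),
                end   = c(103, 103, 106, 106),
                state = c("1nT", "3nT", "1nT", "3nT"),
                freq  = c(22, 34, 12, 16),
                stringsAsFactors = FALSE)

# Preallocate one list slot per input row
pieces <- vector("list", nrow(a))

# seq_len() also behaves correctly when a has zero rows (1:n would not)
for (i in seq_len(nrow(a))) {
  position <- a$start[i]:a$end[i]
  pieces[[i]] <- data.frame(position = position,
                            state    = rep(a$state[i], length(position)),
                            freq     = rep(a$freq[i], length(position)),
                            stringsAsFactors = FALSE)
}

# Bind all pieces in a single call
res <- do.call(rbind, pieces)
print(res)
```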
