1

I have the following character string:

str(seqN)  
chr [1:704] "010000100100001010000100010001000100000100101000010001001000001001001000001000010010000100100100010000101000010"| __truncated__ ...

Yes, they are very long strings (704 strings of length 1000) composed of 0s and 1s. Each string is a sequence that has already been one-hot encoded.

Since I want to feed this to a convolutional model, I need a specific input shape, so I want to split each string into 250 groups of length 4 (to match the one-hot encoding).

The problem is that R doesn't let me split the strings, as if each string were unsplittable.

For example, if I run this code:

seqN2 <- array_reshape(seqN,c(704,250,4))

It gives me this error:

Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: cannot reshape array of size 704 into shape (704,250,4)

What should I do to get the shape I need, (704,250,4)?
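
A note on the error: it is consistent with `seqN` being a character vector in which each whole string counts as one element, so a reshape sees 704 values rather than 704 × 1000 characters. A minimal sketch with a made-up two-string stand-in (`seq_demo` is hypothetical, not the original data):

```r
# Hypothetical stand-in for seqN: 2 strings of 8 characters each
seq_demo <- c("01000010", "10000001")

n_elements <- length(seq_demo)      # number of strings: this is what a reshape sees
n_chars    <- sum(nchar(seq_demo))  # total number of 0/1 characters

print(n_elements)
print(n_chars)
```

The strings have to be broken into individual characters (or groups) before the element count can match the target shape.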

3 Answers

4

We can use strsplit from base R, splitting after every fourth character with a lookbehind:

lst1 <- strsplit(seqN, "(?<=.{4})", perl = TRUE)

The output is a list of character vectors. I'm not sure which conversion to numeric is wanted. Maybe

lst2 <- lapply(lst1, strtoi, base = 2)

Or, as the OP mentioned in the comments, just convert to integer:

lst2 <- lapply(lst1, as.integer)

If all list elements have the same length, the list can also be converted to a matrix by rbinding its elements:

out <- do.call(rbind, lst2)
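
The two conversions above give different numbers, which matters for what the model ends up seeing. A small sketch on made-up 4-character groups (`chunks` is illustrative, not taken from `seqN`):

```r
chunks <- c("0100", "0010", "1000")   # illustrative 4-character groups

as_binary  <- strtoi(chunks, base = 2)  # read each group as a binary number
as_decimal <- as.integer(chunks)        # read each group as a decimal number

print(as_binary)   # 4 2 8
print(as_decimal)  # 100 10 1000
```

Neither result keeps the four separate 0/1 channels; both collapse each group into a single number.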
akrun
  • That does the trick yes, unfortunately it works as a list of substrings and I'm struggling to convert those substrings to integers to feed the model. – Jaime M. Legaz Jul 05 '19 at 14:56
  • @JaimeMartínez It would be a `list` of vectors. If you can show what your intended output would be, it may be useful. `do.call(rbind, strsplit(seqN, "(?<=.{4})", perl = TRUE))` converts to a matrix – akrun Jul 05 '19 at 14:59
  • That works better, yes. With your tip, after reshaping, I've got a matrix of dimensions (704, 250). The desired output should be (704, 250, 4), so I'll try splitting into substrings of only one character (instead of 4) and then reshaping the matrix. – Jaime M. Legaz Jul 05 '19 at 15:07
  • I've seen your edit @akrun. I did the conversion to numeric simply with as.integer. Then I did a reshape of that array. – Jaime M. Legaz Jul 05 '19 at 15:11
  • @JaimeMartínez I was thinking that `"0100"` or `"0010"`, when converted with `as.integer`, would become only 1. So I used `strtoi` – akrun Jul 05 '19 at 16:13
  • Oh I see. In the end it just removed the zeroes to the left, so "0100" turned to 100. – Jaime M. Legaz Jul 05 '19 at 16:46
  • Ok, in that case, it is just `as.integer` – akrun Jul 05 '19 at 16:47
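
The approach described in the comments (split into single characters, convert with `as.integer`, then reshape) can be sketched in base R as follows. This is a minimal sketch on a made-up two-string input (`seq_demo` is hypothetical); it assumes every string has the same length and that the length is a multiple of 4. With the real data the same steps would presumably give dimensions (704, 250, 4).

```r
# Hypothetical stand-in for seqN: 2 strings of 8 characters each
seq_demo <- c("01000010", "10000001")

# One row per string, one column per character
m <- do.call(rbind, strsplit(seq_demo, ""))
storage.mode(m) <- "integer"   # "0"/"1" characters become 0/1 integers

n_seq   <- nrow(m)       # number of sequences
n_steps <- ncol(m) / 4   # number of 4-wide groups per sequence

# R fills arrays column-major, so fill a (4, steps, sequences) array
# from the transposed matrix and then reorder the axes.
x <- aperm(array(t(m), dim = c(4, n_steps, n_seq)), c(3, 2, 1))

print(dim(x))     # sequences, steps, channels
print(x[1, 2, ])  # second group of the first string
```

If reticulate is available, `array_reshape(m, c(n_seq, n_steps, 4))` on the integer matrix should give the same layout, since it reshapes in row-major order by default.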
2

Here is a simple way to split a long string into substrings of length 4. Just adjust the variable n according to your needs:

mystring <- "110010101101"
n <- 2 # in general: n <- nchar(mystring) / 4 - 1

# start positions 1, 5, 9, ...; take 4 characters from each
sapply(1 + 4*0:n, function(z) substr(mystring, z, z+3))
[1] "1100" "1010" "1101"
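
The same idea can be written without setting n by hand, using the vectorised `substring`. A minimal sketch; `split_fixed` is a hypothetical helper name, not part of the original answer:

```r
# Hypothetical helper: split one string into consecutive chunks of `width` characters
split_fixed <- function(s, width = 4) {
  starts <- seq(1, nchar(s), by = width)    # 1, 5, 9, ...
  substring(s, starts, starts + width - 1)  # vectorised over the start positions
}

parts <- split_fixed("110010101101")
print(parts)  # "1100" "1010" "1101"
```

If the string length is not a multiple of `width`, the last chunk is simply shorter.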
Cettt
  • This seems to be what I was looking for, since it lets me convert the characters to integers without messing anything up. Don't know if that's exactly what my model wants but I'll find out now. – Jaime M. Legaz Jul 05 '19 at 14:55
  • Ok, so this definitely worked, although akrun's solution gets the same result in just one line. – Jaime M. Legaz Jul 05 '19 at 15:18
2

You could use stringr to extract all sequences of up to 4 characters:

library(stringr)
str_extract_all(seqN, ".{1,4}", simplify = TRUE)[1,]
 [1] "0100" "0010" "0100" "0010" "1000" "0100" "0100" "0100" "0100" "0001" "0010" "1000" "0100" "0100" "1000" "0010" "0100" "1000" "0010"
[20] "0001" "0010" "0001" "0010" "0100" "0100" "0010" "1000" "010" 
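
Note that the pattern `.{1,4}` also returns a shorter final chunk when the string length is not a multiple of 4, as the trailing "010" in the output shows. A base R sketch of the same pattern plus a length check, on a made-up 11-character string (`s` is illustrative, and `regmatches`/`gregexpr` are swapped in here for stringr):

```r
s <- "01000010010"   # illustrative string, 11 characters

# Base R analogue of the ".{1,4}" extraction
chunks <- regmatches(s, gregexpr(".{1,4}", s))[[1]]

# Number of characters left over after the last full group of 4
leftover <- nchar(s) %% 4

print(chunks)
print(leftover)
```

A nonzero `leftover` means the string cannot be reshaped into complete groups of 4 without trimming or padding.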
Andrew