I have this sequence: "ggtacctcctacgggaggcagcagtgaggaattttccgcaatgggcgaaagcctgacgga"
and I want to break it into units of 3 characters each, like ggt, acc, tcc, etc. How can I do that?

ROMANIA_engineer

Raed Tabani
You cannot. The tokenizer needs a delimiter. Why not just iterate through the string in 3-char increments? – Kai Mattern Mar 01 '15 at 18:37
SMA showed that you can. However, it performs really badly. – CoronA Mar 02 '15 at 15:40
4 Answers
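Try something like this, where s holds the sequence. The lookbehind (?<=\G...) matches the empty position after every three characters, counted from the end of the previous match:
String[] str = s.split("(?<=\\G...)");
Output, as printed by Arrays.toString(str):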
[ggt, acc, tcc, tac, ggg, agg, cag, cag, tga, gga, att, ttc, cgc, aat, ggg, cga, aag, cct, gac, gga]
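For a group size other than three, the same idea can be parameterized. This is only a minimal sketch: the class name RegexChunks and the helper split(String, int) are made up for illustration and are not part of the answer above. It also shows what happens when the length is not a multiple of the group size.

```java
import java.util.Arrays;
import java.util.List;

class RegexChunks {
    // Hypothetical helper: cuts s into pieces of `size` characters via the \G lookbehind.
    static List<String> split(String s, int size) {
        // (?<=\G.{size}) matches the empty position exactly `size` characters
        // after the end of the previous match (or after the start of the input).
        String regex = "(?<=\\G.{" + size + "})";
        return Arrays.asList(s.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(split("ggtacctcc", 3)); // [ggt, acc, tcc]
        System.out.println(split("ggtacctc", 3));  // [ggt, acc, tc]
    }
}
```

A leftover shorter than the group size simply becomes the last element, and String.split drops the trailing empty string when the length divides evenly.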

SMA
Do not use a StringTokenizer. Splitting with a regular expression is also really inefficient, and DNA/RNA strings tend to be very long.
In Java 8 one could use the following solution:
import java.util.List;
import java.util.function.IntFunction;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "ggtacctcctacgggaggcagcagtgaggaattttccgcaatgggcgaaagcctgacgga";
        List<String> collect = str.chars()
                .mapToObj(accumulator(3))   // yields a chunk on every third char, null otherwise
                .filter(s -> s != null)     // drop the nulls produced between chunks
                .collect(Collectors.toList());
        System.out.println(collect);
    }

    private static IntFunction<String> accumulator(final int size) {
        return new CharAccumulator(size);
    }

    // Stateful mapper: buffers chars until `size` of them have arrived.
    private static final class CharAccumulator implements IntFunction<String> {
        private final StringBuilder builder;
        private final int size;

        private CharAccumulator(int size) {
            this.builder = new StringBuilder();
            this.size = size;
        }

        @Override
        public String apply(int value) {
            builder.append((char) value);
            if (builder.length() == size) {
                String result = builder.toString();
                builder.setLength(0); // reset the buffer for the next chunk
                return result;
            } else {
                return null;
            }
        }
    }
}
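It is not as easy to understand and may not be as fast, but it also works with lazy char streams, which saves memory. Note that the accumulator is stateful, so the stream must stay sequential, and a trailing group shorter than size is silently dropped.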
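If the whole sequence is already in memory as a String, a stateless stream over chunk indices is one possible alternative. This is only a sketch and not the approach above: it gives up the lazy char stream, and the names StreamChunks and chunk are invented for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class StreamChunks {
    // Hypothetical helper: one substring per chunk index, no shared mutable state.
    static List<String> chunk(String s, int size) {
        int count = (s.length() + size - 1) / size; // number of chunks, rounded up
        return IntStream.range(0, count)
                .mapToObj(i -> s.substring(i * size, Math.min((i + 1) * size, s.length())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(chunk("ggtacctcc", 3)); // [ggt, acc, tcc]
        System.out.println(chunk("ggtacctc", 3));  // [ggt, acc, tc]
    }
}
```

Because each element depends only on its index, this variant keeps a short trailing group instead of dropping it.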

CoronA
You could try something like the following: convert the String to a char[] and loop through it in steps of 3, building one String from each group:
// requires: import java.util.ArrayList; import java.util.List;
String str = "ggtacctcctacgggaggcagcagtgaggaattttccgcaatgggcgaaagcctgacgga";
char[] array = str.toCharArray();
List<String> result = new ArrayList<>();
for (int i = 0; i < array.length; i += 3)
{
    StringBuilder s = new StringBuilder();
    // copy up to three chars; the first bound guards a short final group
    for (int j = i; j < array.length && j < i + 3; j++)
    {
        s.append(array[j]);
    }
    result.add(s.toString());
}
The list result now contains strings of length three (the last one may be shorter), and the loop does not break if the length is not a multiple of three.
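The inner loop can also be replaced by the String(char[], int, int) constructor. The following is a rough sketch of that variation; the class name CharArrayChunks and the method chunk are hypothetical, and the second call shows the behavior for a length that is not a multiple of three.

```java
import java.util.ArrayList;
import java.util.List;

class CharArrayChunks {
    // Hypothetical helper: same char[] walk as above, generalized to any group size.
    static List<String> chunk(String s, int size) {
        char[] array = s.toCharArray();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < array.length; i += size) {
            // the final group may hold fewer than `size` chars
            int len = Math.min(size, array.length - i);
            result.add(new String(array, i, len));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(chunk("ggtacctcc", 3)); // [ggt, acc, tcc]
        System.out.println(chunk("ggtacctc", 3));  // [ggt, acc, tc]
    }
}
```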

Gregory Basior
Here is another solution that uses the substring method (without StringTokenizer):
public static void main(String[] args) {
    String s = "ggtacctcctacgggaggcagcagtgaggaattttccgcaatgggcgaaagcctgacgga";
    // round up so that a short final group still gets its own row
    char[][] c = new char[(s.length() + 2) / 3][];
    for (int i = 0; i < s.length(); i += 3) {
        // clamp the end index so substring does not throw on a short final group
        String substring = s.substring(i, Math.min(i + 3, s.length()));
        c[i / 3] = substring.toCharArray();
    }
    // test
    for (int i = 0; i < c.length; i++) {
        for (int j = 0; j < c[i].length; j++) {
            System.out.print(c[i][j]);
        }
        System.out.println();
    }
}

ROMANIA_engineer