12

How to split a byte[] around a byte sequence in Java? Something like the byte[] version of String#split(regex).

Example

Let's take this byte array:
[11 11 FF FF 22 22 22 FF FF 33 33 33 33]

and let's choose the delimiter to be
[FF FF]

Then the split will result in these three parts:
[11 11]
[22 22 22]
[33 33 33 33]

EDIT:

Please note that you cannot convert the byte[] to a String, split it, and then convert the pieces back, because of encoding issues. After such a conversion the resulting byte[] can differ from the original. Please refer to this: Conversion of byte[] into a String and then back to a byte[]
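To illustrate the encoding problem, here is a minimal sketch that decodes bytes as UTF-8 and encodes them again. The class name RoundTripDemo and the helper roundTrip are made up for this example; UTF-8 is chosen only as one common charset that shows the effect.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Decodes the bytes as UTF-8 and encodes the resulting String again.
    public static byte[] roundTrip(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] original = {0x11, (byte) 0xFF, (byte) 0xFF, 0x22};
        byte[] back = roundTrip(original);
        // 0xFF is never valid in UTF-8, so the decoder substitutes the
        // replacement character and the original bytes are lost.
        System.out.println(Arrays.equals(original, back)); // false
    }
}
```

Plain ASCII bytes survive this round trip, which is why the problem often goes unnoticed with text-like test data.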

Ori Popowski
  • No, it's not. Please read more carefully. – Ori Popowski Mar 19 '14 at 22:30
  • Iterate over the array; compare the next `delimiter.length` bytes to the delimiter, and split as needed? What exactly are you having trouble with? – Henry Keiter Mar 19 '14 at 22:42
  • Yes, I can do this, but I'm looking for an existing solution rather than reinventing the wheel. It's better practice to reuse existing, proven, tested code than to write your own. – Ori Popowski Mar 19 '14 at 22:44
  • Is encoding an issue because you're dealing with guaranteed non-textual data or is this an artificial constraint? If you know what the encoding is going to be, it stops being a problem. – avgvstvs Mar 19 '14 at 23:02
  • Possible duplicate of http://stackoverflow.com/questions/1387027/java-regex-on-byte-array?rq=1 – avgvstvs Mar 19 '14 at 23:03
  • @avgvstvs Yes, I'm dealing with guaranteed non-textual data. – Ori Popowski Mar 19 '14 at 23:10

7 Answers

11

Here is a straightforward solution.

Unlike avgvstvs's approach, it handles delimiters of arbitrary length. The top answer is also good, but its author hasn't fixed the issue pointed out by Eitan Perkal. That issue is avoided here using the approach Perkal suggests.

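import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public static List<byte[]> tokens(byte[] array, byte[] delimiter) {
    List<byte[]> byteArrays = new LinkedList<>();
    if (delimiter.length == 0) {
        return byteArrays;
    }
    int begin = 0;

    outer:
    for (int i = 0; i < array.length - delimiter.length + 1; i++) {
        for (int j = 0; j < delimiter.length; j++) {
            if (array[i + j] != delimiter[j]) {
                continue outer;
            }
        }
        // Delimiter found at i: emit the bytes collected since the previous one.
        byteArrays.add(Arrays.copyOfRange(array, begin, i));
        begin = i + delimiter.length;
        // Resume just past the delimiter (i++ follows), so that overlapping
        // matches such as FF FF FF cannot leave begin > i.
        i = begin - 1;
    }
    byteArrays.add(Arrays.copyOfRange(array, begin, array.length));
    return byteArrays;
}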
L. Blanc
8

Note that you can reliably convert from byte[] to String and back, with a one-to-one mapping of chars to bytes, if you use the encoding "ISO-8859-1".

However, it's still an ugly solution.
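For completeness, the ISO-8859-1 route could look roughly like the sketch below. The class Latin1Split and its split method are hypothetical names, and rejecting an empty delimiter is an assumption made here. It uses String.indexOf rather than String.split so that the delimiter is searched literally and not interpreted as a regex.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Latin1Split {
    // Splits data around delimiter by going through a Latin-1 String,
    // where every byte value maps to exactly one char.
    public static List<byte[]> split(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("empty delimiter");
        }
        String s = new String(data, StandardCharsets.ISO_8859_1);
        String d = new String(delimiter, StandardCharsets.ISO_8859_1);
        List<byte[]> parts = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = s.indexOf(d, start)) >= 0) {
            parts.add(s.substring(start, hit).getBytes(StandardCharsets.ISO_8859_1));
            start = hit + d.length();
        }
        // Whatever follows the last delimiter (possibly empty).
        parts.add(s.substring(start).getBytes(StandardCharsets.ISO_8859_1));
        return parts;
    }
}
```

This works, but it copies the data into a String and back, which is part of why it remains an unattractive option for large binary inputs.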

I think you'll need to roll your own.

I suggest solving it in two stages:

  1. Work out how to find the index of each occurrence of the separator. Google for "Knuth-Morris-Pratt" for an efficient algorithm, although a more naive algorithm will be fine for short delimiters.
  2. Each time you find an index, use Arrays.copyOfRange() to get the piece you need and add it to your output list.
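The two stages above could be sketched with Knuth-Morris-Pratt as follows. This is only an illustrative sketch, not code from this answer: the class KmpSplit and the method names indexesOf and split are made up for the example, and it assumes that occurrences should not overlap and that an empty pattern yields no matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KmpSplit {
    // Stage 1: start indexes of the non-overlapping occurrences of pattern.
    public static List<Integer> indexesOf(byte[] input, byte[] pattern) {
        List<Integer> hits = new ArrayList<>();
        if (pattern.length == 0) {
            return hits;
        }
        // failure[k] = length of the longest proper prefix of pattern[0..k]
        // that is also a suffix of pattern[0..k].
        int[] failure = new int[pattern.length];
        for (int k = 1, len = 0; k < pattern.length; k++) {
            while (len > 0 && pattern[k] != pattern[len]) {
                len = failure[len - 1];
            }
            if (pattern[k] == pattern[len]) {
                len++;
            }
            failure[k] = len;
        }
        // Scan the input once; on a mismatch fall back via the failure table
        // instead of re-reading input bytes.
        for (int i = 0, matched = 0; i < input.length; i++) {
            while (matched > 0 && input[i] != pattern[matched]) {
                matched = failure[matched - 1];
            }
            if (input[i] == pattern[matched]) {
                matched++;
            }
            if (matched == pattern.length) {
                hits.add(i - pattern.length + 1);
                matched = 0; // restart so that occurrences never overlap
            }
        }
        return hits;
    }

    // Stage 2: copy out the pieces between the occurrences.
    public static List<byte[]> split(byte[] input, byte[] pattern) {
        List<byte[]> parts = new ArrayList<>();
        int start = 0;
        for (int hit : indexesOf(input, pattern)) {
            parts.add(Arrays.copyOfRange(input, start, hit));
            start = hit + pattern.length;
        }
        parts.add(Arrays.copyOfRange(input, start, input.length));
        return parts;
    }
}
```

Keeping the search separate from the copying means the naive matcher and the KMP matcher can be swapped without touching stage 2.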

Here it is using a naive pattern-finding algorithm. KMP becomes worthwhile when the delimiters are long: it avoids backtracking over bytes that were already compared, yet it still finds a delimiter that begins inside a partial match that fails at its end.

public static boolean isMatch(byte[] pattern, byte[] input, int pos) {
    for(int i=0; i< pattern.length; i++) {
        if(pattern[i] != input[pos+i]) {
            return false;
        }
    }
    return true;
}

public static List<byte[]> split(byte[] pattern, byte[] input) {
    List<byte[]> l = new LinkedList<byte[]>();
    int blockStart = 0;
    for(int i=0; i<input.length; i++) {
       if(isMatch(pattern,input,i)) {
          l.add(Arrays.copyOfRange(input, blockStart, i));
          blockStart = i+pattern.length;
          i = blockStart;
       }
    }
    l.add(Arrays.copyOfRange(input, blockStart, input.length ));
    return l;
}
slim
  • It's always good to read the book C Programming Language, which has a ton of exercises that force you to come up with these kinds of solutions. Then you can move to Java with that toolset under your belt. – JohnMerlino Jun 26 '14 at 17:50
  • The above code will fail if the input ends with the start of the pattern (java.lang.ArrayIndexOutOfBoundsException), for example: byte[] pattern= { (byte) 0x43, (byte) 0x23}; byte[] input = { (byte) 0x08, (byte) 0x01, (byte) 0x53, (byte) 0x43}; - one simple solution is to change the split method in: for(int i=0; i – Eitan Rimon Nov 13 '14 at 14:16
  • the line `i = blockStart;` also is incorrect, since `i++` is executed afterwards. The Problem will occur with patterns of length 1. – IARI Sep 20 '17 at 08:56
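
For reference, here is one way the two problems raised in these comments could be addressed. It is a sketch rather than slim's own fix: the class name NaiveSplit is hypothetical, and returning the whole input as a single block for an empty pattern is an assumption made here.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

public class NaiveSplit {
    public static boolean isMatch(byte[] pattern, byte[] input, int pos) {
        // A pattern that would run past the end of the input cannot match;
        // this avoids the ArrayIndexOutOfBoundsException.
        if (pos + pattern.length > input.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (pattern[i] != input[pos + i]) {
                return false;
            }
        }
        return true;
    }

    public static List<byte[]> split(byte[] pattern, byte[] input) {
        List<byte[]> l = new LinkedList<>();
        if (pattern.length == 0) {
            // Assumption: nothing to split on, so return one block.
            l.add(input.clone());
            return l;
        }
        int blockStart = 0;
        int i = 0;
        while (i < input.length) {
            if (isMatch(pattern, input, i)) {
                l.add(Arrays.copyOfRange(input, blockStart, i));
                blockStart = i + pattern.length;
                i = blockStart; // no i++ follows, so no byte is skipped
            } else {
                i++;
            }
        }
        l.add(Arrays.copyOfRange(input, blockStart, input.length));
        return l;
    }
}
```

Using a while loop instead of a for loop makes the "jump past the delimiter" step explicit, which is what the one-byte-pattern case depends on.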
4

I modified L. Blanc's answer to handle delimiters at the very beginning and at the very end. I also renamed the method to 'split'.

private List<byte[]> split(byte[] array, byte[] delimiter)
{
   List<byte[]> byteArrays = new LinkedList<byte[]>();
   if (delimiter.length == 0)
   {
      return byteArrays;
   }
   int begin = 0;

   outer: for (int i = 0; i < array.length - delimiter.length + 1; i++)
   {
      for (int j = 0; j < delimiter.length; j++)
      {
         if (array[i + j] != delimiter[j])
         {
            continue outer;
         }
      }

      // If delimiter is at the beginning then there will not be any data.
      if (begin != i)
         byteArrays.add(Arrays.copyOfRange(array, begin, i));
      begin = i + delimiter.length;
   }

   // delimiter at the very end with no data following?
   if (begin != array.length)
      byteArrays.add(Arrays.copyOfRange(array, begin, array.length));

   return byteArrays;
}
Roger
0

Rolling your own is the only way to go here. The best idea I can offer if you're open to non-standard libraries is this class from Apache:

http://commons.apache.org/proper/commons-primitives/apidocs/org/apache/commons/collections/primitives/ArrayByteList.html

Knuth's solution is probably the best, but I would treat the array as a stack and do something like this:

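// Assumes stack is a java.util.Stack<Byte> loaded so that the first input byte is on top.
List<ArrayByteList> targetList = new ArrayList<ArrayByteList>();
ArrayByteList tmp = new ArrayByteList();
while(!stack.empty()){
  byte top = stack.pop();

  // Compare with (byte) 0xff: a Java byte is signed, so top == 0xff is never true.
  if( top == (byte) 0xff && !stack.empty() && stack.peek() == (byte) 0xff){
    stack.pop();               // drop the second delimiter byte
    targetList.add(tmp);       // the block before the delimiter is complete
    tmp = new ArrayByteList();
  }else{
    tmp.add(top);
  }
}
targetList.add(tmp);           // whatever follows the last delimiter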

I'm aware that this is pretty quick and dirty but it should deliver O(n) in all cases.

avgvstvs
  • Fine for a simple two-byte delimiter but doesn't address more complex patterns -- which might be OK for the OP. – slim Mar 20 '14 at 11:23
0

This is an improvement on Roger's answer (https://stackoverflow.com/a/44468124/1291605). Suppose we have the array ||||aaa||bbb and the delimiter ||. In this case we get

java.lang.IllegalArgumentException: 2 > 1
    at java.util.Arrays.copyOfRange(Arrays.java:3519)

So the final improved solution:

public static List<byte[]> split(byte[] array, byte[] delimiter) {
        List<byte[]> byteArrays = new LinkedList<>();
        if (delimiter.length == 0) {
            return byteArrays;
        }
        int begin = 0;

        outer:
        for (int i = 0; i < array.length - delimiter.length + 1; i++) {
            for (int j = 0; j < delimiter.length; j++) {
                if (array[i + j] != delimiter[j]) {
                    continue outer;
                }
            }

            // This condition was changed to 'less', so an overlapping match cannot produce begin > i
            if (begin < i)
                byteArrays.add(Arrays.copyOfRange(array, begin, i));
            begin = i + delimiter.length;
        }

        // Also here we may change condition to 'less'
        if (begin < array.length)
            byteArrays.add(Arrays.copyOfRange(array, begin, array.length));

        return byteArrays;
    }
Artem
-3

You can use Arrays.copyOfRange() for that.
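
As a small illustration of what Arrays.copyOfRange() does once a delimiter position is known, consider the sketch below. The class CopyOfRangeDemo and the helper slice are hypothetical names, and the delimiter position is hard-coded here rather than searched for.

```java
import java.util.Arrays;

public class CopyOfRangeDemo {
    // Returns the bytes of data in the half-open range [from, to).
    public static byte[] slice(byte[] data, int from, int to) {
        return Arrays.copyOfRange(data, from, to);
    }

    public static void main(String[] args) {
        byte[] data = {0x11, 0x11, (byte) 0xFF, (byte) 0xFF, 0x22, 0x22, 0x22};
        // The delimiter FF FF starts at index 2 and has length 2.
        byte[] before = slice(data, 0, 2);
        byte[] after = slice(data, 4, data.length);
        System.out.println(Arrays.toString(before)); // [17, 17]
        System.out.println(Arrays.toString(after));  // [34, 34, 34]
    }
}
```

Note that copyOfRange throws IllegalArgumentException when from is greater than to, so the caller still has to locate the delimiters and keep the two indexes in order.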

Maysam Torabi
-4

Refer to Java Doc for String

You can construct a String object from byte array. Guess you know the rest.

public static byte[][] splitByteArray(byte[] bytes, byte[] regex, Charset charset) {
    String str = new String(bytes, charset);
    String[] split = str.split(new String(regex, charset));
    byte[][] byteSplit = new byte[split.length][];
    for (int i = 0; i < split.length; i++) {
        byteSplit[i] = split[i].getBytes(charset);
    }
    return byteSplit;
}

public static void main(String[] args) {
    Charset charset = Charset.forName("UTF-8");
    byte[] bytes = {
        '1', '1', ' ', '1', '1',
        'F', 'F', ' ', 'F', 'F',
        '2', '2', ' ', '2', '2', ' ', '2', '2',
        'F', 'F', ' ', 'F', 'F',
        '3', '3', ' ', '3', '3', ' ', '3', '3', ' ', '3', '3'
    };
    byte[] regex = {'F', 'F', ' ', 'F', 'F'};
    byte[][] splitted = splitByteArray(bytes, regex, charset);
    for (byte[] arr : splitted) {
        System.out.print("[");
        for (byte b : arr) {
            System.out.print((char) b);
        }
        System.out.println("]");
    }
}
devmtl
  • I would recommend writing a little sample code so that users with the same problem can find the answer more easily. Because they might not "know the rest". Thanks! – Andrew Gies Mar 19 '14 at 22:22
  • It won't work: http://stackoverflow.com/questions/2758654/conversion-of-byte-into-a-string-and-then-back-to-a-byte – Ori Popowski Mar 19 '14 at 22:26