
I have a vector that looks like this:

y =

 Columns 1 through 19:

   1   1   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2

 Columns 20 through 38:

   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3   4   4   4   4

 Columns 39 through 57:

   4   4   4   4   4   4   4   5   5   5   5   5   5   5   5   5   5   5   6

 Columns 58 through 67:

   6   6   6   6   6   6   6   6   6   6

The vector y always starts at 1 and counts upward. Many of the numbers repeat because each entry is the class of a sample.

Here we have 1 1 1 1 1 1 1 1 1 1 1 1 = 12 samples for class number 1.

We have 2 2 2 2 2 2 2 2 2 2 2 = 11 samples for class number 2.

My problem is that I want to find the start and stop index of every class. For example, class 1 always begins at index 0 and, in this case, ends at index 11.

Class 2 begins directly after class 1 ends.

Question:

I'm using EJML (Efficient Java Matrix Library) and I'm planning to use this function:

C = A.extractMatrix(1,4,2,8) 

Which is equal to this MATLAB code:

C = A(2:4,3:8) 

But I need to find the start and stop indexes from this y vector. At what index does, e.g., class 3 start and stop? Do you have any smart ideas for how to do that?

Sure, I could use a for-loop to do this, but for-loops in Java are quite slow, and I'm going to have a very large y vector.

Suggestions?

Edit:

Here is a suggestion. Is it good, or could it be done better?

private void startStopIndex(SimpleMatrix y, int c, Integer[] startStop) {
    int column = y.numCols();
    startStop[0] = startStop[1] + 1; // Begin at the next class
    for(int i = startStop[0]; i < column; i++) {
        if(y.get(i) != c) {
            break;
        }else {
            startStop[1] = i;
        }
    }

}

Assuming that we are calling the method from:

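        Integer[] startStop = {-1, -1}; // stop = -1 so that the first class starts at index 0
        for(int i = 1; i <= c; i++) { // c is the number of classes; labels run from 1 to c
            startStopIndex(y, i, startStop);
        }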
euraad
  • Is something like `1 1 1 2 2 1 1` possible? (several runs of the same number). What about `1 1 3 3 4 4 4`? (skipped number). And `3 3 3 1 1 2 2 2`? (not sorted). Please specify the problem as much as possible – Luis Mendo May 08 '20 at 15:11
  • @LuisMendo No. It's always e.g 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 6 6 7 7 8 8 8 8 8 8 8 8 8 – euraad May 08 '20 at 15:18
  • I doubt the for loop will be slow, but there is no way to find it without looping over the data. Even if the matrix library had a function to do this, it would be looping inside that function. The only other way would be to keep track of the points where it changes while the data is being generated or captured. – David Conrad May 08 '20 at 15:22
  • So: sorted, no skipping numbers? Please include that in the question text – Luis Mendo May 08 '20 at 15:22
  • For reference, a Matlab solution would be `starts = [1 find(diff(y))+1]; ends = [starts(2:end)-1 numel(y)];` (1-based indexing) – Luis Mendo May 08 '20 at 15:26
  • @LuisMendo Yes. I have posted a suggestion. What do you think about that? – euraad May 08 '20 at 15:28
  • @DanielMårtensson I'm not very good at Java, sorry – Luis Mendo May 08 '20 at 15:29
  • @LuisMendo No worry. It's very basic java. – euraad May 08 '20 at 15:30

3 Answers


I think there is a name for this, but I can't remember what it is. The idea is to look for the next boundary with an accelerating search, and then use a binary search to pin it down.

You know the numbers are in ascending order, and there are potentially a lot of the same number, so you start by checking the next element. But instead of continuing 1 step at a time, you accelerate and step 2, 4, 8, 16, ... until you find a higher number.

Once you've found a higher number, you've gone too far, but the previous step still had the initial number, so you know the boundary is somewhere between the last two steps. You then apply a binary search to that range to find the boundary.

Once you've found the boundary, you start over, stepping 1, 2, 4, ... for the next boundary.

If you expect most numbers to have about the same number of occurrences, you could keep a running average count, and make the first step with that average, to get a running start.

I'll leave it to you to actually code this.
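
For readers who want a starting point, the steps described above might be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the answerer's own code: it assumes a plain `int[]` that is sorted in ascending order and shorter than about 2^30 elements, and the names `GallopingBoundaries` and `findRuns` are made up for this example. The running-average head start is left out.

```java
import java.util.ArrayList;
import java.util.List;

class GallopingBoundaries {

    // Returns one {start, stop} pair (0-based, inclusive) per run of equal values.
    // Assumption: y is sorted in ascending order and has fewer than 2^30 elements.
    static List<int[]> findRuns(int[] y) {
        List<int[]> runs = new ArrayList<>();
        int start = 0;
        while (start < y.length) {
            int value = y[start];

            // Accelerating phase: jump ahead by 1, 2, 4, ... while the value is unchanged.
            int lo = start; // last index known to hold `value`
            int step = 1;
            while (step < y.length - lo && y[lo + step] == value) {
                lo += step;
                step *= 2;
            }
            // hi is either an index holding a larger value or y.length (past the end).
            int hi = (step < y.length - lo) ? lo + step : y.length;

            // Binary phase: find the first index in [lo + 1, hi] that does not hold `value`.
            int left = lo + 1;
            int right = hi;
            while (left < right) {
                int mid = left + (right - left) / 2;
                if (y[mid] == value) {
                    left = mid + 1;
                } else {
                    right = mid;
                }
            }

            runs.add(new int[] {start, left - 1});
            start = left; // the next run begins right after this one
        }
        return runs;
    }
}
```

Calling `GallopingBoundaries.findRuns(y)` on a sorted label array returns the runs in order, so the entry at position `k - 1` holds the start and stop of class `k` when the labels are 1, 2, 3, ... without gaps. The cost per run is logarithmic in the run length, so the gain over a plain loop is largest when runs are long.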

Andreas
  • Have a look at my suggestion in my question. What do you think about that? – euraad May 08 '20 at 15:27
  • @DanielMårtensson Aren't you the one who said *"for-loops in Java is quite slow"* for this, so why are you suggesting doing exactly that? – Andreas May 08 '20 at 15:55
  • Yes. They are slow, and I'm just looking for...if, there is a better answer than mine. – euraad May 08 '20 at 15:57
  • @DanielMårtensson There is, e.g.: My answer. – Andreas May 08 '20 at 15:59
  • But that is exactly the same as I did, except for the accelerating search. I don't think that would be possible. – euraad May 08 '20 at 16:01
  • @DanielMårtensson Of course the accelerating search is possible, that's why I suggested it, to overcome the "slow" simple iteration loop. – Andreas May 08 '20 at 16:12
  • How can I do that? Can you show me a simple example in an arbitrary language? – euraad May 08 '20 at 16:13
  • @DanielMårtensson I described the logic in the answer. That should be enough to write the code. – Andreas May 08 '20 at 16:15

The code below is in MATLAB. The for loop goes through each unique value stored in x1 and finds the first and last occurrence of that value.

x = [ 1 1 1 2 2 3 3 3 3 3 4 4 4 4 5 5 5 ]
x1 = unique(x)'

for k1 = 1:length(x1)
    x1(k1,2:3) = [find(x == x1(k1,1),1,"first"), find(x == x1(k1,1),1,"last")];
end

The above code yields x1 as a 3-column matrix (value, first index, last index, using 1-based indexing):

 1     1     3
 2     4     5
 3     6    10
 4    11    14
 5    15    17
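
Since the question is about Java, a possible counterpart to the MATLAB loop above is sketched here. It is an assumption-laden illustration rather than a translation: it works on a plain `int[]`, assumes equal values are contiguous, uses 0-based indices (so every index is one less than in the table above), and the names `RunTable` and `firstLast` are invented for the example. Unlike the MATLAB version, which scans the whole vector once per unique value, it makes a single pass.

```java
import java.util.ArrayList;
import java.util.List;

class RunTable {

    // Returns rows of {value, first, last} with 0-based, inclusive indices.
    // Assumption: equal values are contiguous (for example, x is sorted).
    static List<int[]> firstLast(int[] x) {
        List<int[]> rows = new ArrayList<>();
        for (int i = 0; i < x.length; i++) {
            if (i == 0 || x[i] != x[i - 1]) {
                // A new value begins here: open a row with first == last == i.
                rows.add(new int[] {x[i], i, i});
            } else {
                // Same value as before: extend the last index of the current row.
                rows.get(rows.size() - 1)[2] = i;
            }
        }
        return rows;
    }
}
```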
greengrass62

If you want to do it faster, then binary search is your friend. I threw this together really quickly; it finds each breakpoint in O(log n) time, whereas a linear search takes O(n). It's pretty basic and assumes your data looks pretty much like you describe it. Feed it weird data and it will break:

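// For each class k in [low, high), stores at rtrn[k] the index where class k+1 begins.
// rtrn[0] is left at 0, which is where the first class begins when low == 1.
int[] breakPoints(int[] arr, int low, int high){
    int[] rtrn = new int[high];
    for(int i=low;i<high;i++){
        rtrn[i]=binarySearch(arr, i, 0, arr.length-1);
    }
    return rtrn;
}

// Returns the index of the first k+1 that directly follows a k, or -1 if there is none.
int binarySearch(int[] arr, int k, int start, int end){
    if(start>end){
        return -1; //range exhausted without finding a k -> k+1 boundary
    }
    int mid = start+(end-start)/2; //avoids int overflow on very large arrays
    if(mid+1<arr.length && arr[mid]==k && arr[mid+1]==k+1){
        return mid+1; //or just mid if you want before breakpoint
    }
    if(arr[mid]<=k){
        return binarySearch(arr, k, mid+1, end);
    }
    return binarySearch(arr, k, start, mid-1);
}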

You'd call it like this:

int[] data = {1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,5,5,6,6,6,6};
int[] bp = breakPoints(data,1,6);
// returns {0, 3, 8, 13, 16, 18}
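
To get from these breakpoints to the start and stop indexes the question asks for, one option is a small helper like the sketch below. The names `BreakpointRanges` and `toRanges` are hypothetical, and the helper assumes that the array holds the first index of each class in order, as in the result above, and that `n` is the length of the data.

```java
class BreakpointRanges {

    // starts[i] is the first index of the (i+1)-th class; n is the length of the data.
    // Returns {start, stop} pairs with 0-based, inclusive indices.
    static int[][] toRanges(int[] starts, int n) {
        int[][] ranges = new int[starts.length][2];
        for (int i = 0; i < starts.length; i++) {
            ranges[i][0] = starts[i];
            // A class stops right before the next one starts; the last class stops at n - 1.
            ranges[i][1] = (i + 1 < starts.length ? starts[i + 1] : n) - 1;
        }
        return ranges;
    }
}
```

For the example data, `BreakpointRanges.toRanges(bp, data.length)` gives {0, 2}, {3, 7}, {8, 12}, {13, 15}, {16, 17}, {18, 21}. If the upper bounds of `extractMatrix` are exclusive, as the question's MATLAB comparison suggests, then `stop + 1` would be the value to pass there.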
jimboweb