
I have a list of elements to search for in datasets of variable length. I have tried binary search and found it is not always efficient when the objective is to search for a whole list of elements.

I did the following study and concluded that if the number of elements to be searched for is less than 5% of the data, binary search is efficient; otherwise linear search is better.

Below are the details:
Number of elements: 100,000
Number of elements to be searched: 5,000
Number of iterations (binary search) = log2(N) x SearchCount = log2(100000) x 5000 ≈ 83,048

Further increasing the number of search elements leads to more iterations than a linear search.
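The break-even point implied by this model can be checked numerically. Below is a quick Python sketch (the function name is mine) that reproduces the 83,048 figure and estimates the crossover count as the M where M x log2(N) equals N; note the simple model puts it near 6% here, slightly above the 5% threshold used:

```python
import math

def crossover_search_count(n):
    """Largest M for which M binary searches (M * log2(N) comparisons)
    still beat one linear scan of all N elements, per the simple model."""
    return int(n / math.log2(n))

n = 100_000
m = 5_000
binary_iterations = m * math.log2(n)
print(round(binary_iterations))        # 83048
print(crossover_search_count(n))       # 6020, i.e. roughly 6% of N
```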

Any thoughts on this?

I am calling the function below only if the number of elements to be searched is less than 5%.

    private int SearchIndex(ref List<long> entitylist, ref long[] DataList, int i, int len, ref int listcount)
    {
        int Start = i;
        int End = len - 1;

        while (Start <= End)
        {
            // Written this way to avoid int overflow of (Start + End) for large indices
            int mid = Start + (End - Start) / 2;
            long target = DataList[mid];

            if (target == entitylist[listcount])
            {
                listcount++;
                return mid;
            }
            else if (target < entitylist[listcount])
            {
                Start = mid + 1;
            }
            else // target > entitylist[listcount]
            {
                End = mid - 1;
            }
        }
        listcount++;
        return -1; // the element in the list is not in the dataset
    }

In the code I return the index rather than the value because I need to work with the index in the calling function. If i = -1, the calling function resets the value to the previous i and calls the function again with a new element to search.

Raghu
  • What's the question? – Norhther Aug 02 '18 at 08:44
  • log2 (100000) = 83048? – HRK44 Aug 02 '18 at 08:45
  • You mean that by sorting the list of elements to be searched, you can optimize the search through the sorted dataset. This can be accomplished easily with linear search. However, you can limit the binary searches to a subset of the dataset if you start from the sorted list of elements to be searched. Post some working code and I'll show how, if it's not clear enough. – Sigi Aug 02 '18 at 09:03
  • @HRK44 that's a good spot. I updated the description. – Raghu Aug 02 '18 at 09:40
  • @Sigismondo I have both the dataset and the list of search elements sorted. For the binary search, the start point is set to the previous hit point. In a list of 1M data, if my first element is found at the 10k location, the start point for the second search is kept at 10k. I don't know if there is a better way. I have posted the code. – Raghu Aug 02 '18 at 09:49
  • You're modifying your linear search to fit the problem, but using a naive version of binary search. Seems a little unfair :'( – avigil Aug 02 '18 at 09:53
  • @avigil in fact you don't need a naive binary search at all - I only observed that you don't need to perform the binary search on the full range, if you have information regarding a subrange to which the search can be restricted. – Sigi Aug 03 '18 at 10:27
  • @Sigismondo I know, I'm saying that's what OP is doing wrong. – avigil Aug 04 '18 at 14:51

1 Answer


In your problem you are looking for M values in a sorted array of N values, with N > M, but M can be quite large.

Usually this is approached as M independent binary searches (possibly with the slight optimization of using the previous result as a starting point): that costs O(M*log(N)).

However, using the fact that the M values are also sorted, you can find all of them in one pass with a linear merge-style scan. In this case the problem becomes O(N), which beats O(M*log(N)) for large M.
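The one-pass scan described above can be sketched as follows (Python used for brevity; the function name is mine). Because both lists are sorted, the read cursor over the dataset never moves backwards:

```python
def find_all_linear(data, targets):
    """One-pass merge-style scan over two sorted lists. Returns the index
    in `data` of each target, or -1 if absent: O(N + M) comparisons."""
    results = []
    i = 0
    for t in targets:
        # Advance the cursor; it never rewinds, so data is scanned once.
        while i < len(data) and data[i] < t:
            i += 1
        if i < len(data) and data[i] == t:
            results.append(i)
        else:
            results.append(-1)
    return results

data = [2, 4, 7, 9, 12, 15]
print(find_all_linear(data, [4, 9, 10, 15]))  # [1, 3, -1, 5]
```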

But you have a third option: since the M values are sorted, binary-split M too. Every time you find a value at some index, the subsequent searches for the targets to its left and to its right can be limited to the dataset ranges to the left and to the right of the found index.

The first look-up is over all N values, the next two over (on average) N/2 each, then four over N/4, ... I think this scales as O(log(M)*log(N)). Not sure of it, comments welcome!
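A minimal Python sketch of this divide-and-conquer idea (names are mine; the standard-library `bisect_left` plays the role of a binary search restricted to a sub-range of the dataset):

```python
import bisect

def find_all_split(data, targets, lo=0, hi=None, out=None):
    """Binary-split the sorted target list: locate the middle target with a
    binary search over data[lo:hi], then recurse on the two target halves
    with the correspondingly narrowed dataset ranges."""
    if hi is None:
        hi = len(data)
    if out is None:
        out = {}
    if not targets:
        return out
    mid = len(targets) // 2
    t = targets[mid]
    pos = bisect.bisect_left(data, t, lo, hi)
    out[t] = pos if pos < hi and data[pos] == t else -1
    find_all_split(data, targets[:mid], lo, pos, out)      # smaller targets: left range
    find_all_split(data, targets[mid + 1:], pos, hi, out)  # larger targets: right range
    return out

data = [2, 4, 7, 9, 12, 15]
print(find_all_split(data, [4, 9, 15]))  # {9: 3, 4: 1, 15: 5}
```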

Here is some test code - I have slightly modified your code, but without altering its functionality.

In case you have M = 100000 and N = 1000000, the "M binary searches" approach takes about 1.8M iterations - more than the 1M needed to scan the N values linearly. But with what I suggest it takes just 272K iterations.

Even when the M values are very "collapsed" (e.g., they are consecutive) and linear search is at its best (100K iterations would be enough to get all of them, see the comments in the code), the algorithm performs very well.

Sigi
  • Your solution looks interesting - about 4 times faster than linear search for a large search list. I will review it further and come back on this. Thanks for the great effort. – Raghu Aug 03 '18 at 17:42