
I am implementing a pattern search algorithm with a vital if statement whose outcome is essentially unpredictable. Random files are searched, so sometimes the branch predictions are okay, and sometimes they are terrible when the file is completely random. My goal is to eliminate the if statement, but what I have tried so far, like preallocating a vector, has been slow: the number of pattern possibilities can be very large, so preallocation takes a lot of time. I therefore use a dynamic vector, initialize all entries to NULL up front, and then check with the if statement whether a pattern is already present. That if seems to be killing me, and specifically its cmp assembly instruction. Branch mispredictions are flushing the pipeline constantly and causing huge slowdowns. Any ideas for eliminating the if statement marked below would be great... stuck in a rut.

for (PListType i = 0; i < prevLocalPListArray->size(); i++)
{
    vector<vector<PListType>*> newPList(256, NULL);

    vector<PListType>* pList = (*prevLocalPListArray)[i];
    PListType pListLength = (*prevLocalPListArray)[i]->size();

    PListType earlyApproximation = (pListLength + 255) / 256; // integer ceiling; ceil(pListLength/256) truncates before ceil can round up

    for (PListType k = 0; k < pListLength; k++)
    {
        //If pattern is past end of string stream then stop counting this pattern
        if ((*pList)[k] < file->fileStringSize)
        {
            uint8_t indexer = ((uint8_t)file->fileString[(*pList)[k]]);

            if(newPList[indexer] != NULL) //Problem if statement!!!!!!!!!!!!!!!!!!!!!
            {
                newPList[indexer]->push_back(++(*pList)[k]);
            }
            else
            {
                newPList[indexer] = new vector<PListType>(1, ++(*pList)[k]);
                newPList[indexer]->reserve(earlyApproximation);
            }
        }
    }

    //Deallocate or stuff patterns in global list
    for (size_t z = 0; z < newPList.size(); z++)
    {
        if(newPList[z] != NULL)
        {
            if (newPList[z]->size() >= minOccurrence)
            {
                globalLocalPListArray->push_back(newPList[z]);
            }
            else
            {
                delete newPList[z];
            }
        }
    }
    delete (*prevLocalPListArray)[i];
}
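
(The first suggestion from the comments, from harold, was simply to create every vector up front so the NULL check disappears from the hot loop. Here is a minimal sketch of that idea, keeping the pointer-based layout; the `makeBuckets` helper and the `PListType` typedef are mine, purely for illustration:)

    #include <cstdint>
    #include <vector>
    using namespace std;
    using PListType = uint64_t; // assumption: PListType is an unsigned integer index type

    // Allocate all 256 buckets before the inner loop runs. The hot loop can then
    // call newPList[indexer]->push_back(...) unconditionally: no NULL test, no branch.
    vector<vector<PListType>*> makeBuckets()
    {
        vector<vector<PListType>*> newPList;
        newPList.reserve(256);
        for (int k = 0; k < 256; k++)
        {
            newPList.push_back(new vector<PListType>());
        }
        return newPList;
    }

That removes the branch but spends 256 heap allocations per outer iteration, occupied or not; the version below removes the indirection entirely, as bogdan suggested.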

Here is the code without the indirection, incorporating the changes proposed in the comments...

    vector<vector<PListType>> newPList(256);

    for (PListType i = 0; i < prevLocalPListArray.size(); i++)
    {
        const vector<PListType>& pList = prevLocalPListArray[i];
        PListType pListLength = prevLocalPListArray[i].size();

        for (PListType k = 0; k < pListLength; k++)
        {
            //If pattern is past end of string stream then stop counting this pattern
            if (pList[k] < file->fileStringSize)
            {
                uint8_t indexer = ((uint8_t)file->fileString[pList[k]]);

                newPList[indexer].push_back((pList[k] + 1));
            }
            else
            {
                totalTallyRemovedPatterns++;
            }
        }
        for (int z = 0; z < 256; z++)
        {

            if (newPList[z].size() >= minOccurrence/* || (Forest::outlierScans && pList->size() == 1)*/)
            {
                globalLocalPListArray.push_back(newPList[z]);
            }
            else
            {
                totalTallyRemovedPatterns++;
            }
            newPList[z].clear();
        }
        vector<PListType> temp;
        temp.swap(prevLocalPListArray[i]); // swap with an empty vector to actually release the memory
    }
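
One further tweak from the comments: the `globalLocalPListArray.push_back(newPList[z]);` line copies the whole bucket, and bogdan suggested trying `push_back(std::move(newPList[z]))` instead. A minimal sketch of the flush loop with that change (the `flushBuckets` wrapper and the `PListType` typedef are mine, for illustration):

    #include <cstdint>
    #include <utility>
    #include <vector>
    using namespace std;
    using PListType = uint64_t; // assumption: PListType is an unsigned integer index type

    // Flush step with std::move applied: the heap buffer is transferred instead of
    // copied, but the moved-from bucket loses its capacity, so reusing newPList
    // across iterations benefits less. In my testing it came out a tad slower.
    void flushBuckets(vector<vector<PListType>>& newPList,
                      vector<vector<PListType>>& globalLocalPListArray,
                      size_t minOccurrence)
    {
        for (size_t z = 0; z < newPList.size(); z++)
        {
            if (newPList[z].size() >= minOccurrence)
            {
                globalLocalPListArray.push_back(std::move(newPList[z]));
            }
            newPList[z].clear(); // a moved-from vector is valid; clear() leaves it empty and reusable
        }
    }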

Here is the most up-to-date program; it manages not to use 3 times the memory and does not require the problematic if statement. The only bottleneck seems to be the `newPList[indexIntoFile].push_back(++index);` statement. That could be caused by cache misses when indexing the array, because the patterns are random. When I search a binary file containing just 1s and 0s, I don't see any latency on the push_back statement. That is why I believe it is cache thrashing. Do you guys see any room for optimization in this code still? You guys have been a great help so far. @bogdan @harold

vector<PListType> newPList[256];
PListType prevPListSize = prevLocalPListArray->size();
PListType indexes[256] = {0};
PListType indexesToPush[256] = {0};
for (PListType i = 0; i < prevPListSize; i++)
{
    vector<PListType>* pList = (*prevLocalPListArray)[i];
    PListType pListLength = (*prevLocalPListArray)[i]->size();
    if(pListLength > 1)
    {
        for (PListType k = 0; k < pListLength; k++)
        {
            //If pattern is past end of string stream then stop counting this pattern
            PListType index = (*pList)[k];
            if (index < file->fileStringSize)
            {
                uint_fast8_t indexIntoFile = (uint8_t)file->fileString[index];
                newPList[indexIntoFile].push_back(++index); 
                indexes[indexIntoFile]++;

            }
            else
            {
                totalTallyRemovedPatterns++;
            }
        }

        int listLength = 0;
        for (PListType k = 0; k < 256; k++)
        {
            if( indexes[k])
            {
                indexesToPush[listLength++] = k;
            }
        }

        for (PListType k = 0; k < listLength; k++)
        {
            int insert = indexes[indexesToPush[k]];
            if (insert >= minOccurrence)
            {
                PListType index = globalLocalPListArray->size();

                globalLocalPListArray->push_back(new vector<PListType>());
                vector<PListType>& bucket = newPList[indexesToPush[k]];
                (*globalLocalPListArray)[index]->insert((*globalLocalPListArray)[index]->end(), bucket.begin(), bucket.end());
                indexes[indexesToPush[k]] = 0;
                bucket.clear();
            }
            else // fewer than minOccurrence entries: discard, so the bucket cannot leak into the next iteration
            {
                totalTallyRemovedPatterns++;
                indexes[indexesToPush[k]] = 0;
                newPList[indexesToPush[k]].clear();
            }

        }
    }
    else
    {
        totalTallyRemovedPatterns++;
    }
    delete (*prevLocalPListArray)[i];
}
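
In this version the surviving buckets are still copied element by element into a freshly allocated vector via `insert`. One thing that could still be tried is handing the buffer over with `swap`, which is O(1) and leaves the bucket empty for reuse. A sketch (the `promoteBucket` helper and the `PListType` typedef are mine; at the call site it would replace the new/insert/clear sequence above, and `indexes[indexesToPush[k]]` would still need resetting to 0):

    #include <cstdint>
    #include <vector>
    using namespace std;
    using PListType = uint64_t; // assumption: PListType is an unsigned integer index type

    // Hand a surviving bucket's heap buffer to the global list in O(1)
    // instead of copying every element with insert().
    void promoteBucket(vector<PListType>& bucket,
                       vector<vector<PListType>*>& globalLocalPListArray)
    {
        vector<PListType>* promoted = new vector<PListType>();
        promoted->swap(bucket); // steal the buffer; bucket is left empty for reuse
        globalLocalPListArray.push_back(promoted);
    }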

Here are the benchmarks. I didn't think they would be readable in the comments, so I am placing them in the answer section. The percentages on the left show how much of the total time is spent on each line of code.

[Profiler screenshot: per-line time percentages]

  • Very simple, just create every vector. – harold Jul 12 '16 at 10:55
  • So instead of `vector<vector<PListType>*> newPList(256, NULL);` do ------> `vector<vector<PListType>*> newPList; for (PListType k = 0; k < 256; k++) { newPList.push_back(new vector<PListType>()); }` – PeterMorley Jul 12 '16 at 11:06
  • `vector<vector<PListType>>`. Eliminate unnecessary indirections. – bogdan Jul 12 '16 at 11:27
  • I removed the indirections and allocated 256 vector elements up front and as expected it takes up more memory and also makes it slower. I can't seem to win – PeterMorley Jul 12 '16 at 12:32
  • Anyway, the `cmp` being slow is not necessarily indicative of the comparison being slow. It's the first instruction that uses the result of the memory read, so it will be "blamed" for the latency of the read. – harold Jul 12 '16 at 12:46
  • That makes a lot of sense, but then I look at the previous if statement, `if ((*pList)[k] < file->fileStringSize)`, and that does not take up 1 percent of the time the latter if takes, and it is also reading from memory in the pList vector. The difference between the two if statements is that the first if is almost always true while the nested if is unpredictable. – PeterMorley Jul 12 '16 at 12:51
  • Yes, but another difference is that the read in that first condition is unlikely to ever trigger a cache miss. – harold Jul 12 '16 at 12:56
  • Makes it slower? I find that very strange. Can you post the updated code? – bogdan Jul 12 '16 at 13:11
  • OK, is that due to traversing the memory contiguously in the first if statement, while the indexing of the second if is random between 0 and 255? Implementations of vector vary in the amount of object memory needed, but say 32 bytes: that gives a contiguous block of 8K, and the L1 data cache is 32K. – PeterMorley Jul 12 '16 at 13:29
  • @bogdan I posted the code up above in the question section – PeterMorley Jul 12 '16 at 13:34
  • Wait, I think I analysed this wrong. Anyway, can you post the benchmark results? – harold Jul 12 '16 at 13:40
  • 1. In the first line of the main loop, you're now copying a vector, which you weren't doing previously; that wasn't an "unnecessary indirection" :-). Change that to `vector<PListType>& pList`. This may be what made the new version run more slowly (confirm, please). 2. `newPList` and its `reserve`s can be taken out of the main loop and reused between iterations. At the end of each iteration, instead of `vector<PListType>().swap(newPList[z]);`, just do `newPList[z].clear()`. Let us know how it goes. – bogdan Jul 12 '16 at 13:42
  • The changes you proposed made the program faster by 2 seconds!!! The issue is that the memory consumption is still much larger than the previous version's, by at least 3-fold. @harold I will push my benchmark results from Visual Studio, or did you have different benchmark results in mind? – PeterMorley Jul 12 '16 at 14:52
  • One benchmark I can reference from the profiler is that 72% of the project's time is spent on the `newPList[indexer].push_back((pList[k] + 1));` line, yet the calls made by push_back itself account for an accumulated total of only 6.9%. Where does the other 65 percent of processing time go??? I believe cache misses are still occurring on reading newPList[indexer]. I just ran out of my VTune 30-day trial license, so I can't access the assembly info at the moment. – PeterMorley Jul 12 '16 at 15:02
  • 2 seconds out of how many? :-) There's one thing I missed previously: `globalLocalPListArray.push_back(newPList[z]);`; this will do an array copy. It's worth trying `push_back(std::move(newPList[z]))`. This will do both a good thing (no array copy) and a not-so-good thing (steal the memory from `newPList[z]`, drastically reducing the benefit of reusing it), so I'm not sure which way it will go, but it's definitely worth trying. – bogdan Jul 12 '16 at 15:29
  • It went from 17 to 16 seconds after more testing. I also tried the move, which made it only a tad slower. Still unsure why the memory footprint has grown so much. – PeterMorley Jul 12 '16 at 15:58
  • Well, in your initial solution the memory usage of `newPList` was tailored to the configuration determined by each `prevLocalPListArray[i]`; now it doesn't shrink, but grows as necessary to accommodate different configurations. A threefold increase seems to indicate that, on average, each `prevLocalPListArray[i]` loads about one third of the `indexer`s heavily and leaves the rest only slightly populated, if at all. – bogdan Jul 12 '16 at 16:27
  • Anyway, not messing with the heap so much helps, so I would go further and remove the cleanup of `prevLocalPListArray[i]` at the end of each loop and replace it with just one `prevLocalPListArray.clear()` outside of the main loop. It will keep more memory around but possibly improve the layout; something to try. You also seem to have removed the initial `reserve`s from the setup of `newPList`; were they not making any difference? It may help to see what average (or maximum) value `earlyApproximation` used to have and use that for an initial `reserve` across `newPList`. – bogdan Jul 12 '16 at 16:36
  • The reserve did not make a significant difference in speed. I think the major bottleneck now is fetching memory for the `newPList[indexer].push_back(pList[k] + 1)`. – PeterMorley Jul 12 '16 at 17:35
  • It's possible I need to redesign to get the throughput I am looking for. – PeterMorley Jul 12 '16 at 18:00
  • If I presorted the file I could make the cache hits more predictable. I would just need to map each character in the file to its actual index in the file, then sort that mapping. Once it is sorted, I could loop through the list and only bring newPList[indexer] into cache once, until no more patterns for that byte existed in the list, and then move on to the next. That is one way I can think of to make this crap work better for cache locality (a sketch follows below). – PeterMorley Jul 12 '16 at 19:42
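
The presort idea from the last comment could be done as a counting sort: group all successor positions by the byte they hit, in one flat array, so each byte's pattern list ends up contiguous. A sketch (the `bucketByByte` helper and the `PListType` typedef are mine, for illustration; whether this beats 256 growing vectors would have to be measured, since the scatter pass still writes through 256 cursors, but no per-bucket reallocation happens and the output lists are contiguous):

    #include <cstdint>
    #include <vector>
    using namespace std;
    using PListType = uint64_t; // assumption: PListType is an unsigned integer index type

    // Counting-sort pass over one pList: afterwards, the slice
    // flat[offsets[b] .. offsets[b+1]) holds, contiguously, every successor
    // position whose preceding byte in the file was b.
    void bucketByByte(const vector<PListType>& pList,
                      const char* fileString, PListType fileStringSize,
                      vector<PListType>& flat, PListType (&offsets)[257])
    {
        PListType counts[256] = {0};
        for (size_t k = 0; k < pList.size(); k++)        // pass 1: histogram of target bytes
            if (pList[k] < fileStringSize)
                counts[(uint8_t)fileString[pList[k]]]++;

        offsets[0] = 0;                                   // pass 2: prefix sums give slice bounds
        for (int b = 0; b < 256; b++)
            offsets[b + 1] = offsets[b] + counts[b];

        flat.resize(offsets[256]);
        PListType cursor[256];
        for (int b = 0; b < 256; b++)
            cursor[b] = offsets[b];

        for (size_t k = 0; k < pList.size(); k++)         // pass 3: scatter into slices
            if (pList[k] < fileStringSize)
                flat[cursor[(uint8_t)fileString[pList[k]]]++] = pList[k] + 1;
    }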

0 Answers