
I am using the bisect module to search for and insert sha256 hashes into a list.

I have about 8,000,000 items to search and add to. They are stored in an SQLite database, and I want to read them into the list so I can search them more quickly.

The issue I have is that inserting items into the list, using bisect to find the right insertion point, is quite slow: it takes about 700 seconds to process all 8,000,000 items.
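Roughly, the insert loop looks like this (simplified; the names are just placeholders and the real hashes are raw 32-byte sha256 digests):

import bisect

hashes = []                          # kept sorted at all times
for digest in new_digests:           # ~8,000,000 raw sha256 digests
    i = bisect.bisect_left(hashes, digest)
    if i == len(hashes) or hashes[i] != digest:
        hashes.insert(i, digest)     # each insert shifts the tail of the list, O(n)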

By contrast, it only takes about 90 seconds to create an ascending index in the SQLite database, and then about 60 seconds to append the rows to the list sequentially.
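That sequential load is roughly this (the table and column names here are just placeholders for my actual schema):

import sqlite3

conn = sqlite3.connect("hashes.db")
cur = conn.cursor()
cur.execute("CREATE INDEX IF NOT EXISTS idx_hash ON hashes(hash)")   # ~90 seconds

hashes = []
for (digest,) in cur.execute("SELECT hash FROM hashes"):   # relying on the index for ordering
    hashes.append(str(digest))                             # ~60 seconds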

The trouble is that when I do this, the bisect search fails for some items, even though a sequential scan of the list shows the hash is actually there.

So it appears the order provided by the database isn't quite the same as the order bisect assumes when it calculates the index position.

Any ideas why this would be? It would be really useful to be able to pre-sort the list before relying on bisect.

UPDATE: Based on a comment, I should explain that I have a custom class that behaves like a list and packs the hashes into a bytearray to save memory. Here is my class:

class Hashlist():

    def __init__(self, hashLen):
        self.__hashLen = hashLen
        self.__hashlist = bytearray()

    def __len__(self):
        # number of hashes stored, not number of bytes
        # (bisect needs __len__ as well as __getitem__)
        return len(self.__hashlist) // self.__hashLen

    def __getitem__(self, index):
        if index >= len(self) or index < 0:
            raise IndexError("hash index out of range")
        return str(self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen])

    def __setitem__(self, index, data):
        # index == len(self) appends a new hash
        if index > len(self) or index < 0:
            raise IndexError("hash index out of range")
        if index == len(self):
            self.__hashlist.extend(data)
        else:
            self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data

    def insert(self, index, data):
        oldlen = len(self)
        if index > oldlen or index < 0:
            raise IndexError("trying to insert past next element")
        if index == oldlen:
            self.__hashlist.extend(data)
        else:
            # shift everything from index onwards up by one slot...
            orig_data = str(self.__hashlist[index*self.__hashLen:])
            self.__hashlist[(index + 1)*self.__hashLen:] = orig_data
            # ...then overwrite the slot at index with the new hash
            self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data
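And this is roughly how I use it with bisect for lookups (simplified):

import bisect

hashlist = Hashlist(32)            # 32 bytes per sha256 digest
# ... hashlist is filled in sorted order ...

def contains(hashlist, digest):
    # digest is a raw 32-byte sha256 digest as a str
    i = bisect.bisect_left(hashlist, digest)
    return i < len(hashlist) and hashlist[i] == digest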

Thanks

Dean

Deano123

1 Answer


If they're stored in a SQL database, having an index doesn't guarantee that results are returned in sorted order - you have to be explicit and use ORDER BY.
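For example (table and column names here are just placeholders):

# Ask SQLite for sorted output explicitly; the index makes this cheap,
# but without ORDER BY the row order is not guaranteed.
for (digest,) in cur.execute("SELECT hash FROM hashes ORDER BY hash"):
    hashes.append(str(digest))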

Also, if you're doing that many inserts I wouldn't use bisect at all - every insert into a list shifts everything after it, so millions of bisect-inserts are effectively O(n^2). Sort/merge instead:

# Add new to old and sort the whole lot...
old_hash_list.extend(new_hash_list)
old_hash_list.sort()

# Assuming new is already sorted, then create a new merged list
import heapq
old_and_new = list(heapq.merge(old_hash_list, sorted(new_hash_list)))
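If the merged result needs to go back into the packed structure, it's cheaper to rebuild it from the merged stream than to insert in place. A rough sketch (not tested), assuming the Hashlist class from the question:

merged = Hashlist(32)
for i, h in enumerate(heapq.merge(old_hash_list, sorted(new_hash_list))):
    merged[i] = h    # index == len(merged) here, so each assignment appends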
Jon Clements
  • I have created my own class that stores the hashes in a byte array. I created the `__getitem__` and `__setitem__` methods to make it behave like a list so I could use bisect. I didn't realise I needed to use `order by` as well if the index was already in order. I will give that a go. – Deano123 Jan 18 '14 at 12:22
  • @Deano123 you may want to explain you've got custom classes with custom methods and not just sha256 bytestrings then... Also, bisection requires types to be orderable, not just comparable... So, you're missing out a large chunk of the picture for people to be able to help you here... – Jon Clements Jan 18 '14 at 12:25
  • I'll bear that in mind. I will add some more detail now, but just to let you know, the `order by` change made it work perfectly. Thanks for the tip. – Deano123 Jan 18 '14 at 12:30
  • @Deano123 ahh okay - well, glad that solved your problem... You may want to look at an ordered key store if order of keys is important and you're doing these operations a lot. But glad the order by "sorted" it *cough* :) – Jon Clements Jan 18 '14 at 12:37
  • The biggest problem I had was memory usage using the standard Python objects: 8,000,000 32-byte keys and 8,000,000 32-byte values used 3.5 GB of RAM. My new structure uses about 1 GB and is more than fast enough for my needs (now I have "sorted the problem" with the database). Thanks again. – Deano123 Jan 18 '14 at 13:05