2

I am trying to keep a list of top k elements of a large set of tuples. Since keeping it in the memory is impossible, I want to use a fixed size list to keep only the top k values (with the keys). I have tried to use min heap but python's heap is terrible as it lets non unique keys to be inserted. That is a huge problem. So I thought I can use a sorted list/dict instead (tuples with unique keys). Using the sketch function I retrieve the number of counts that substring has appeared in the whole text (O(1) time)). I am beginning to think I do something wrong with the loops or pops and assignments, because the minheap also has a similar problem where only the top k shows up in the 25 size list, and the rest are rather low counts (when it is in fact higher )

for line in lines[1::4]:

    startIdx = 0
    while startIdx + k <= (len(line)-k):
        kmer = line[startIdx:(startIdx+k)]
        count = randint(1, 250)  

        if count > 2:
           if len(tdict.keys()) < topcount: 
                tdict[km] = count
           else:
                kMin = (sorted(tdict,reverse = False, key=lambda x: x[1]))
                if count > tdict[kMin[0]]:  
                     topkmerdict.pop(kMin[0])
                     topkmerdict[km] = count
        startIdx += 1

    linesProcessed += 1
dusa
  • 840
  • 3
  • 14
  • 31
  • It is hard to solve your problem, because it isn't entirely clear. Your code refers to variables `sketch`, and `topkmerdict` that are external to the code. Please read up on writing a [mcve] and edit the question accordingly. Having the appropriate inputs and both the expected and actual outputs would help both you and everyone else debug your problem. I know you said you're reading more than you can keep in memory, but you should be able to test the algorithm with a smaller dataset. Pass that minimal dataset on to us, with expected outputs, and then we can help you fix your problem. – Scott Mermelstein Jul 28 '17 at 17:41
  • @ScottMermelstein Thanks, I have edited to add a sample file, file read parts of the code and changed sketch function to return a random number which functionality acts similar (returns count in int). – dusa Jul 28 '17 at 17:54
  • have you looked at heapq it might do all you need? – paddyg Jul 28 '17 at 18:09
  • @paddyg I did, I also have a version where I do min heap, but the problem is, unlike a set or dict, heapq lets non-unique items be added over and over again (since my keys are substrings I have tons of them). – dusa Jul 28 '17 at 18:13

1 Answers1

1

Please try changing the line:

  kmerMin = (sorted(topkmerdict,reverse = False, key=lambda x: x[1]))

to:

  kmerMin = (sorted(topkmerdict,reverse = False)

The previous line is only sorting on the second character of the string key values.