I have a huge dictionary in memory, built by searching for similar sentences in a large Wikipedia corpus. It has the format shown below; when I wrote it to a file, the file was 150 MB. Before writing it to a file, I want to preprocess the dictionary and remove the sentences that belong to certain clusters (for example, if the cluster name is "sport_Soccer", I want to remove the sentences, which are the dictionary keys, that map to it). To do that I have to loop through this huge dictionary in memory, and the filtering takes a very long time. I read about mmap, and many people said it speeds up operations, so I tried to load my dictionary using mmap but got the error below. All the tutorials only show how to load a file using mmap, so is mmap restricted to files and not usable with data structures?

cluster_dict= { .. .. "sentences":"cluster name" .. .. .. }

import mmap

dd = {"the soccer match news will be telecasted live today": "sport_Soccer", "The stock markets crashed": "Trading_market"}
ss = mmap.mmap(dd.fileno(), 0)  # fails: a dict has no file descriptor

AttributeError: 'dict' object has no attribute 'fileno'

When I passed the dictionary directly instead, I got a different error:

ss = mmap.mmap(dd, 0)

TypeError: an integer is required (got type dict)

star
  • Data structures are already in memory; you mmap *files*, so that you can treat them as a sequence of bytes and have indexing operations translated into appropriate reads by the system. `dd` is not a file. – chepner Apr 08 '20 at 19:30
  • @chepner So we can only mmap a file and not a data structure, right? But wouldn't it be more time-consuming to store the dictionary as a file and then read it back using mmap for the search-and-filter task? I'm new to mmap; can you tell me the difference between normal read/write operations using a file reader/writer and mmap in Python? I'm unable to find this anywhere, even on Stack Overflow. – star Apr 08 '20 at 19:42
  • Yes, which is why you wouldn't do that. `mmap` is faster than accessing a file through the file system, not faster than accessing something that is already in memory. – chepner Apr 08 '20 at 19:46
  • As @dd already mentioned, `mmap` is for mapping *files*; thus, I presumed that you need to offload this to disk. If that's not the case, what is the design rationale for doing this with a dict? Your specific problem seems served much better by a data frame. – Prune Apr 08 '20 at 19:54
  • So normal read/write using reader and writer file objects in Python will be slower, but using mmap will be faster, is that right? Do you know how mmap makes processing faster behind the scenes? – star Apr 08 '20 at 19:55
  • @Prune Keeping data frames aside, I'm very much interested in implementing mmap for my problem in order to learn a new way of solving problems. – star Apr 08 '20 at 19:58
  • Then you need to choose a problem for which `mmap` is a viable solution. – Prune Apr 08 '20 at 20:00
  • I'm calling [X-Y problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) on this. Please update this with a more appropriate question. – Prune Apr 08 '20 at 20:37

1 Answer

`dict` is a Python data structure, not a file format. If you're trying to store and reload dict data, I recommend that you use the `json` module. Its `dump` and `load` functions do what I think you want: a reliable way to store and retrieve key-value data.
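A minimal sketch of that round trip, assuming the dictionary holds only string keys and string values as in the question's sample; the temporary file path is a placeholder for this example:

```python
import json
import os
import tempfile

dd = {
    "the soccer match news will be telecasted live today": "sport_Soccer",
    "The stock markets crashed": "Trading_market",
}

# Placeholder location for the example; any writable path would do.
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
try:
    # Store the dict as JSON text.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(dd, f)

    # Read it back into a new dict.
    with open(path, encoding="utf-8") as f:
        restored = json.load(f)
finally:
    os.remove(path)
```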

Prune
  • I need a faster way to loop for my search-and-filter task. Converting to JSON, storing it on disk, and reading it again will cause overhead again, right? I'm not sure whether mmap works on JSON. – star Apr 08 '20 at 19:44
  • @star why store it on disk? Why not keep it in memory? – juanpa.arrivillaga Apr 08 '20 at 19:49
  • @juanpa.arrivillaga That is what I want to do. My dictionary is already in memory, but how do I make the search-and-filter operation faster in memory? I read about mmap, but it supports only files, not data structures like a dictionary, for faster processing in memory. – star Apr 08 '20 at 19:53
  • `mmap` is a faster way to get things *into* memory, not a way to speed up working with things *already* in memory. – chepner Apr 08 '20 at 20:17
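
For the in-memory filtering the question describes, one possible approach (not something proposed in the thread itself) is a single pass with a dict comprehension. The `unwanted` set below is a hypothetical name introduced for this sketch; whether this is fast enough for the asker's data is untested.

```python
dd = {
    "the soccer match news will be telecasted live today": "sport_Soccer",
    "The stock markets crashed": "Trading_market",
}

# Hypothetical set of cluster names to drop; set membership tests are O(1) on average.
unwanted = {"sport_Soccer"}

# One pass over the items, keeping sentences whose cluster is not unwanted.
filtered = {
    sentence: cluster
    for sentence, cluster in dd.items()
    if cluster not in unwanted
}
```

This builds a new dictionary rather than deleting keys while iterating, which Python does not allow on the dictionary being iterated.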