
I have to write a program for an application in C++ which generates n-bit binary strings which need to be stored for further processing.

Question 1) Whenever a new string is generated, it needs to be checked against the database. If it is already present, it should not be added.

One possible way is to maintain a lookup table (an STL map or unordered_map, for example) where the keys are the integer values of the binary strings. The problem is that n can be very large, sometimes 200 or more, so the value does not fit in any built-in integer type.
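For fully specified strings, a large n does not rule out hashing: `std::bitset` has a `std::hash` specialization, so it can be used directly as a key. A minimal sketch, assuming n never exceeds a compile-time bound (`kMaxBits` and `insert_if_new` are made-up names for illustration):

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Assumption: 256 is an upper bound on n, chosen only for illustration.
constexpr std::size_t kMaxBits = 256;
using Bits = std::bitset<kMaxBits>;

// Hypothetical helper: inserts a fully specified string such as "0110".
// Returns true if the string was new, false if it was already stored.
bool insert_if_new(std::unordered_set<Bits>& db, const std::string& s) {
    assert(s.size() <= kMaxBits);
    // The bitset constructor throws std::invalid_argument on any
    // character other than '0' or '1'.
    return db.insert(Bits(s)).second;
}
```

This only handles exact duplicates. It does not know about unspecified bits.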

Also, sometimes some bits of the n-bit string are unspecified. For example, if n = 4, a string may be of the form 01xx, where the lower two bits are unspecified. In this case, 01xx actually represents four fully specified 4-bit strings: 0100, 0101, 0110 and 0111. Thus, if 01xx is in the database and 0110 is produced, then 0110 should not be stored in the database.
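One way to make this "is it already represented" test concrete is to keep two bitsets per string: one marking which bits are specified and one holding their values. This is only a sketch under assumptions: `Pattern`, `parse`, `covers` and the 256-bit bound are names and choices introduced here, not part of the original problem.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kMaxBits = 256;  // assumed upper bound on n

struct Pattern {
    std::bitset<kMaxBits> care;   // 1 where the bit is specified
    std::bitset<kMaxBits> value;  // the bit itself; 0 where unspecified
};

// Hypothetical parser: "01xx" -> care = 1100, value = 0100 (MSB first).
// Any character other than '0' or '1' is treated as unspecified.
Pattern parse(const std::string& s) {
    assert(s.size() <= kMaxBits);
    Pattern p;
    const std::size_t n = s.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char c = s[i];
        const std::size_t bit = n - 1 - i;  // leftmost character is the MSB
        if (c == '0' || c == '1') {
            p.care.set(bit);
            p.value.set(bit, c == '1');
        }
    }
    return p;
}

// True if every string represented by b is also represented by a.
bool covers(const Pattern& a, const Pattern& b) {
    // a may only specify bits that b also specifies...
    const bool care_subset = (a.care & ~b.care).none();
    // ...and the two must agree on every bit that a specifies.
    const bool values_agree = ((a.value ^ b.value) & a.care).none();
    return care_subset && values_agree;
}
```

With this, `covers(parse("01xx"), parse("0110"))` is true, so 0110 would be rejected. Each comparison costs a few word-wide operations rather than a character-by-character loop.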

Can you suggest an efficient way to check this?

Some approaches I can think of:

1) Search the entire database and compare the newly generated string with the stored strings one by one. This is the naive method and has a complexity of O(mn), where m is the number of strings currently in the database.

2) Store the strings in a binary-decision-tree type of structure. Would the lookup then be logarithmic?

3) For each bit position in the string, store the strings in which its value is specified. For example, for n = 4, if the database contains 01xx and 1xx1, then this information can be stored as:

0 - 1xx1

1 -

2 - 01xx

3 - 01xx,1xx1

Here index 0 is the LSB and index 3 is the MSB, and each entry lists the strings in which that bit is specified. So if a new string, say 0101, is generated, I can search for it under either 2 or 3. This method seems expensive in terms of memory usage.
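Option 2 could be sketched as a trie over the three symbols '0', '1' and 'x'. This is an illustration under assumptions, not a tuned implementation: `PatternTrie` and its members are names invented here, and all strings are assumed to have the same length n. Note that the lookup is not logarithmic in general. It follows one path per stored alternative and can branch wherever a stored string has an 'x', so its cost is bounded by the number of trie nodes visited.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>

// Trie over the alphabet {'0', '1', 'x'}; all stored strings have length n.
class PatternTrie {
public:
    void insert(const std::string& s) {
        Node* node = &root_;
        for (char c : s) {
            auto& child = node->child[index(c)];
            if (!child) child = std::make_unique<Node>();
            node = child.get();
        }
        node->terminal = true;
    }

    // True if some stored pattern covers every string that q represents.
    bool covered(const std::string& q) const { return covered(&root_, q, 0); }

private:
    struct Node {
        std::array<std::unique_ptr<Node>, 3> child;  // '0', '1', 'x'
        bool terminal = false;
    };

    static std::size_t index(char c) {
        assert(c == '0' || c == '1' || c == 'x');
        return c == '0' ? 0 : c == '1' ? 1 : 2;
    }

    static bool covered(const Node* node, const std::string& q, std::size_t i) {
        if (!node) return false;
        if (i == q.size()) return node->terminal;
        // A stored 'x' covers any query symbol at this position.
        if (covered(node->child[2].get(), q, i + 1)) return true;
        // A stored '0' or '1' covers only the identical query symbol.
        if (q[i] != 'x') return covered(node->child[index(q[i])].get(), q, i + 1);
        return false;
    }

    Node root_;
};
```

After inserting 01xx and 1xx1, `covered("0110")` returns true and `covered("0010")` returns false. The sketch does not remove stored strings that become redundant when a more general string is inserted later.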

Can you suggest some efficient ways to do this string search?

Question 2) In terms of a C++ implementation, what might be an efficient way to store these n-bit strings? Note that most of the time the majority of the bits in the n-bit string are unspecified. Thus, instead of reserving memory proportional to n, it makes more sense to store only the bits which are specified.

For example, n may be 10, but the generated string may be something like 1x1xxxxxxx. In this case it makes more sense to store something like {(9,1),(7,1)}. So should I store the strings as vectors of 2-tuples? In that case, what might be a good way to store the database of these strings?
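A rough sketch of the vector-of-pairs idea, assuming the pairs are kept sorted by bit position so that two strings can be compared with `std::includes`. The names `SparsePattern`, `to_sparse`, `covers` and `insert_if_new` are made up for this example, and the database here is simply a `std::set` with a linear coverage scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (bit position, bit value) for each specified bit, sorted by position.
using SparsePattern = std::vector<std::pair<std::size_t, bool>>;

// Hypothetical parser: "1x1xxxxxxx" -> {(7,1), (9,1)}; the MSB is position n-1.
SparsePattern to_sparse(const std::string& s) {
    SparsePattern p;
    const std::size_t n = s.size();
    for (std::size_t i = n; i-- > 0;) {  // walk LSB to MSB so positions ascend
        const char c = s[i];
        assert(c == '0' || c == '1' || c == 'x');
        if (c != 'x') p.emplace_back(n - 1 - i, c == '1');
    }
    return p;
}

// a covers b when every (position, value) pair of a also appears in b.
bool covers(const SparsePattern& a, const SparsePattern& b) {
    return std::includes(b.begin(), b.end(), a.begin(), a.end());
}

// The ordered set rejects exact duplicates; the linear scan additionally
// rejects anything that a stored pattern already covers.
bool insert_if_new(std::set<SparsePattern>& db, const SparsePattern& p) {
    for (const auto& stored : db)
        if (covers(stored, p)) return false;
    return db.insert(p).second;
}
```

Memory is proportional to the number of specified bits rather than to n. The scan is still the O(m) approach from option 1, so this sketch addresses storage only, not lookup speed.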

ameyask
  • For the storage... sounds like you have your bits in tri-states. Why don't you use that fact and store your numbers as such, in trinary (base 3) format? Converting between 2-base and 3-base is trivial enough. – YePhIcK Oct 04 '15 at 06:06
  • Okay, so should I use an array of size n? Or a vector? (I am always confused about which one to use.) Especially since n is fixed. Secondly, how do I store the database of these strings? – ameyask Oct 04 '15 at 06:10
  • What have you tried so far? Where did you get stuck? (But: I would look at a hash table.) – Davislor Oct 04 '15 at 06:13
  • You may want to look at the *trie* data structure. – n. m. could be an AI Oct 04 '15 at 08:16
  • @ameyask86 Using a vector will be fine, as std::vector<bool> is a specialization that typically packs each value into a single bit instead of the full byte a bool normally occupies; for more information see http://www.cplusplus.com/reference/vector/vector-bool/ . I would advise against plain arrays of bool, as they are not packed this way. Probably the fastest, most user-friendly approach would be to use a bitset, which acts like an array and is purpose-built for bit manipulation. Also, it sounds like you are trying to build a compression library, in which case why not use a tried and tested library such as zlib? – silvergasp Oct 04 '15 at 10:29
  • How much of the data is unspecified? If the unspecified data corresponds to massive sub-trees, then a data structure would probably win. Is n a variable? Is the structure only capable of storing n bits, or could it be m (<= n) bits? My gut feeling is some form of custom data structure, based on work from compression software for storing known prefixes. What is the maximum number of unspecified bits (o), and how does it relate to n? – mksteve Oct 04 '15 at 12:01

0 Answers