I have to write a program for an application in C++ which generates n-bit binary strings which need to be stored for further processing.
Question 1) Whenever a new string is generated, it needs to be checked whether it is already present in the database. If it is, it should not be added.
One possible way is to maintain a lookup table (an STL std::map, for example, or std::unordered_map for a true hash table) whose keys are the integer values of the binary strings. The problem is that n can be very large, sometimes 200 or more, so the value does not fit in any built-in integer type.
Also, some bits of the n-bit string may be unspecified. For example, if n = 4, a string may be of the form 01xx, where the lower two bits are unspecified. In this case 01xx actually represents four fully specified 4-bit strings: 0100, 0101, 0110 and 0111. Thus, if 01xx is in the database and 0110 is produced, then 0110 should not be stored in the database.
Can you suggest an efficient way to check this?
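To make the check concrete, here is a minimal sketch of the containment test described above. It assumes n never exceeds a compile-time bound (kMaxBits = 256 is my assumption, chosen to fit the "200+" case) and represents each string as two std::bitset values instead of one integer. The names Pattern, parsePattern and covers are hypothetical, introduced only for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

constexpr std::size_t kMaxBits = 256;  // assumed upper bound on n

// A string over {0, 1, x}: care[i] is 1 where bit i is specified,
// and value[i] holds that bit (value stays 0 where care is 0).
struct Pattern {
    std::bitset<kMaxBits> care;
    std::bitset<kMaxBits> value;
};

// Parse text such as "01xx"; the leftmost character is the MSB.
// Any character other than '0' or '1' is treated as unspecified.
Pattern parsePattern(const std::string& text) {
    assert(text.size() <= kMaxBits);
    Pattern p;
    const std::size_t n = text.size();
    for (std::size_t i = 0; i < n; ++i) {
        const char c = text[n - 1 - i];  // bit i, counting from the LSB
        if (c == '0' || c == '1') {
            p.care.set(i);
            p.value.set(i, c == '1');
        }
    }
    return p;
}

// True if every fully specified string represented by b
// is also represented by a.
bool covers(const Pattern& a, const Pattern& b) {
    // a may only constrain bits that b also constrains...
    if ((a.care & ~b.care).any()) return false;
    // ...and must agree with b on all of those bits.
    return ((a.value ^ b.value) & a.care).none();
}
```

With this, covers(parsePattern("01xx"), parsePattern("0110")) is true, so 0110 would be rejected, while the reverse call is false.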
Some ways I can think of are:
1) Search the entire database, comparing the newly generated string with each stored string one by one. This is the naive method and has complexity O(mn), where m is the number of strings currently in the database.
2) Store the strings in a binary-decision-tree-like structure. Would the lookup then be logarithmic?
3) For each bit position, store the strings in which that bit is specified. For example, for n = 4, if the database contains 01xx and 1xx1, this information can be stored as:
0 - 1xx1
1 -
2 - 01xx
3 - 01xx,1xx1
Here 0 means the LSB is specified and 3 means the MSB is specified. So if a new string, say 0101, is generated, I can search for it in either list 2 or list 3. This method seems expensive in memory usage.
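A rough sketch of option 3, under my own assumptions: strings are kept as text with the MSB first, and BitIndex, byBit, add and isCovered are hypothetical names. One thing the sketch makes visible is that a covering string can appear in only one of the lists, so the lookup below takes the union of the lists for every bit the new string specifies rather than picking a single list.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct BitIndex {
    std::size_t n;                                 // length of every string
    std::vector<std::string> patterns;             // stored strings, MSB first
    std::vector<std::vector<std::size_t>> byBit;   // byBit[i]: ids that specify bit i (0 = LSB)
    bool hasAllX = false;                          // a stored string with no specified bit

    explicit BitIndex(std::size_t bits) : n(bits), byBit(bits) {}

    void add(const std::string& p) {
        assert(p.size() == n);
        const std::size_t id = patterns.size();
        patterns.push_back(p);
        bool anySpecified = false;
        for (std::size_t i = 0; i < n; ++i) {
            if (p[n - 1 - i] != 'x') {
                byBit[i].push_back(id);
                anySpecified = true;
            }
        }
        if (!anySpecified) hasAllX = true;  // appears in no list, covers everything
    }

    // True if p represents every fully specified string that q represents.
    static bool covers(const std::string& p, const std::string& q) {
        for (std::size_t k = 0; k < p.size(); ++k)
            if (p[k] != 'x' && p[k] != q[k]) return false;
        return true;
    }

    bool isCovered(const std::string& q) const {
        if (hasAllX) return true;
        // Only strings that specify at least one of q's specified bits can cover q.
        std::set<std::size_t> candidates;
        for (std::size_t i = 0; i < n; ++i)
            if (q[n - 1 - i] != 'x')
                candidates.insert(byBit[i].begin(), byBit[i].end());
        for (std::size_t id : candidates)
            if (covers(patterns[id], q)) return true;
        return false;
    }
};
```

After adding 01xx and 1xx1 to a BitIndex of width 4, the four lists match the table above, and isCovered("0101") finds 01xx. The memory cost is one list entry per specified bit per stored string.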
Can you suggest some efficient ways to do this string search?
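For option 2, one shape the tree could take is a trie with three children per node (0, 1 and x), walked from the MSB down. This is only a sketch under the assumption that all strings have the same length n; TrieNode, trieInsert and isCovered are hypothetical names. Because a stored x matches anything, the lookup may have to follow two branches at a node, so its cost depends on how many branches are explored and is not simply logarithmic in m.

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <string>

// One node per prefix; children are indexed 0 -> '0', 1 -> '1', 2 -> 'x'.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 3> child;
};

int slot(char c) { return c == '0' ? 0 : c == '1' ? 1 : 2; }

// Insert a string such as "01xx" (MSB first).
void trieInsert(TrieNode& root, const std::string& pattern) {
    TrieNode* node = &root;
    for (char c : pattern) {
        auto& next = node->child[slot(c)];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
}

// True if some stored string covers `query` from position `pos` onward.
bool isCovered(const TrieNode& node, const std::string& query,
               std::size_t pos = 0) {
    if (pos == query.size()) return true;
    // A stored 'x' covers any query character, specified or not.
    if (node.child[2] && isCovered(*node.child[2], query, pos + 1))
        return true;
    // A stored 0/1 only covers the same specified bit in the query.
    const int s = slot(query[pos]);
    return s != 2 && node.child[s] &&
           isCovered(*node.child[s], query, pos + 1);
}
```

With 01xx and 1xx1 inserted, isCovered(root, "0110") is true and isCovered(root, "1010") is false.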
Question 2) In terms of a C++ implementation, what might be an efficient way to store these n-bit strings? Note that most of the time the majority of the bits in the n-bit string are unspecified. Thus, instead of reserving memory proportional to n, it makes more sense to store only the bits that are specified.
For example, n may be 10, but the generated string may be something like 1x1xxxxxxx. In this case it makes more sense to store something like {(9,1),(7,1)}. So should I store the strings as vectors of 2-tuples? In that case, what might be a good way to store the database of these strings?
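A minimal sketch of the vector-of-pairs idea, to show what the containment check looks like in that representation. The assumptions are mine: pairs are kept sorted by bit position, the database is a std::set of such vectors, and SparsePattern, makeSparse, covers and addIfNew are hypothetical names. With sorted pairs, "p covers q" reduces to "every (position, value) pair of p also occurs in q", which std::includes tests directly.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Only the specified bits, as (position, value) pairs sorted by position.
using SparsePattern = std::vector<std::pair<std::uint32_t, bool>>;

// Sort the pairs so that comparisons and std::includes work as intended.
SparsePattern makeSparse(SparsePattern bits) {
    std::sort(bits.begin(), bits.end());
    return bits;
}

// True if every pair of p also occurs in q, i.e. p is at least as general as q.
bool covers(const SparsePattern& p, const SparsePattern& q) {
    return std::includes(q.begin(), q.end(), p.begin(), p.end());
}

using Database = std::set<SparsePattern>;

// Insert q unless a stored string already covers it (linear scan).
// Stored strings that q itself covers are left in place in this sketch.
bool addIfNew(Database& db, const SparsePattern& q) {
    for (const SparsePattern& p : db)
        if (covers(p, q)) return false;
    db.insert(q);
    return true;
}
```

For instance, after storing makeSparse({{9, true}, {7, true}}) for 1x1xxxxxxx, an attempt to add 101xxxxxxx as makeSparse({{9, true}, {8, false}, {7, true}}) is rejected. The scan is still O(m) per insertion; the sparse form only reduces the per-string storage and comparison cost.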