I have two Strings that I am checking for specific words common to both. I already have the semantic scores, but they are irrelevant in this case because these words are technical abbreviations that deserve special emphasis. The more of these words the two strings share, the higher the score and the closer they are.

There are many ways of going about this. So far I have thought of two.

1) I create two ArrayLists holding the words of each string. I have another set of keywords, and I check whether each one exists in both ArrayLists. If it does, I add 1 to the score.

Then I can have multiple conditions like

 if (firstString.contains(keyWord) && secondString.contains(keyWord))
     score++;
 if (firstString.contains(anotherKeyWord) && secondString.contains(anotherKeyWord))
     score++;

2) Take the two strings and do a regex search using

 if (firstString.matches("(.*)someExpression(.*)") && secondString.matches("(.*)someExpression(.*)"))
     score++;
 if (firstString.matches("(.*)someOtherExpression(.*)") && secondString.matches("(.*)someOtherExpression(.*)"))
     score++;

Are there better ways of doing this? I am more inclined to use regex now; it looks like a pretty efficient way of doing this.

Basically, I am trying to cluster similar sentences by grouping those that contain abbreviations such as "ACLS", "ASHD", "CXR" (common medical terms), since I know these sentences are primarily about those issues. Then I use the semantic scores to group the sentences that have these words in them. Is this the wrong approach? :/

Thank you :)

2 Answers


If there are just a few words to be checked, I'd stick with String.contains() as it's readable and easy to implement.
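
As a minimal sketch of that suggestion (the class name ContainsScore, the method score, and the sample sentences are made up for illustration), the repeated if statements can be folded into one loop over the keyword list. Note that String.contains() does case-sensitive substring matching, so "CXR" would also match inside "CXRs".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not from the question: counts shared keywords.
class ContainsScore {
    // Adds 1 for every keyword that occurs (as a substring) in both strings.
    static int score(String first, String second, List<String> keywords) {
        int score = 0;
        for (String keyword : keywords) {
            if (first.contains(keyword) && second.contains(keyword)) {
                score++;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("ACLS", "ASHD", "CXR");
        // ASHD and CXR appear in both sentences, ACLS only in the second.
        System.out.println(score("ASHD noted, CXR ordered",
                                 "CXR clear, ACLS given, ASHD stable",
                                 keywords)); // prints 2
    }
}
```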

If there were many words to be checked, string search algorithms like Aho-Corasick or Rabin-Karp would be handy.
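
For the many-keywords case, here is a rough sketch of an Aho-Corasick automaton, written from the textbook description rather than taken from any library. The class name KeywordAutomaton and its methods are hypothetical. It scans each string once, regardless of how many keywords there are, and like contains() it matches substrings rather than whole words.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical class for illustration: a minimal Aho-Corasick matcher.
class KeywordAutomaton {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                   // longest proper suffix that is also a trie path
        final List<String> hits = new ArrayList<>(); // keywords ending at this node
    }

    private final Node root = new Node();

    KeywordAutomaton(List<String> keywords) {
        // Step 1: build a trie of all keywords.
        for (String kw : keywords) {
            Node node = root;
            for (char c : kw.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.hits.add(kw);
        }
        // Step 2: breadth-first pass that sets the failure links.
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(e.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(e.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure target.
                child.hits.addAll(child.fail.hits);
                queue.add(child);
            }
        }
    }

    // Returns every keyword that occurs somewhere in the text.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.hits);
        }
        return found;
    }

    // Score = number of keywords found in both strings.
    static int score(String first, String second, List<String> keywords) {
        KeywordAutomaton automaton = new KeywordAutomaton(keywords);
        Set<String> common = automaton.findAll(first);
        common.retainAll(automaton.findAll(second));
        return common.size();
    }
}
```

In practice the automaton would be built once and reused across all sentence pairs; the score method rebuilds it only to keep the sketch short.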

Danstahr
  • Although I am hardly concerned with space complexity here, won't creating these additional data structures (ArrayList and String[]) be more inefficient? I am not too concerned with ease of implementation. I'll figure it out ;) – awesomeniket May 27 '14 at 21:05
  • Thank you :) I'll look at the algorithm. – awesomeniket May 27 '14 at 21:29

This really depends on how efficient you want your algorithm to be. If I had to choose between the two approaches you currently suggest, I'd go with a simple contains() check. Regular expressions are good for matching patterns with variations. They are overkill for the exact-match scenario you have here. Even in the best case, the time needed to compile all the different regexes you end up with is going to make them slower than the simple contains() approach.

However, there are faster ways. For example, you can split each string into its words and add them to a hashset (basically a set that is implemented as a hashtable). Then you would use the intersect operation of the hashset (retainAll in Java, expected O(n)) to get the common words. The result is also a hashset. Then you check whether these common words can be found in your list of known words (which can also be a hashtable) and increase the scores. With this approach you skip all the string matching of your proposed approaches.
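
The steps above can be sketched roughly as follows. The class name SetOverlapScore, the tokenization rule (splitting on anything that is not a letter or digit), and the sample data are assumptions for illustration. Unlike contains(), this compares whole words, so "CXRs" does not count as "CXR".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration: scores by intersecting word sets.
class SetOverlapScore {
    // Assumed tokenization: split on runs of non-alphanumeric characters.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.split("[^A-Za-z0-9]+")));
        result.remove(""); // split can yield an empty leading token
        return result;
    }

    // Score = number of known keywords that appear as words in both strings.
    static int score(String first, String second, Set<String> keywords) {
        Set<String> common = words(first);
        common.retainAll(words(second)); // words present in both strings
        common.retainAll(keywords);      // keep only the known abbreviations
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("ACLS", "ASHD", "CXR"));
        System.out.println(score("ASHD noted, CXR ordered",
                                 "CXR clear; ACLS given; ASHD stable",
                                 keywords)); // prints 2
    }
}
```

Each retainAll call here walks one set and does a hash lookup in the other, which is where the expected linear cost comes from.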

Farhad Alizadeh Noori
  • Can I just use an ArrayList instead of a hashtable to store the words? I don't see why I need pairs here. I was thinking along the lines of: Collection firstString = new ArrayList(); words.add(*); and similarly for the other String. Then use firstString.retainAll(secondString); ... Is this less optimal? – awesomeniket May 28 '14 at 14:01
  • @awesomeniket You absolutely don't have to but it would be less optimal. The reason I brought up hashset is that a retainAll or an intersect operation on a hashset is of order O(n) and not O(n^2). However, if your strings don't contain more than let's say 20 words on average these two approaches wouldn't be that different. Feel free to use the more familiar approach but have the other more sophisticated approach in mind. – Farhad Alizadeh Noori May 28 '14 at 14:10
  • I didn't figure it would be of order O(n^2) :p .. I will use a hashset. Thank you again! :) – awesomeniket May 28 '14 at 14:17
  • I have another question. With a HashMap, what will the keys be if the values are the words? – awesomeniket May 28 '14 at 18:11
  • Well, a hashset is a group of keys. There is no value associated with a key. The only reason you would use a hashset is to be able to efficiently check whether a key exists in the said set. You can use the [HashSet](http://docs.oracle.com/javase/7/docs/api/java/util/HashSet.html). – Farhad Alizadeh Noori May 28 '14 at 18:17