
I have a String that I need to search for in a collection of Strings. I'll need to search for multiple representations of the required String (original representation, trimmed, UTF-8 encoded, non-ASCII characters encoded). The collection size will be on the order of thousands.

I'm trying to figure out which representation of the collection gives the best performance:

  1. ArrayList - iterate over the list and check whether any element matches any of the String's representations
  2. HashMap - check whether the map contains any of the String's representations
  3. Any other?
Mircea Badescu
  • Why `HashMap` and not `HashSet`? – Tom Sep 09 '16 at 12:38
  • @Tom.. A HashSet does use a HashMap to back its implementation – Saurav Sahu Sep 09 '16 at 12:41
  • If you use a List, you don't need to iterate over the array. Just use contains with the various forms of the string. – Jacob Sep 09 '16 at 12:41
  • For each required String, will there be a limited number of allowed representations? For example 5 or 7 representations, right? Please provide an example of a representation. Another question is how often this set will be searched. If there are 1000 Strings, how many queries do you expect overall? – SergeyS Sep 09 '16 at 12:41
  • Write your algorithm for searching first. That'll tell you what data-type to use. Optimize on that afterwards. – Balkrishna Rawool Sep 09 '16 at 12:41
  • @SauravSahu That doesn't mean that OP needs to handle the Map himself. Better argument? – Tom Sep 09 '16 at 12:44
  • - Use ArrayList only for iterations. - Use HashMap or HashSet for searches. - Use LinkedList if you need to remove elements at the beginning or middle of the collection. **In your case I suggest HashMap or HashSet because of the searches.** – Ady Junior Sep 09 '16 at 12:45
  • The collection will contain around 5000 elements; I have 4 representations of the String (original representation, trimmed, UTF-8 encoded, non-ASCII characters encoded). There will only be one query: to identify whether the collection contains any of those 4 representations. The current code stores the collection in an ArrayList and iterates over it at least once (if the first representation is found) and up to four times (trying to find one of the other representations). I'm trying to find a better way to do it. – Mircea Badescu Sep 09 '16 at 12:46
  • If you can show the sample output you need, where the data comes from, and what your approach is, then helping you will be easy; otherwise this debate can go on until the apocalypse happens ;) – bananas Sep 09 '16 at 12:53
  • Check this link out, it may be useful: http://stackoverflow.com/questions/18564744/fastest-way-to-find-strings-in-string-collection-that-begin-with-certain-chars – nihirus Sep 09 '16 at 13:56
  • "4 representations for the String(original representation, trimmed, UTF-8 encoded, non ASCII characters encoded)": Do you mean `java.lang.String`? That's always UTF-16, no matter how it got that way. – Tom Blodget Sep 09 '16 at 16:21

1 Answer


Generally speaking, a HashMap (or any other hash-table-based data structure, such as a HashSet) is strongly preferred for lookups. The reason is simple: these data structures support lookup in constant time on average, independent of collection size. But in your scenario (a single query against the collection), you probably will not gain any performance from using a HashMap instead of an ArrayList. Reasons:

  1. Putting the data into a HashMap takes time. Not a significant amount, but comparable to one full pass over the initial list, because every element has to be hashed and inserted.
  2. Your collection is pretty small: iterating over 5000 elements is a matter of a couple of milliseconds (or less). Since you need to search only once, you will not save much time on that.
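
If the up-to-four passes over the list are the concern, one option is to invert the lookup: put the handful of representations into a small HashSet and walk the list once. The following is only a minimal sketch under stated assumptions. The class name `RepresentationLookup` and the helpers `containsAny` and `representations` are hypothetical, and only the original and trimmed forms are generated, because the question does not say how the "UTF-8 encoded" and "non-ASCII characters encoded" forms are produced.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RepresentationLookup {

    // Single pass over the collection. The candidate set is tiny (about 4 entries),
    // so each element costs one hash lookup instead of up to 4 equals() calls.
    static boolean containsAny(List<String> collection, Set<String> candidates) {
        for (String element : collection) {
            if (candidates.contains(element)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: collects the representations to search for.
    // Only two of the four forms from the question are built here; the
    // encoded forms are left out because their rules are not specified.
    static Set<String> representations(String original) {
        Set<String> forms = new HashSet<>();
        forms.add(original);
        forms.add(original.trim());
        return forms;
    }

    public static void main(String[] args) {
        List<String> collection = Arrays.asList("alpha", "beta", "gamma");
        System.out.println(containsAny(collection, representations("  beta "))); // true
        System.out.println(containsAny(collection, representations("delta")));   // false
    }
}
```

This keeps the existing ArrayList and avoids the cost of building a large hash table for a single query. The standard library call `!Collections.disjoint(collection, candidates)` should give the same answer without the explicit loop.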
SergeyS