4

My requirement is to insert strings into an array only if they are not already in the array. I also need to maintain fixed indexes, as this array will be used with another data structure in a one-to-one relation with each index. At present I am using the ArrayList class: I first check with the indexOf() method whether the string exists, and if not I add it to the list with the one-argument add() method. I am not familiar with the Java collection classes, so I could not work out how to implement this with a HashMap or something else (a trie, perhaps) that would make the loading process faster.

Does indexOf() in an ArrayList perform a sequential search? My goal is to reduce the processing time when loading the words into the array, while not inserting duplicates and while keeping the index of each element fixed. If a word being tested is already in the array, I need the index at which it was inserted, because that index is used to index into another structure and do some processing. Any suggestions to make this process better?

UPDATE

There is a word array: I have some documents from which I need to scan each word and find the unique words in the documents. I also need to count the duplicates; stated another way, I need to count the term frequency of each unique term occurring in the documents. I maintain an ArrayList<Integer[]> of term frequencies (number of terms x number of documents). I fetch one word and check whether it is in the word list with the indexOf() method. If it is not present in the word list, I insert the word into the list, allocate a new row in the 2D array (the ArrayList<Integer[]>) and set the count of that term in the 2D array to 1. If the word is already in the word array, I use the index of the word in the array to index into the corresponding row of the ArrayList<Integer[]> matrix, use the number of the document currently being processed to get to the cell, and increment the count.

My question is about reducing the indexOf() processing time that I currently spend on each word. I need to get the index of the word in the word array if it is already there, and if it is not there I need to insert it into the array dynamically.

Sample Code

import java.io.*;
import java.util.ArrayList;
import static java.lang.Math.log;


class DocumentRepresentation
{
  private String dirPath;
  private ArrayList<String> fileNameVector;
  private ArrayList<String> termVector;
  private ArrayList<Integer[]> tf; /* store it in natural 2d array */
  private Integer df[]; /* do normal 1d array */
  private Double idf[]; /* do normal 1d array */
  private Double tfIdf[][]; /* do normal 2d array */

  DocumentRepresentation (String dirPath)
  {
    this.dirPath = dirPath;
    fileNameVector = new ArrayList<String> ();
    termVector = new ArrayList<String> ();
    tf = new ArrayList<Integer[]> ();
  }

  /* Later separate the internal workings */
  public int start ()
  {
    /* Load the files, and populate the fileNameVector string */
    File fileDir = new File (dirPath);
    int fileCount = 0;
    int index;

    if (!fileDir.isDirectory ())
    {
      return -1;
    }

    File fileList[] = fileDir.listFiles ();

    for (int i=0; i<fileList.length; i++)
    {
      if (fileList[i].isFile ())
      {
        fileNameVector.add (fileList[i].getName ());
        //      System.out.print ("File Name " + (i + 1) + ": " + fileList[i].getName () + "\n");
      }
    }

    fileCount = fileNameVector.size ();
    for (int i=0;i<fileNameVector.size (); i++)
    {
      System.out.print ("Name " + (i+1) + ": " + fileNameVector.get (i) + "\n");
    }

    /* Bind the files with a buffered reader */
    BufferedReader fileReaderVector[] = new BufferedReader [fileCount];
    for (int i=0; i<fileCount; i++)
    {
      try
      {
        fileReaderVector[i] = new BufferedReader (new FileReader (fileList[i]));
      }
      /* Not handled */
      catch (FileNotFoundException e)
      {
        System.out.println (e);
      }
    }

    /* Scan the term frequencies in the tf 2d array */
    for (int i=0; i<fileCount; i++)
    {
      String line;

      try
      {
        /*** THIS IS THE PLACE OF MY QUESTION ***/
        while ((line = fileReaderVector[i].readLine ()) != null)
        {
          String words[] = line.split ("\\W+"); /* split on runs of non-word characters to avoid empty tokens in between */

          for (int j=0; j<words.length; j++)
          {
            if (words[j].isEmpty ())
            {
              continue; /* split () can still produce one leading empty token */
            }

            /* indexOf () scans termVector linearly: O(n) per word */
            if ((index = termVector.indexOf (words[j])) != -1)
            {
              tf.get (index)[i]++;
              /* increase the tf count */
            }
            else
            {
              termVector.add (words[j]);
              Integer temp[] = new Integer [fileCount];

              for (int k=0; k<fileCount; k++)
              {
                temp[k] = 0; /* autoboxing; no need for new Integer (0) */
              }
              temp[i] = 1;
              tf.add (temp);
              index = termVector.size () - 1; /* the word was just appended, so it is the last element; avoids a second linear scan */
            }

            System.out.println (words[j]);
          }
        }
      }
      /* Not handled */
      catch (IOException e)
      {
        System.out.println (e);
      }
    }

    return 0;
  }
}

class DocumentRepresentationTest
{
  public static void main (String args[])
  {
    DocumentRepresentation docSet = new DocumentRepresentation (args[0]);
    docSet.start ();
    System.out.print ("\n");
  }
}

Note: code is snipped to keep the focus on the question

phoxis
  • It's not really clear what you mean... an example would *really* help. (And yes, `indexOf` does a linear scan.) – Jon Skeet Feb 04 '12 at 16:07
  • @JonSkeet: I have edited the question; I think it will be easier to understand now. – phoxis Feb 04 '12 at 16:20
  • Some sample input and expected output would be helpful though... – Jon Skeet Feb 04 '12 at 16:22
  • The sample input is a set of text documents in a directory, and the sample output is the term frequency for each document (the number of times each term occurs in the document), the document frequency (the number of documents the term appears in), the inverse document frequency, etc. – phoxis Feb 04 '12 at 16:43
  • No, that's a *description* of the input. It also doesn't explain why you're interested in the *position* within the array or `ArrayList`. – Jon Skeet Feb 04 '12 at 16:46
  • This is because, if the word is present in the word array, I use its index into the `tf` array with `tf.get (j)[i]`, i.e. the jth term/word and the ith document, and then increment the count there. The above description is exactly what I require: the tf and df of the documents, from which I next calculate the idf and the tf-idf. – phoxis Feb 04 '12 at 16:50
  • It sounds like you may be using the wrong data structures to start with, to be honest. It sounds like you want a map from "word" to "map from document to count of occurrences within that document". – Jon Skeet Feb 04 '12 at 16:51
  • I want to store the term frequencies for each document, which is done in the `tf` object of type `ArrayList<Integer[]>`. The array of unique words/terms is separate; I am only using the index of a word in that array to index into the corresponding row of `tf`. – phoxis Feb 04 '12 at 17:01
  • @JonSkeet: Probably I am not conveying the idea well. I need each word related to its number of occurrences in all the documents (aggregated), and also to how many documents that word occurs in at least once. At present I store these in `tf` and `df` respectively. – phoxis Feb 04 '12 at 17:09
  • That sounds like you want lots of hash-based data structures, but probably not arrays. You may find [Guava](http://guava-libraries.googlecode.com)'s `MultiSet` useful too. – Jon Skeet Feb 04 '12 at 17:18
  • I need to check it out, thanks for the suggestion, support and the patience ;) – phoxis Feb 04 '12 at 17:25
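
A rough sketch of the word → (document → count) mapping suggested in the comments above, using only plain java.util classes; the names counts, word and docName are illustrative, not taken from the question's code:

import java.util.HashMap;
import java.util.Map;

// word -> (document name -> number of occurrences of the word in that document)
Map<String, Map<String, Integer>> counts = new HashMap<String, Map<String, Integer>>();

// for each `word' read from document `docName':
Map<String, Integer> perDoc = counts.get(word);
if (perDoc == null) {
  perDoc = new HashMap<String, Integer>();
  counts.put(word, perDoc);
}
Integer old = perDoc.get(docName);
perDoc.put(docName, old == null ? 1 : old + 1);

// counts.get(word).size() is then the document frequency of `word'.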

3 Answers

5

LinkedHashMap can satisfy all your requirements at once, with good performance characteristics.

The keys would be your items and the values would be the indices. If you insert the elements in the order of increasing indices, then iterating over the map would also return the elements in the order of increasing indices.

Here is some sample code:

LinkedHashMap<Item,Integer> map = new LinkedHashMap<Item,Integer>();

Get the item's index:

Integer index = map.get(item);
if (index != null) {
  // already in the map; use `index'
} else {
  // not in the map
}

Add item with the next index:

if (!map.containsKey(item)) {
  map.put(item, map.size());
}

Iterate over the elements in the order of increasing indices:

for (Entry<Item,Integer> e : map.entrySet()) {
  Item item = e.getKey();
  int index = e.getValue();
  ...
}

What this can't do efficiently is retrieve the element at a specific index, but my reading of your question indicates that you don't actually need that.
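
Applied to the term/index case in the question, a minimal sketch might look like this (the class and method names are made up for illustration):

import java.util.LinkedHashMap;
import java.util.Map;

class TermIndex {
  // word -> fixed index, assigned the first time the word is seen
  private final Map<String, Integer> map = new LinkedHashMap<String, Integer>();

  /** Returns the word's index, assigning the next free index if the word is new. */
  int indexOf(String word) {
    Integer index = map.get(word);
    if (index == null) {
      index = map.size();   // next free index
      map.put(word, index);
    }
    return index;
  }

  int size() {
    return map.size();
  }
}

In the question's inner loop, the O(n) termVector.indexOf() call then becomes an O(1) hash lookup; a new tf row still has to be appended the first time a word is seen (for example, when the map's size has grown after the call).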

NPE
  • In the processing phase I do not need to read the values. Once the other structures are populated, I can simply load the list of strings into a `String[]`. Therefore the above solution seems to be a good one. – phoxis Feb 04 '12 at 16:47
1

ArrayList.indexOf() does a linear search, so it's O(n).

If it really has to go into an ArrayList, you could maintain two collections, an ArrayList and a HashSet. Add and remove elements in both collections; before adding, call HashSet.contains() to see whether the element already exists.

Encapsulate the ArrayList and HashSet in a class of their own.
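
A rough sketch of such a wrapper, with made-up names, might be:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueWordList {
  private final List<String> list = new ArrayList<String>(); // keeps insertion order and fixed indexes
  private final Set<String> set = new HashSet<String>();     // gives O(1) duplicate checks

  /** Adds the word only if it is not already present; returns true if it was added. */
  boolean add(String word) {
    if (set.contains(word)) {
      return false;
    }
    set.add(word);
    list.add(word);
    return true;
  }

  String get(int index) {
    return list.get(index);
  }

  int size() {
    return list.size();
  }
}

Note that this only speeds up the duplicate check; to also look up an existing word's index in O(1), the HashSet would have to be replaced by a map from word to index, as in the LinkedHashMap answer above.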

Steve Kuo
0

If you want to keep the ArrayList, you can use a HashSet alongside it as a supporting structure, at the cost of roughly double the memory.

You can call HashSet.add(): if it returns true, you can also add the element to the ArrayList.
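
For example, a minimal sketch (`word' stands for the token currently being processed):

import java.util.ArrayList;
import java.util.HashSet;

ArrayList<String> termVector = new ArrayList<String>();
HashSet<String> seen = new HashSet<String>();

// HashSet.add() returns true only if the element was not already present,
// so one call both records the word and says whether to append it to the list.
if (seen.add(word)) {
  termVector.add(word);
}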

zambotn
  • I think a `HashSet` alone will not fulfil my index requirement, will it? – phoxis Feb 04 '12 at 16:41
  • A `HashSet` alone, no... when I said to use it as *support* I meant to use both: the `ArrayList` for the index order and the `HashSet` for the uniqueness check. My solution and Steve Kuo's are _almost the same_, except that in mine a single call both adds the element to the `HashSet` and tells you whether you have to add it to the `ArrayList` as well. – zambotn Feb 04 '12 at 18:56