I'm creating a positional index in Java that stores, for each word, the document ID and the position of the word within that document. For example, suppose we have these three documents (each string in the array is treated as one document):

String[] docs = {"put new returns between paragraphs", "houses which are new in jersey", "home sales new rise in july"};

The positional index should list each word followed by its entries in the form docID : position, where position is the index of the word within that document.

Desired output:

put 0 : 0
new 0 : 1 , 1 : 3 , 2 : 2
returns 0 : 2
....

Here is what I have tried, but I'm unable to get the position of the word:

public static void main(String[] args) {
    String[] docs = { "put new returns between paragraphs", "houses which are new in jersey", "home sales new rise in july"};
    PositionalIndex pi = new PositionalIndex(docs);
    System.out.print(pi);

}

Positional Index

public PositionalIndex(String[] docs) {

    ArrayList<Integer> docList;
    docLists = new ArrayList<ArrayList<Integer>>();
    termList = new ArrayList<String>();
    myDocs = docs;

    for (int i = 0; i < myDocs.length; i++) {
        String[] tokens = myDocs[i].split(" ");
        for (String token : tokens) {
            if (!termList.contains(token)) {// a new term
                termList.add(token);
                docList = new ArrayList<Integer>();
                docList.add(i); // autoboxed; the Integer(int) constructor is deprecated
                System.out.println(docList);
                docLists.add(docList);
            } else {// an existing term

                int index = termList.indexOf(token);
                docList = docLists.get(index);
                if (!docList.contains(i)) {
                    docList.add(i);
                    // no set() needed: docList is the same list object already stored in docLists
                }
            }
        }
    }
}

Display

/**
 * Return the string representation of a positional index
 */
public String toString() {
    String matrixString = new String();
    ArrayList<Integer> docList;
    for (int i = 0; i < termList.size(); i++) {
        matrixString += String.format("%-15s", termList.get(i));
        docList = docLists.get(i);
        for (int j = 0; j < docList.size(); j++) {
            matrixString += docList.get(j) + "\t";
        }
        matrixString += "\n";
    }
    return matrixString;
}
shockwave

1 Answer

The problem is that you are using the enhanced for loop, which hides the indices.

Change the inner loop from

for (String token : tokens) {
    ...

to

for (int j=0; j<tokens.length;j++) {
    String token = tokens[j];
    ...

and j will give you the position of the word within the document.
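As a rough illustration of the indexed loop, the sketch below walks each document and records every token together with its document index and position. The class name TokenPositions and the helper describe are made up for this example and are not part of the question's code.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenPositions {
    // Hypothetical helper: returns one "token docId:position" string per token.
    static List<String> describe(String[] docs) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < docs.length; i++) {
            String[] tokens = docs[i].split(" ");
            // The indexed loop exposes j, the position of the token in document i.
            for (int j = 0; j < tokens.length; j++) {
                String token = tokens[j];
                out.add(token + " " + i + ":" + j);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] docs = {"put new returns between paragraphs",
                         "houses which are new in jersey"};
        for (String line : describe(docs)) {
            System.out.println(line);
        }
    }
}
```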

To store all the data you need in your PositionalIndex, I suggest replacing the ArrayLists you are currently using with a Map<String,Map<Integer,Integer>>, where the key of the outer Map is the term (word) and the value is a Map whose key is a document's index and whose value is the term's index within that document.
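A minimal sketch of that Map-based layout might look like the following. The class name PositionalIndexSketch and the methods build and format are assumptions made for illustration. Because the inner Map holds one value per document, this sketch keeps only the first position of a term in each document, and it uses LinkedHashMap so that terms print in the order they were first seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PositionalIndexSketch {
    // term -> (document index -> position of the term in that document)
    static Map<String, Map<Integer, Integer>> build(String[] docs) {
        Map<String, Map<Integer, Integer>> index = new LinkedHashMap<>();
        for (int i = 0; i < docs.length; i++) {
            String[] tokens = docs[i].split(" ");
            for (int j = 0; j < tokens.length; j++) {
                // Create the inner map on first sight of the term, then
                // record the position unless this document already has one.
                index.computeIfAbsent(tokens[j], t -> new LinkedHashMap<>())
                     .putIfAbsent(i, j);
            }
        }
        return index;
    }

    // Renders each term as "term docId : position , docId : position".
    static String format(Map<String, Map<Integer, Integer>> index) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<Integer, Integer>> e : index.entrySet()) {
            List<String> postings = new ArrayList<>();
            for (Map.Entry<Integer, Integer> p : e.getValue().entrySet()) {
                postings.add(p.getKey() + " : " + p.getValue());
            }
            sb.append(e.getKey()).append(' ')
              .append(String.join(" , ", postings)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] docs = {"put new returns between paragraphs",
                         "houses which are new in jersey",
                         "home sales new rise in july"};
        System.out.print(format(build(docs)));
    }
}
```

If a term can occur more than once in the same document and every occurrence matters, the inner value could be a List<Integer> of positions instead of a single Integer.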

Eran