Java Inverted Index program

Question

I am writing an inverted index program in Java which returns the frequency of terms across multiple documents. I have been able to return the number of times a word appears in the entire collection, but I have not been able to return which documents the word appears in. This is the code I have so far:

import java.util.*;  // Provides TreeMap, Iterator, Scanner  
import java.io.*;    // Provides FileReader, FileNotFoundException  

public class Run
{
    public static void main(String[ ] args)
    {
        // **THIS CREATES A TREE MAP**  
        TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>( );

        Map[] mapArray = new Map[5];
        mapArray[0] = new HashMap<String, Integer>();

        readWordFile(frequencyData);
        printAllCounts(frequencyData);
    }


    public static int getCount(String word, TreeMap<String, Integer> frequencyData)
    {
        if (frequencyData.containsKey(word))
        {  // The word has occurred before, so get its count from the map  
            return frequencyData.get(word); // Auto-unboxed  
        }
        else
        {  // No occurrences of this word  
            return 0;
        }
    }


    public static void printAllCounts(TreeMap<String, Integer> frequencyData)
    {
        System.out.println("-----------------------------------------------");
        System.out.println("    Occurrences    Word");

        for(String word : frequencyData.keySet( ))
        {
            System.out.printf("%15d    %s\n", frequencyData.get(word), word);
        }

        System.out.println("-----------------------------------------------");
    }


    public static void readWordFile(TreeMap<String, Integer> frequencyData)
    {
        int total = 0;
        Scanner wordFile;
        String word;     // A word read from the file  
        Integer count;   // The number of occurrences of the word
        int counter = 0;
        int docs = 0;

        //**FOR LOOP TO READ THE DOCUMENTS**  
        for(int x=0; x<Docs.length; x++)
        { //start of for loop [*  

            try
            {
                wordFile = new Scanner(new FileReader(Docs[x]));
            }
            catch (FileNotFoundException e)
            {
                System.err.println(e);
                return;
            }

            while (wordFile.hasNext( ))
            {
                // Read the next word and get rid of the end-of-line marker if needed:  
                word = wordFile.next( );

                // This makes the Word lower case.  
                word = word.toLowerCase();

                word = word.replaceAll("[^a-zA-Z0-9\\s]", "");

                // Get the current count of this word, add one, and then store the new count:  
                count = getCount(word, frequencyData) + 1;
                frequencyData.put(word, count);
                total = total + count;
                counter++;
                docs = x + 1;

            }

        } //End of for loop *]  
        System.out.println("There are " + total + " terms in the collection.");
        System.out.println("There are " + counter + " unique terms in the collection.");
        System.out.println("There are " + docs + " documents in the collection.");

    }


    // Array of documents  
    static String Docs [] = {"words.txt", "words2.txt",};
}

Bobulous · Answer 1 · 2014-05-10T23:23:50.427

Instead of simply having a Map from word to count, create a Map from each word to a nested Map from document to count. In other words:

Map<String, Map<String, Integer>> wordToDocumentMap;

Then, inside your loop which records the counts, you want to use code which looks like this:

Map<String, Integer> documentToCountMap = wordToDocumentMap.get(currentWord);
if(documentToCountMap == null) {
    // This word has not been found anywhere before,
    // so create a Map to hold its per-document counts.
    documentToCountMap = new TreeMap<>();
    wordToDocumentMap.put(currentWord, documentToCountMap);
}
Integer currentCount = documentToCountMap.get(currentDocument);
if(currentCount == null) {
    // This word has not been found in this document before, so
    // set the initial count to zero.
    currentCount = 0;
}
documentToCountMap.put(currentDocument, currentCount + 1);

Now you're capturing the counts on a per-word and per-document basis.
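On Java 8 or later, the two null checks above can be condensed with `Map.computeIfAbsent` and `Map.merge`. The following is only a minimal sketch of that idea, not the answer's original code: the class name `PerDocumentCounts`, the helper names `record` and `buildIndex`, and the in-memory sample documents are assumptions made for illustration. It mirrors the question's lower-casing and punctuation stripping, and additionally skips tokens that become empty.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; names are illustrative, not from the original post.
final class PerDocumentCounts {

    /** Records one occurrence of word in document. */
    static void record(Map<String, Map<String, Integer>> wordToDocumentMap,
                       String word, String document) {
        wordToDocumentMap
                // Create the inner map the first time this word is seen.
                .computeIfAbsent(word, w -> new TreeMap<>())
                // Store 1 for a new document, otherwise add 1 to the old count.
                .merge(document, 1, Integer::sum);
    }

    /** Builds the index from in-memory text; map keys are document names. */
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String> documents) {
        Map<String, Map<String, Integer>> wordToDocumentMap = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (String token : doc.getValue().split("\\s+")) {
                // Lower-case and strip punctuation, as in the question.
                String word = token.toLowerCase().replaceAll("[^a-z0-9]", "");
                if (!word.isEmpty()) {
                    record(wordToDocumentMap, word, doc.getKey());
                }
            }
        }
        return wordToDocumentMap;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new TreeMap<>();
        documents.put("words.txt", "Apple pie, apple tart.");
        documents.put("words2.txt", "An apple a day.");
        System.out.println(buildIndex(documents));
    }
}
```

In the question's loop, the equivalent call would be `record(wordToDocumentMap, word, Docs[x])` in place of the `frequencyData.put` bookkeeping.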

Once you've completed the analysis and you want to print a summary of the results, you can run through the map like so:

for(Map.Entry<String, Map<String,Integer>> wordToDocument :
        wordToDocumentMap.entrySet()) {
    String currentWord = wordToDocument.getKey();
    Map<String, Integer> documentToWordCount = wordToDocument.getValue();
    for(Map.Entry<String, Integer> documentToFrequency :
            documentToWordCount.entrySet()) {
        String document = documentToFrequency.getKey();
        Integer wordCount = documentToFrequency.getValue();
        System.out.println("Word " + currentWord + " found " + wordCount +
                " times in document " + document);
    }
}

For an explanation of the for-each structure in Java, see this tutorial page.

For a good explanation of the features of the Map interface, including the entrySet method, see this tutorial page.

I am getting an error with this loop: `for (String wordToDocumentMaps : wordToDocumentMap.get(word)) { System.out.println(wordToDocumentMaps); }` – user3600008 May 06 '14 at 03:06
Updated the answer to include code which runs through the map to print out the frequency of each word in each document. – Bobulous May 10 '14 at 23:24

Alexey Malev · Answer 2 · 2014-05-03T21:52:52.520

Try adding a second map from each word to the set of document names it appears in, like this:

Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();

...
word = word.replaceAll("[^a-zA-Z0-9\\s]", ""); 

// Get the current count of this word, add one, and then store the new count:  
count = getCount(word, frequencyData) + 1;  
frequencyData.put(word, count);
Set<String> filenamesForWord = filenames.get(word);
if (filenamesForWord == null) {
    filenamesForWord = new HashSet<String>();
}
filenamesForWord.add(Docs[x]);
filenames.put(word, filenamesForWord);
total = total + count;
counter++;
docs = x + 1;

When you need the set of filenames in which a particular word was encountered, just get() it from the filenames map. Here is an example that prints, for each word, all the file names in which it was found:

public static void printAllCounts(TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames) {
    System.out.println("-----------------------------------------------");
    System.out.println("    Occurrences    Word");

    for(String word : frequencyData.keySet( ))
    {
        System.out.printf("%15d    %s\n", frequencyData.get(word), word);
        for (String filename : filenames.get(word)) {
            System.out.println(filename);
        } 
    }

    System.out.println("-----------------------------------------------");
}
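The same bookkeeping can be written more compactly on Java 8 or later, and the output can be put on one line per word in the `apple [words.txt, words2.txt]` style. This is a hedged sketch rather than part of the original answer: the class `DocumentsPerWord` and the methods `record` and `describe` are hypothetical names, and a `TreeSet` is assumed so the file names print in sorted order.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class; names are illustrative, not from the original post.
final class DocumentsPerWord {

    /** Records that word occurs in filename; repeated calls do not add duplicates. */
    static void record(Map<String, Set<String>> filenames, String word, String filename) {
        filenames.computeIfAbsent(word, w -> new TreeSet<>()).add(filename);
    }

    /** Formats one line such as "apple [words.txt, words2.txt]". */
    static String describe(Map<String, Set<String>> filenames, String word) {
        // Words that were never recorded get an empty set instead of null.
        Set<String> docs = filenames.getOrDefault(word, Collections.emptySet());
        return word + " " + docs;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> filenames = new TreeMap<>();
        record(filenames, "apple", "words2.txt");
        record(filenames, "apple", "words.txt");
        record(filenames, "apple", "words.txt"); // already present, ignored by the set
        record(filenames, "pear", "words2.txt");
        for (String word : filenames.keySet()) {
            System.out.println(describe(filenames, word));
        }
    }
}
```

Using `getOrDefault` also means looking up a word that was never indexed does not throw a `NullPointerException` in the printing loop.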

edited May 03 '14 at 21:52

answered May 03 '14 at 20:49

The problem with this approach is that you'll know that a word appears in a document, but not how many times it appears in that specific document. – Bobulous May 03 '14 at 20:57
@Arkanon The idea remains the same, you can modify maps to save whatever information you need. This post might be considered as an example. – Alexey Malev May 03 '14 at 21:06
@user3600008 Assuming you need to get a set of files which contains word stored in `String` variable `word`, you'll need: `Set filenamesForWord = filenames.get(word);`. – Alexey Malev May 03 '14 at 21:07
@user3600008 Sorry, your question is unclear. What exactly do you want to print? – Alexey Malev May 03 '14 at 21:22
@AlexeyMalev I want to print which documents a certain word is present in. For an example: apple [words.txt, words2.txt] – user3600008 May 03 '14 at 21:37
@user3600008 I updated the answer with this example. – Alexey Malev May 03 '14 at 21:39
@AlexeyMalev how can i implement that in my printAllCounts method? – user3600008 May 03 '14 at 21:50
@user3600008 Updated answer once again. – Alexey Malev May 03 '14 at 21:53

score 1 · Answer 3 · edited Feb 14 '18 at 18:52

I've added a Scanner to the program, so that searching for a word returns the documents the word occurs in. I also return how many times the word occurs, but I only get the total across all three documents. I want it to return how many times the word occurs in each document, so that I can calculate tf-idf. If you have a complete answer for the whole tf-idf calculation, I would appreciate it. Cheers

Here is my code:

import java.util.*;  // Provides TreeMap, Iterator, Scanner  
import java.io.*;    // Provides FileReader, FileNotFoundException  

public class test2
{

    public static void main(String[ ] args)
    {
        // **THIS CREATES A TREE MAP**  
        TreeMap<String, Integer> frequencyData = new TreeMap<String, Integer>();
        Map<String, Set<String>> filenames = new HashMap<String, Set<String>>();
        Map<String, Integer> countByWords = new HashMap<String, Integer>();

        Map[] mapArray = new Map[5];
        mapArray[0] = new HashMap<String, Integer>();

        readWordFile(countByWords, frequencyData, filenames);
        printAllCounts(countByWords, frequencyData, filenames);
    }


    public static int getCount(String word, TreeMap<String, Integer> frequencyData)
    {

        if (frequencyData.containsKey(word))
        {  // The word has occurred before, so get its count from the map  
            return frequencyData.get(word); // Auto-unboxed  
        }
        else
        {  // No occurrences of this word  
            return 0;
        }
    }



    public static void printAllCounts(  Map<String, Integer> countByWords, TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
    {
        System.out.println("-----------------------------------------------");
        System.out.print("Search for a word: ");

        String worde;
        int result = 0;
        Scanner input = new Scanner(System.in);
        worde=input.nextLine();

        if(!filenames.containsKey(worde)){
            System.out.println("The word does not exist");
        }

        else{
            for(String filename : filenames.get(worde)){
                System.out.println(filename);
                System.out.println(countByWords.get(worde));
            }
        }

        System.out.println("\n-----------------------------------------------");
    }


    public static void readWordFile(Map<String, Integer> countByWords ,TreeMap<String, Integer> frequencyData, Map<String, Set<String>> filenames)
    {
        Scanner wordFile;
        String word;     // A word read from the file  
        Integer count; // The number of occurrences of the word
        int counter = 0;

        int docs = 0;

        //**FOR LOOP TO READ THE DOCUMENTS**  
        for(int x=0; x<Docs.length; x++)
        { //start of for loop [*  

            try
            {
                wordFile = new Scanner(new FileReader(Docs[x]));
            }
            catch (FileNotFoundException e)
            {
                System.err.println(e);
                return;
            }

            while (wordFile.hasNext( ))
            {
                // Read the next word and get rid of the end-of-line marker if needed:  
                word = wordFile.next( );

                // This makes the Word lower case.  
                word = word.toLowerCase();

                word = word.replaceAll("[^a-zA-Z0-9\\s]", "");

                // Get the current count of this word, add one, and then store the new count:  
                count = countByWords.get(word);
                if(count != null){
                    countByWords.put(word, count + 1);
                }
                else{
                    countByWords.put(word, 1);
                }
                Set<String> filenamesForWord = filenames.get(word);
                if (filenamesForWord == null) {
                    filenamesForWord = new HashSet<String>();
                }

                filenamesForWord.add(Docs[x]);
                filenames.put(word, filenamesForWord);
                counter++;
                docs = x + 1;

            }

        } //End of for loop *]  
        System.out.println("There are " + counter + " terms in the collection.");
        System.out.println("There are " + docs + " documents in the collection.");

    }


    // Array of documents  
    static String Docs [] = {"Document1.txt", "Document2.txt", "Document3.txt"};

}
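
The code above keeps a single total per word in `countByWords`, which is why every document prints the same number. One way to get the per-document counts that tf-idf needs is a nested map from word to document to count, as suggested in Answer 1. The sketch below assumes that shape and one common weighting, raw term count multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the word. Other tf-idf variants exist; the class name `TfIdfSketch`, the method `tfIdf`, and the sample counts are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; names and sample data are illustrative only.
final class TfIdfSketch {

    /**
     * Returns tf * ln(N / df) for word in document, or 0.0 when the word
     * does not occur in that document. This is one tf-idf variant among several.
     */
    static double tfIdf(Map<String, Map<String, Integer>> index,
                        int totalDocuments, String word, String document) {
        Map<String, Integer> perDocument = index.get(word);
        if (perDocument == null || !perDocument.containsKey(document)) {
            return 0.0;
        }
        int tf = perDocument.get(document); // occurrences in this document
        int df = perDocument.size();        // number of documents containing the word
        return tf * Math.log((double) totalDocuments / df);
    }

    public static void main(String[] args) {
        // Hand-built per-document counts for three documents (assumed data).
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        Map<String, Integer> apple = new TreeMap<>();
        apple.put("Document1.txt", 2);
        apple.put("Document2.txt", 1);
        index.put("apple", apple);

        for (Map.Entry<String, Integer> e : index.get("apple").entrySet()) {
            System.out.println("apple in " + e.getKey() + ": count " + e.getValue()
                    + ", tf-idf " + tfIdf(index, 3, "apple", e.getKey()));
        }
    }
}
```

A word that appears in every document gets a weight of zero under this variant, because ln(N / N) is 0.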
