
I am trying to create a simple multithreaded dictionary/index from a group of documents that contain words. The dictionary is stored in a ConcurrentHashMap with String keys and Vector values. For each word in the dictionary there is an appearance list, which is a Vector of Tuple objects (a custom class; in my case a Tuple is a pair of two numbers).

Each thread takes one document as input, finds all the words in it and tries to update the ConcurrentHashMap. Note that two threads may try to update the same key of the map at the same time, each adding a new Tuple to its value. I only perform write operations on the Vector.

Below is the code for submitting new tasks. The dictionary passed in is the ConcurrentHashMap with String keys and Vector values:

public void run(Crawler crawler) throws InterruptedException {
        while (!crawler.getFinishedPages().isEmpty()) {
            this.INDEXING_SERVICE.submit(new IndexingTask(this.dictionary, sources, 
                                                          crawler.getFinishedPages().take()));
        }
        this.INDEXING_SERVICE.shutdown();
}

Below is the code of an indexing task:

public class IndexingTask implements Runnable {

    private ConcurrentHashMap<String, Vector<Tuple>> dictionary;
    private HtmlDocument document;

    public IndexingTask(ConcurrentHashMap<String, Vector<Tuple>> dictionary,
                        ConcurrentHashMap<Integer, String> sources, HtmlDocument document) {
        this.dictionary = dictionary;
        this.document = document;
        sources.putIfAbsent(document.getDocId(), document.getURL());
    }

    @Override
    public void run() {

        for (String word : document.getTerms()) {

            this.dictionary.computeIfAbsent(word, k -> new Vector<Tuple>())
                    .add(new Tuple(document.getDocId(), document.getWordFrequency(word)));

        }
    }
}

The code seems to be correct, but the dictionary is not updated properly: some words (keys) are missing from the dictionary, and some other keys have fewer items in their Vector than expected.

I did some debugging and found that, before a task terminates, it has calculated the correct keys and values. However, the shared dictionary that is passed to the task (see the first piece of code) is not updated correctly. Do you have any idea or suggestion?

  • I don't understand. From the little I can see, you have a dictionary per thread. Why does this even involve concurrency at all? – devoured elysium Dec 26 '19 at 21:32
  • @devouredelysium the call to `new IndexingTask(...)` passes in the common dictionary to use. – Jason Dec 26 '19 at 22:07
  • Fair enough. From the little you've shown it seems fine to me. – devoured elysium Dec 26 '19 at 22:14
  • Btw, you're probably aware of that, but when you call `.shutdown()` on an executor, the calling thread won't block unless you also use `awaitTermination()`: "This method does not wait for previously submitted tasks to complete execution. Use awaitTermination to do that." – devoured elysium Dec 26 '19 at 22:18
  • Plus, you should mark your fields as `final`. – devoured elysium Dec 26 '19 at 22:19
  • You have guessed correctly so far. finishedPages is a BlockingQueue of HtmlDocuments; it works correctly, as I have tested it. @devouredelysium thank you for pointing out awaitTermination(). Not using it might be what prevents the common dictionary from being updated correctly. I will try using awaitTermination(). – Erodotos Demetriou Dec 27 '19 at 02:05

1 Answer

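`this.INDEXING_SERVICE.shutdown()` does not wait for submitted tasks to finish, so when `run` returns, some `IndexingTask` instances may not have run yet and the dictionary may still be incomplete. I updated your code so that it waits for them: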

import java.util.Arrays;
import java.util.List;
import java.util.Vector;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class Tuple {
    private Integer key;
    private String value;

    public Tuple(Integer key, String value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public String toString() {
        return "(" + key + ", " + value + ")";
    }
}

class HtmlDocument {

    private int docId;
    private String URL;
    private List<String> terms;

    public int getDocId() {
        return docId;
    }

    public void setDocId(int docId) {
        this.docId = docId;
    }

    public String getURL() {
        return URL;
    }

    public void setURL(String URL) {
        this.URL = URL;
    }

    public List<String> getTerms() {
        return terms;
    }

    public void setTerms(List<String> terms) {
        this.terms = terms;
    }

    public String getWordFrequency(String word) {
        return "query";
    }
}

class IndexingTask implements Runnable {

    private ConcurrentHashMap<String, Vector<Tuple>> dictionary;
    private HtmlDocument document;

    public IndexingTask(ConcurrentHashMap<String, Vector<Tuple>> dictionary,
                        ConcurrentHashMap<Integer, String> sources, HtmlDocument document) {
        this.dictionary = dictionary;
        this.document = document;
        sources.putIfAbsent(document.getDocId(), document.getURL());
    }

    @Override
    public void run() {
        try {
            for (String word : document.getTerms()) {
                this.dictionary.computeIfAbsent(word, k -> new Vector<Tuple>())
                        .add(new Tuple(document.getDocId(), document.getWordFrequency(word)));
            }
        } finally {
            // always decrement, even if indexing throws, so the waiting loop cannot hang
            Crawler.RUNNING_TASKS.decrementAndGet();
        }
    }
}

class Crawler {

    protected BlockingQueue<HtmlDocument> finishedPages = new LinkedBlockingQueue<>();

    public static final AtomicInteger RUNNING_TASKS = new AtomicInteger();

    public BlockingQueue<HtmlDocument> getFinishedPages() {
        return finishedPages;
    }
}

public class ConcurrentHashMapExample {

    private ConcurrentHashMap<Integer, String> sources = new ConcurrentHashMap<>();
    private ConcurrentHashMap<String, Vector<Tuple>> dictionary = new ConcurrentHashMap<>();

    private static final ExecutorService INDEXING_SERVICE = Executors.newSingleThreadExecutor();

    public void run(Crawler crawler) throws InterruptedException {
        while (!crawler.getFinishedPages().isEmpty()) {
            Crawler.RUNNING_TASKS.incrementAndGet();
            INDEXING_SERVICE.submit(new IndexingTask(this.dictionary, sources,
                    crawler.getFinishedPages().take()));
        }
        // shutdown() does not wait for submitted tasks to finish,
        // so poll until every IndexingTask has completed before shutting down
        while (Crawler.RUNNING_TASKS.get() > 0)
            Thread.sleep(3);
        INDEXING_SERVICE.shutdown();
    }

    public ConcurrentHashMap<Integer, String> getSources() {
        return sources;
    }

    public ConcurrentHashMap<String, Vector<Tuple>> getDictionary() {
        return dictionary;
    }

    public static void main(String[] args) throws Exception {
        ConcurrentHashMapExample example = new ConcurrentHashMapExample();
        Crawler crawler = new Crawler();
        HtmlDocument document = new HtmlDocument();
        document.setDocId(1);
        document.setURL("http://127.0.0.1/abc");
        document.setTerms(Arrays.asList("hello", "world"));
        crawler.getFinishedPages().add(document);
        example.run(crawler);
        System.out.println("source: " + example.getSources());
        System.out.println("dictionary: " + example.getDictionary());
    }

}

output:

source: {1=http://127.0.0.1/abc}
dictionary: {world=[(1, query)], hello=[(1, query)]}
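As suggested in the comments on the question, the counter and the sleep loop can probably be replaced by `shutdown()` followed by `awaitTermination()`. The following is only a minimal sketch of that idea, not a drop-in replacement: the class `AwaitTerminationSketch`, the helper `index`, the use of plain `Integer` document ids instead of `Tuple`, and the one-minute timeout are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Vector;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class AwaitTerminationSketch {

    // Hypothetical helper: indexes each document (a list of terms) on a thread
    // pool and returns only after every submitted task has finished.
    static Map<String, Vector<Integer>> index(List<List<String>> documents, int threads) {
        ConcurrentHashMap<String, Vector<Integer>> dictionary = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        for (int i = 0; i < documents.size(); i++) {
            final int docId = i; // stand-in for the Tuple of the original code
            final List<String> terms = documents.get(i);
            pool.submit(() -> {
                for (String word : terms) {
                    dictionary.computeIfAbsent(word, k -> new Vector<>()).add(docId);
                }
            });
        }

        // Stop accepting new tasks; tasks that were already submitted still run.
        pool.shutdown();
        try {
            // Block until all tasks are done (the timeout is an arbitrary choice).
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Map<String, Vector<Integer>> dictionary = index(
                Arrays.asList(Arrays.asList("hello", "world"), Arrays.asList("hello")), 4);
        System.out.println(dictionary.get("hello").size()); // prints 2
    }
}
```

Because the caller blocks in `awaitTermination()`, the dictionary is read only after all indexing tasks have completed, without a shared counter.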

I think that, for your use case, you should use the producer-consumer design pattern.
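One reason this pattern may matter here: a loop of the form `while (!queue.isEmpty()) queue.take()` cannot tell "the producer is finished" apart from "the queue is momentarily empty", so it can exit while pages are still being produced. A common way to signal the end explicitly is a sentinel element (a "poison pill"). The sketch below is an illustration under assumptions, not your actual crawler: `ProducerConsumerSketch`, `END_OF_STREAM` and `run` are hypothetical names, pages are modelled as lists of terms, and there is a single producer and a single consumer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Vector;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

class ProducerConsumerSketch {

    // Hypothetical sentinel ("poison pill"); it is compared by identity,
    // so a real page that happens to be empty is not mistaken for it.
    static final List<String> END_OF_STREAM = new ArrayList<>();

    static Map<String, Vector<Integer>> run(List<List<String>> pages) {
        BlockingQueue<List<String>> finishedPages = new LinkedBlockingQueue<>();
        ConcurrentHashMap<String, Vector<Integer>> dictionary = new ConcurrentHashMap<>();

        // Producer: stands in for the crawler threads.
        Thread producer = new Thread(() -> {
            for (List<String> page : pages) {
                finishedPages.add(page);
            }
            finishedPages.add(END_OF_STREAM); // signal that no more pages will come
        });

        // Consumer: stands in for the indexing side.
        Thread consumer = new Thread(() -> {
            try {
                int docId = 0;
                while (true) {
                    List<String> page = finishedPages.take(); // blocks while the queue is empty
                    if (page == END_OF_STREAM) {
                        break;
                    }
                    for (String word : page) {
                        dictionary.computeIfAbsent(word, k -> new Vector<>()).add(docId);
                    }
                    docId++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join(); // the dictionary is complete once the consumer has exited
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        Map<String, Vector<Integer>> dictionary = run(
                Arrays.asList(Arrays.asList("hello", "world"), Arrays.asList("hello")));
        System.out.println(dictionary.get("hello")); // prints [0, 1]
    }
}
```

With several consumers, one sentinel per consumer would be needed, and with several producers the sentinel should only be enqueued after the last producer has finished.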

dung ta van
  • Your solution helped me a lot. Letting the thread sleep for 3 ms got more entries into my dictionary. It is still not complete, but there is a significant difference. The producer-consumer model is already used to some extent: I have another group of threads in Crawler which crawl the web and put HtmlDocuments in the blocking queue so that the indexing threads can consume them. I appreciate your help!! – Erodotos Demetriou Dec 27 '19 at 10:41