
I am trying to create a dictionary-based tagger running on a Hadoop cluster using Pig. Basically, for each document (quite large text documents, up to a few MB), it runs every word in every sentence against the dictionary to read the corresponding value.

There will be up to a few hundred Java programs (not threads) running in parallel, all using the dictionary file in read-only mode. The idea is to load the dictionary from a text file and build a Map to query against.

Question: what should I be prepared for? Is it even remotely logical to want to read a single file in a multiprogramming environment, or should I first copy the (relatively small) file for each instance of the program? And is a BufferedReader what I should use to read the file?
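For reference, here is roughly what I have in mind for the loading step; the file format (one tab-separated word/value pair per line) and the class name are just placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public final class DictionaryLoader {

    // Reads the dictionary once at startup; one "word<TAB>value" pair per line.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> dict = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    dict.put(parts[0], parts[1]);
                }
            }
        }
        return dict;
    }
}
```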

There is very little structured documentation on multiprogramming (compared to multithreading), so I am a bit afraid of running into a wall.

Note: you are only allowed to answer that my way of thinking is totally wrong if you provide me with a better way ;-)

ATN
    [This](http://stackoverflow.com/a/5800450/2071828) might be relevant to you. – Boris the Spider May 02 '13 at 17:11
  • It sounds like you're looking for something more real-time than Hadoop... – Eli May 06 '13 at 19:28
  • Not sure if this answers your question, but we had a couple of similar cases. In one, multiple Hadoop mappers traversed a large (fixed) binary tree; we handled that by memory-mapping the tree. The other case was analyzing domain names against a dictionary; there we used a Lucene index shared by multiple mappers via MMapDirectory (a sketch of that pattern follows). – satish Jul 02 '13 at 23:28
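For illustration, a minimal sketch of the Lucene pattern satish describes: opening a read-only index with MMapDirectory so that concurrent mapper JVMs on the same node share the OS page cache. The index path is a placeholder and the constructor shown is the Lucene 5.x API:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MmapIndexExample {
    public static void main(String[] args) throws Exception {
        // Memory-map the index files; the pages live in the OS page cache,
        // so every JVM on the node that opens the same index shares them.
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... run read-only lookups with the searcher ...
        }
    }
}
```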

1 Answer


I think your approach is fine. You should load your dictionary from the DistributedCache into memory and do the lookups against that in-memory dictionary (e.g., a HashMap).
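A rough sketch of what that could look like in a mapper's `setup()`, assuming the dictionary file was registered with the DistributedCache at job submission and uses a tab-separated word/tag layout (this is plain MapReduce; with Pig you would do the same thing inside a UDF, and all names here are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggerMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> dictionary = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The DistributedCache localizes the file on each node, so every task
        // reads its own node-local copy; there is no cross-process contention.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // assumed "word<TAB>tag" layout
                if (parts.length == 2) {
                    dictionary.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tag each token by looking it up in the in-memory dictionary.
        for (String token : value.toString().split("\\s+")) {
            String tag = dictionary.get(token);
            if (tag != null) {
                context.write(new Text(token), new Text(tag));
            }
        }
    }
}
```

Since the dictionary is loaded once per task in `setup()` and then only read, a plain HashMap is safe here; no synchronization is needed within a single mapper JVM.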

cabad