
I am trying to create a dictionary-based tagger running on a Hadoop cluster using Pig. Basically, for each document (quite large text documents, up to a few MB), it runs every word in every sentence against the dictionary to read the corresponding value.

There will be up to a few hundred Java programs (not threads) running in parallel, all using the dictionary file in read-only mode. The idea is to load the dictionary from a text file and build a Map to query against.

Question: what should I be prepared for? Is it even remotely logical to want to read a single file in a multiprogramming environment, or should I first copy the (relatively small) file for each instance of the program? And is a BufferedReader what I should use to read the file?
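For reference, here is roughly what I have in mind for the loading step; the file format (one tab-separated word/value pair per line) and the class name are just placeholders:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public final class DictionaryLoader {

    // Reads the dictionary once at startup; one "word<TAB>value" pair per line.
    public static Map<String, String> load(String path) throws IOException {
        Map<String, String> dict = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    dict.put(parts[0], parts[1]);
                }
            }
        }
        return dict;
    }
}
```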

There is very little structured documentation on multiprogramming (compared to multithreading), so I am a bit afraid of running into a wall.

Note: you are only allowed to answer that my way of thinking is totally wrong if you provide me with a better way ;-)

ATN
    [This](http://stackoverflow.com/a/5800450/2071828) might be relevant to you. – Boris the Spider May 02 '13 at 17:11
  • It sounds like you're looking for something more real-time than Hadoop... – Eli May 06 '13 at 19:28
  • Not sure if this answers your question, but we had a couple of similar cases. In one, multiple Hadoop mappers traversed a large (fixed) binary tree; we handled that by memory-mapping the tree. The other case was analyzing domain names against a dictionary; there we used a Lucene index shared by multiple mappers via MMapDirectory (a sketch of that pattern follows). – satish Jul 02 '13 at 23:28
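For illustration, a minimal sketch of the Lucene pattern satish describes: opening a read-only index with MMapDirectory so that concurrent mapper JVMs on the same node share the OS page cache. The index path is a placeholder and the constructor shown is the Lucene 5.x API:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class MmapIndexExample {
    public static void main(String[] args) throws Exception {
        // Memory-map the index files; the pages live in the OS page cache,
        // so every JVM on the node that opens the same index shares them.
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... run read-only lookups with the searcher ...
        }
    }
}
```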

1 Answer


I think your approach is fine. You should load your dictionary from the DistributedCache into memory and do the lookups against that in-memory dictionary (e.g., a HashMap).
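A rough sketch of what that could look like in a mapper's `setup()`, assuming the dictionary file was registered with the DistributedCache at job submission and uses a tab-separated word/tag layout (this is plain MapReduce; with Pig you would do the same thing inside a UDF, and all names here are placeholders):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TaggerMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> dictionary = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The DistributedCache localizes the file on each node, so every task
        // reads its own node-local copy; there is no cross-process contention.
        Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(localFiles[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2); // assumed "word<TAB>tag" layout
                if (parts.length == 2) {
                    dictionary.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tag each token by looking it up in the in-memory dictionary.
        for (String token : value.toString().split("\\s+")) {
            String tag = dictionary.get(token);
            if (tag != null) {
                context.write(new Text(token), new Text(tag));
            }
        }
    }
}
```

Since the dictionary is loaded once per task in `setup()` and then only read, a plain HashMap is safe here; no synchronization is needed within a single mapper JVM.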

cabad