I am trying to create a dictionary-based tagger running on a Hadoop cluster using Pig. Basically, for each document (fairly large text documents, up to a few MB), it looks up each word of each sentence in the dictionary and reads the corresponding value.
There will be up to a few hundred Java programs (separate processes, not threads) running in parallel, each using the dictionary file in read-only mode. The idea is to load the dictionary from text and create a Map to query against it.
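A minimal sketch of that idea, under assumptions that are not part of the actual setup: the dictionary is taken to be a UTF-8 text file with one `word<TAB>value` entry per line, and the class name `DictionaryLoader` is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the name and the file format are assumptions.
final class DictionaryLoader {

    // Reads "word<TAB>value" lines into an in-memory, unmodifiable map.
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> dictionary = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab <= 0) {
                    continue; // skip blank lines and lines without a key
                }
                dictionary.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return Collections.unmodifiableMap(dictionary);
    }

    // Convenience overload: opens the file for reading only.
    static Map<String, String> load(Path file) throws IOException {
        return load(Files.newBufferedReader(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: DictionaryLoader <dictionary-file>");
            return;
        }
        Map<String, String> dictionary = load(Paths.get(args[0]));
        System.out.println(dictionary.size() + " entries loaded");
    }
}
```

With this approach each process builds its own copy of the map, so total memory use grows with the number of processes running on a node.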
Question: what should I be prepared for? Is it at all sensible to have many programs read the same file in a multiprogramming environment, or should I first copy the (relatively small) file for each instance of the program? Is a BufferedReader something I should use while reading the file?
There is very little structured documentation on multiprogramming (compared to multithreading), so I am a bit afraid of running into a wall by doing this.
Note: you are only allowed to answer that my way of thinking is totally wrong if you provide me with a better way ;-)