
For every pair of words, I need the number of lines that contain both words, for which I have written the following code. The input file contains 1,000 lines and about 4,000 words, and the program takes about 4 hours. Is there a Java library that can do this faster? Could I implement this with Apache Lucene or Stanford CoreNLP to reduce the run time?

ArrayList<String> reviews = new ArrayList<String>();
ArrayList<String> terms = new ArrayList<String>();
Map<String, Double> pij = new HashMap<String, Double>();

BufferedReader br = null;
FileReader fr = null;
try {
    fr = new FileReader("src/reviews-preprocessing.txt");
    br = new BufferedReader(fr);
    String line;
    while ((line = br.readLine()) != null) {
        for (String term : line.split(" ")) {
            if (!terms.contains(term))
                terms.add(term);
        }
        reviews.add(line);
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    try {
        if (br != null)
            br.close();
        if (fr != null)
            fr.close();
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

long Count = reviews.size();
for (String term_i : terms) {
    for (String term_j : terms) {
        if (!term_i.equals(term_j)) {
            double p = (double) reviews.parallelStream()
                .filter(s -> s.contains(term_i) && s.contains(term_j))
                .count();
            String key = String.format("%s_%s", term_i, term_j);
            pij.put(key, p / Count);
        }
    }
}
    Libraries are no magic. Your code isn't slow because you're not using a library; it's slow because you're using two nested loops containing another stream operation. That is, `terms.size()`×`terms.size()`×`reviews.size()` operations. – Holger Dec 13 '17 at 07:47
    That's right, but this is inevitable. So I thought it might be possible to use a faster method instead of parallelStream. @Holger – m.kabiri Dec 13 '17 at 08:18
    It’s not inevitable. That’s the art of developing algorithms. It’s the reason why we know so many different sorting algorithms; there are many different ways to solve the same task and you can never assume that there can’t be a better one. – Holger Dec 13 '17 at 08:20

1 Answer


Your first loop, which collects the distinct words, relies on ArrayList.contains, which has linear time complexity, instead of using a Set. So if we assume n_d distinct words, it already has a time complexity of “number of lines”×n_d.
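For illustration only (this sketch is my addition, not part of the original answer), the first loop could collect the distinct words into a `LinkedHashSet`, which preserves insertion order like the original list but answers the duplicate check in constant time:

Set<String> terms = new LinkedHashSet<>();   // O(1) duplicate check, keeps insertion order
List<String> reviews = new ArrayList<>();

try (BufferedReader br = new BufferedReader(new FileReader("src/reviews-preprocessing.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        Collections.addAll(terms, line.split(" ")); // Set.add silently ignores duplicates
        reviews.add(line);
    }
}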

Then, you are creating n_d×n_d word combinations and probing all 1,000 lines for the presence of each of these combinations. In other words, if we assume only 100 distinct words, you are performing 1,000×100 + 100×100×1,000 = 10,100,000 operations; if we assume 500 distinct words, we're talking about 250,500,000 already.

Instead, you should create only the combinations that actually exist in a line and collect them into the map. This processes only the combinations that actually occur, and you can improve it further by checking only one of each “a_b”/“b_a” pair, as the probability of both is identical. Then, you are only performing “number of lines”×“words per line”×“words per line” operations, in other words, roughly 16,000 operations in your case.

The following method combines all words of a line, keeping only one of each “a_b”/“b_a” pair, and eliminates duplicates so that each combination counts at most once per line.

static Stream<String> allCombinations(String line) {
    String[] words = line.split(" ");
    return Arrays.stream(words)
        .flatMap(word1 ->
            Arrays.stream(words)
                  // keep only the lexicographically ordered pair, so “a_b”/“b_a” is counted once
                  .filter(word2 -> word1.compareTo(word2) < 0)
                  .map(word2 -> word1 + '_' + word2))
        .distinct(); // each combination counts at most once per line
}

This method can be used like

List<String> lines = Files.readAllLines(Paths.get("src/reviews-preprocessing.txt"));
double ratio = 1.0/lines.size();
Map<String, Double> pij = lines.stream()
        .flatMap(line -> allCombinations(line))
        .collect(Collectors.groupingBy(Function.identity(),
                                       Collectors.summingDouble(x->ratio)));
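Since only the lexicographically ordered “a_b” form ends up in the map, a lookup has to normalize the key order first. A hypothetical helper for that (my addition, not part of the original answer) could look like:

// Hypothetical helper: order the two words before building the key,
// because the map only contains the “a_b” form with a < b.
static double probability(Map<String, Double> pij, String a, String b) {
    String key = a.compareTo(b) < 0 ? a + '_' + b : b + '_' + a;
    return pij.getOrDefault(key, 0.0);
}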

It ran through my copy of “War and Peace” within a few seconds, without needing any attempt at parallel processing. Not very surprisingly, “and_the” was the combination with the highest probability.

You may consider changing the line

String[] words = line.split(" ");

to

String[] words = line.toLowerCase().split("\\W+");

to generalize the code to work with different input, handling multiple spaces or other punctuation characters and ignoring case.
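To illustrate (the sample string is my own, not from the answer), `\W+` splits on any run of non-word characters, so punctuation and repeated spaces no longer produce bogus terms:

// "\\W+" matches one or more non-word characters (spaces, commas, dashes, …)
String[] words = "Hello,  world -- hello WORLD!".toLowerCase().split("\\W+");
// words == { "hello", "world", "hello", "world" }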

Holger
    there is an opinion that the actual book name should have been `war and humanity (planet, light, earth)` etc., as in a word that does NOT mean peace. Originally it was written as `мiръ` (which is `!= peace`). This is now seen as either a typo in the first printed book or a word that is `!= peace`; still, the name will probably live on as “war and peace” anyway – Eugene Dec 13 '17 at 08:37
    @Eugene: I wasn’t aware that there were two distinct words for world and peace before the revolution; I only knew “мир”, meaning both. But anyway, your assumption is right: we will continue using the well-known name to be sure that the reader also knows what we’re talking about… – Holger Dec 13 '17 at 08:49