You can use n-grams for this: count the number of times each sequence of n contiguous words appears. Since you don't know in advance how long the repeated phrases will be, you can try several values of n, e.g. from 2 to 6.
A Java n-grams example, tested on JDK 1.8.0:
import java.util.*;

public class NGramExample {

    public static Map<String, Integer> ngrams(String text, int n) {
        List<String> words = Arrays.asList(text.split(" "));
        Map<String, Integer> map = new HashMap<>();
        int c = words.size();
        // Slide a window of n words over the text and count each n-gram
        for (int i = 0; i + n <= c; i++) {
            String ngramWords = String.join(" ", words.subList(i, i + n));
            map.merge(ngramWords, 1, Integer::sum);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println("Ngrams: ");
        Map<String, Integer> res = ngrams("Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address", 2);
        for (Map.Entry<String, Integer> entry : res.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue());
        }
    }
}
The output:
Ngrams:
name abc:1
xyz@abc.com. Patient:1
emailid xyz@abc.com.:1
phone no:1
12345 emailid:1
Patient name:2
xyz phone:1
address some:1
us address:1
name xyz:1
some us:1
no 12345:1
abc address:1
As you can see, 'Patient name' has the maximum count: 2 occurrences. You could run this function for several values of n and keep the n-grams with the highest counts.
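To make the "several n values" idea concrete, here is a small sketch that reuses the same counting logic as above and scans n from 2 to 6, printing the most frequent n-gram for each n (the 2-to-6 range and the class name `TopNGram` are just illustrative choices, not part of the original answer):

```java
import java.util.*;

public class TopNGram {

    // Same sliding-window n-gram counting as the ngrams method above
    static Map<String, Integer> ngrams(String text, int n) {
        String[] words = text.split(" ");
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = i + 1; j < i + n; j++) {
                sb.append(' ').append(words[j]);
            }
            map.merge(sb.toString(), 1, Integer::sum);
        }
        return map;
    }

    public static void main(String[] args) {
        String text = "Patient name xyz phone no 12345 emailid xyz@abc.com. "
                    + "Patient name abc address some us address";
        // Try n = 2..6 and report the most frequent n-gram for each n
        for (int n = 2; n <= 6; n++) {
            ngrams(text, n).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .ifPresent(e -> System.out.println(
                    e.getKey() + " -> " + e.getValue()));
        }
    }
}
```

For this sample text only the bigram 'Patient name' occurs more than once; every longer n-gram has a count of 1, so in practice you would keep only the n-grams whose count exceeds some threshold.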
Edit: I will leave this Python code here for historical reasons.
A simple working Python example (using nltk) to show what I mean:
from nltk import ngrams
from collections import Counter
paragraph = 'Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address'
n = 2
words = paragraph.split(' ') # of course you should split sentences in a better way
bigrams = ngrams(words, n)
c = Counter(bigrams)
c.most_common()[0]
This gives you the output:
>> (('Patient', 'name'), 2)