Java - Extracting non duplicate words from PDF files

Question

I wrote a simple Java program that uses PDFBox to extract words from a PDF file. It reads the text from the PDF and extracts it word by word.

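import java.io.File;
import java.io.IOException;

// Package names below assume PDFBox 2.x
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // Split the extracted text into lines (handles both \r\n and \n)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}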

Is there a way to extract the words without duplicates?

In general, you can use a Set for that, for example: Set<String> words = new HashSet<>(); Then add each word with words.add(word); the set ignores any word it already contains. After that, iterate over the set to get all the distinct words. – No Em, Oct 09 '18 at 04:02
// Hold all non-duplicated words
Set<String> uniqueWords = new HashSet<>();
for (String line : lines) {
    String[] words = line.split(" ");
    for (String word : words) {
        uniqueWords.add(word.trim());
    }
}
// Print all non-duplicated words
System.out.println("Non-duplicated words: ");
for (String word : uniqueWords) {
    System.out.println(word);
}
– No Em, Oct 09 '18 at 04:09

score 3 · Accepted Answer · answered Oct 09 '18 at 03:59

Split each line on spaces with line.split(" ").
Maintain a HashSet to hold these words and keep adding every word to it.

A HashSet ignores duplicates by nature: adding an element that is already present leaves the set unchanged.

HashSet<String> uniqueWords = new HashSet<>();

for (String line : lines) {
  String[] words = line.split(" ");

  for (String word : words) {
    uniqueWords.add(word);
  }
}
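
A set only treats two strings as duplicates when they are exactly equal, so "Word", "word" and "word," all end up as separate entries, and splitting on a single space yields empty strings wherever two spaces are adjacent. The following is a minimal sketch of one way to normalize tokens before adding them. The class name UniqueWords and the method extract are hypothetical, and lower-casing plus splitting on every non-letter, non-digit character is an assumption about what should count as the same word (it also splits "don't" into "don" and "t").

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class, not part of PDFBox or of the original answer.
class UniqueWords {

    // Collects each distinct word once, ignoring case and surrounding punctuation.
    static Set<String> extract(String[] lines) {
        // LinkedHashSet keeps first-seen order; a plain HashSet also works if order is irrelevant.
        Set<String> uniqueWords = new LinkedHashSet<>();
        for (String line : lines) {
            // Split on any run of characters that are neither letters nor digits,
            // so "word,", "word." and "word" all produce the same token.
            for (String token : line.split("[^\\p{L}\\p{N}]+")) {
                // A line that starts with a separator produces a leading empty token; skip it.
                if (!token.isEmpty()) {
                    uniqueWords.add(token.toLowerCase(Locale.ROOT));
                }
            }
        }
        return uniqueWords;
    }

    public static void main(String[] args) {
        String[] lines = { "The cat sat.", "the  CAT, again" };
        System.out.println(extract(lines)); // prints [the, cat, sat, again]
    }
}
```

In the question's program, the lines array produced from pdfFileInText could be passed to extract unchanged.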

answered Oct 09 '18 at 03:59

Rishikesh Dhokare


So I need to create one? How do I extract the words into a HashSet, then? – TomCold Oct 09 '18 at 04:01
When I try to print uniqueWords, I can still see duplicates in each key – TomCold Oct 09 '18 at 04:10
After storing them in the HashSet, would it be possible to store these "words" in a database like MySQL for full-text indexing? – TomCold Oct 09 '18 at 04:16

score 0 · Answer 2 · answered Oct 09 '18 at 04:24

If your goal is to remove duplicates, one way to achieve it is to add the array to a java.util.Set. With your current code, all you need to do is this:

Set<String> noDuplicates = new HashSet<>( Arrays.asList( lines ) );

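No more duplicates. Note that lines holds whole lines, so this removes duplicate lines; the array has to contain individual words for the set to deduplicate words.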
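
For word-level deduplication in the same one-expression spirit, each line can be split before everything is collected into a set. This is a minimal sketch using streams; the class name UniqueWordsStream and the method fromLines are hypothetical, words are assumed to be separated by whitespace, and a TreeSet is used only so the result comes out sorted.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class UniqueWordsStream {

    // Flattens all lines into one stream of words and collects the distinct ones.
    static Set<String> fromLines(String[] lines) {
        return Arrays.stream(lines)
                // Split every line on runs of whitespace and merge the pieces into one stream.
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                // A blank line splits into a single empty string; drop it.
                .filter(word -> !word.isEmpty())
                // TreeSet removes duplicates and keeps the words in natural (sorted) order.
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String[] lines = { "to be or", "", "not to be" };
        System.out.println(fromLines(lines)); // prints [be, not, or, to]
    }
}
```

Unlike the set built directly from lines, this compares individual words, but it is still case-sensitive and keeps punctuation attached to words.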

answered Oct 09 '18 at 04:24

Rigo Sarmiento


How would I store the words from the HashSet in a MySQL table? – TomCold Oct 09 '18 at 06:32
That's a different problem. – Rigo Sarmiento Oct 10 '18 at 08:10
