I wrote a simple Java program that uses PDFBox to extract words from a PDF file. It reads the text from the PDF and extracts it word by word.

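import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {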

    public static void main(String[] args) throws Exception {
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e){
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}

Is there a way to extract the words without duplicates?

TomCold
  • In general, you can use a Set for that, something like Set<String> words = new HashSet<>(); then add each word with words.add(word). The set ignores duplicate words, so afterwards you can iterate over it to get every distinct word. – No Em Oct 09 '18 at 04:02
  • @NoEm How would that look in the code? – TomCold Oct 09 '18 at 04:03
  • Set<String> uniqueWords = new HashSet<>(); /* holds all non-duplicated words */ for (String line : lines) { for (String word : line.split(" ")) { uniqueWords.add(word.trim()); } } System.out.println("Non-duplicated words: "); for (String word : uniqueWords) { System.out.println(word); } – No Em Oct 09 '18 at 04:09
  • You could post it as an answer instead – TomCold Oct 09 '18 at 04:54

2 Answers

  1. Split each line on spaces: line.split(" ")
  2. Maintain a HashSet and add every word to it.

A HashSet ignores duplicates by design: adding a word it already contains leaves the set unchanged.

HashSet<String> uniqueWords = new HashSet<>();

for (String line : lines) {
  String[] words = line.split(" ");

  for (String word : words) {
    uniqueWords.add(word);
  }
}
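
One caveat with splitting on a single space: consecutive spaces produce empty strings, and tabs are not treated as separators. Below is a minimal sketch of the same idea that splits on runs of whitespace instead. The class name UniqueWords and the method name extract are made up for illustration, and LinkedHashSet is swapped in for HashSet only so that the words come out in first-seen order.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class; the names are for illustration only.
class UniqueWords {

    // Collects the distinct whitespace-separated tokens from the given lines,
    // in the order they are first seen.
    static Set<String> extract(String[] lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            // "\\s+" treats any run of spaces or tabs as a single separator.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) { // a blank line yields one empty token
                    unique.add(word);
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        String[] lines = {"the quick  brown fox", "", "the\tlazy fox"};
        System.out.println(extract(lines)); // prints [the, quick, brown, fox, lazy]
    }
}
```

In the question's program, the lines array passed to extract would be the one produced by pdfFileInText.split("\\r?\\n"). Words that differ only in case or trailing punctuation ("Fox", "fox.") still count as different here.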
Rishikesh Dhokare

If your goal is to remove duplicates, one way to do it is to copy the array into a java.util.Set. With your existing lines array, that is:

Set<String> noDuplicates = new HashSet<>( Arrays.asList( lines ) );

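The resulting set contains no duplicates. Since each element of lines is a whole line, this removes duplicate lines rather than duplicate words.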
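
A minimal sketch contrasting line-level and word-level deduplication, in case it helps. The class and method names are hypothetical, and TreeSet is used instead of HashSet only to get a predictable (sorted) print order.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are for illustration only.
class LineVsWordDedup {

    // Same idea as the one-liner above: each array element (a line) is one set entry.
    static Set<String> distinctLines(String[] lines) {
        return new TreeSet<>(Arrays.asList(lines));
    }

    // Splits every line on whitespace first, then collects the distinct words.
    static Set<String> distinctWords(String[] lines) {
        return Arrays.stream(lines)
                .flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                .filter(word -> !word.isEmpty()) // drop the empty token from blank lines
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        String[] lines = {"a b", "b c", "a b"};
        System.out.println(distinctLines(lines)); // prints [a b, b c]
        System.out.println(distinctWords(lines)); // prints [a, b, c]
    }
}
```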

Rigo Sarmiento