I am looking at a set of 10 PDFs, and I want to write code that tells me how many times a couple of predetermined words appear in the documents. So far, I've been using the pdftools and tm packages to find the frequencies of the most common words in the documents, but I don't know how to look for specific words. Thanks!
1 Answer
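You can start off with pdftotext, then send its output through your choice of OS string filter. On Windows, the best of the several available in this case is Findstr.
Note that Findstr reports matching lines, not words. In my test the string count was 13, but two lines had the same word more than once, so the word count would be 15. A quoted search string containing spaces is treated as several separate search strings, so a line matches if it contains any one of them. However, there are no objects called words in a PDF; that is a text thing. So just beware that short words may get you more matches than expected, since they can also match inside longer words.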
pdftotext filename.pdf "%temp%\pdfout.txt" && echo/ && findstr /O /I "one word or more" "%temp%\pdfout.txt"
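If you need true per-word totals rather than matching lines, one option is to count whole-word matches in the extracted text yourself. The following is only a minimal sketch in Python, not part of the Findstr approach above; the function name `count_words` and the sample text are made up for illustration.

```python
import re
from collections import Counter

def count_words(text, words):
    """Count whole-word, case-insensitive occurrences of each target word."""
    counts = Counter()
    for word in words:
        # \b anchors stop "one" from matching inside "Money" or "word" inside "words".
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        counts[word] = len(pattern.findall(text))
    return dict(counts)

# Hypothetical sample text standing in for the pdftotext output.
sample = "One word here.\nNo more, no more words.\nMoney is not one."
print(count_words(sample, ["one", "word", "more"]))
# prints {'one': 2, 'word': 1, 'more': 2}
```

Here the second line is counted twice for "more", which a line-based filter would report only once, and "Money" and "words" are not counted because of the word-boundary anchors.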
For multiple files, simply wrap that in a "for" loop. On Windows, see For /?
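The same loop idea can be sketched in Python, assuming pdftotext is installed and on the PATH. The names `pdf_to_text` and `count_in_folder` are hypothetical, and the `extract` parameter exists only so that the text extraction step can be swapped out.

```python
import re
import subprocess
from pathlib import Path

def pdf_to_text(pdf_path):
    """Run pdftotext on one file; "-" as the output name sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",   # assumes pdftotext's default UTF-8 output
        errors="replace",
        check=True,
    )
    return result.stdout

def count_in_folder(folder, words, extract=pdf_to_text):
    """Return {file name: {word: count}} for every .pdf file in folder."""
    patterns = {
        w: re.compile(r"\b" + re.escape(w) + r"\b", re.IGNORECASE) for w in words
    }
    totals = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        text = extract(pdf)
        # Whole-word, case-insensitive count of each target word in this file.
        totals[pdf.name] = {w: len(p.findall(text)) for w, p in patterns.items()}
    return totals

# Example call (needs pdftotext and some PDFs in the current folder):
# for name, counts in count_in_folder(".", ["one", "word", "more"]).items():
#     print(name, counts)
```

This gives one dictionary of counts per PDF, so the 10 files can be compared side by side.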

K J