Obtaining the PDF from the internet is called scraping; actually reading data out of the PDF is quite another problem!
There are many utilities available which try to convert PDF to text - none entirely successfully. As this article explains, PDF files are nice to look at, but the internals are nowhere near as elegant. The reason is that the visible text is frequently not stored directly inside the document, and has to be reconstructed from lookup tables. In some cases the PDF doesn't even contain text at all, but is just an image of text.
The article lists several tools which (try to) convert PDF to text, some of which have 'wrappers' in Python to access them. There are also a few modules which sound interesting but really aren't - PyPDF, for example, does not convert to text.
aTXT looks interesting for data mining - I haven't tested it yet.
As mentioned above, most of these are wrappers (or GUIs) around existing - mostly command-line - tools. E.g. a simple tool (which works with your PDF!) on Linux is pdftotext (if you want to stay in Python, you can call it with subprocess's call, or even with os.system).
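A minimal sketch of that approach - the file names and the `-layout` option (which tries to preserve the page layout) are my own choices, not anything from your question:

```python
import subprocess
from pathlib import Path

def build_command(pdf_path, txt_path=None):
    """Build the pdftotext command line; by default the output
    file is the input name with a .txt suffix."""
    pdf = Path(pdf_path)
    txt = Path(txt_path) if txt_path else pdf.with_suffix(".txt")
    # -layout asks pdftotext to keep the original physical layout
    return ["pdftotext", "-layout", str(pdf), str(txt)]

def pdf_to_text(pdf_path, txt_path=None):
    """Run pdftotext (from poppler-utils) and return the output path."""
    cmd = build_command(pdf_path, txt_path)
    # subprocess.call returns the tool's exit status (0 on success)
    status = subprocess.call(cmd)
    if status != 0:
        raise RuntimeError("pdftotext failed with exit code %d" % status)
    return cmd[-1]
```

This requires pdftotext to be installed (on Debian/Ubuntu it's in the `poppler-utils` package); `subprocess.call` just hands the work to the external binary.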
After this, you get a text file, which you can process much more easily with basic Python string functions, regular expressions, or something more sophisticated like pyparsing.
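For instance, with a regular expression you can pull structured rows out of the extracted text. The sample text below is invented for illustration - it just mimics the column-aligned output pdftotext tends to produce:

```python
import re

# Hypothetical pdftotext output: label and value separated by runs of spaces
sample = """Invoice 2021-03-15
Item A    12.50
Item B     3.99
Total     16.49
"""

# Each data line: a label, two or more spaces, then a decimal price
price_re = re.compile(r"^(\S.*?)\s{2,}(\d+\.\d{2})$", re.MULTILINE)
rows = price_re.findall(sample)
# → [('Item A', '12.50'), ('Item B', '3.99'), ('Total', '16.49')]
```

The `\s{2,}` is doing the real work here: column-aligned output usually separates fields with multiple spaces, which single-space text inside a label does not.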