I have a set of PDF files with different numbers of pages and different layouts. Each file contains a list of information that I need to extract, but the problem is that the information is wrapped in different types of phrases and syntax. I need to know whether I have to build a machine learning model to do this and, if so, which algorithms and techniques are suited to my case. NB: I have a huge dataset of PDF files that I can use to train a model.
Viewed 444 times
-3
-
Is your issue simply extracting the information or is it also analyzing it once extracted? If the latter is the case, what is the purpose of your analysis? Without this information no one can guide you. Also mention what you have tried with regards to extracting the text so far. – Fruitspunchsamurai Jan 23 '17 at 16:40
-
For now I just need to extract that information. – abderr080 Jan 23 '17 at 17:15
-
Can you give an example of how the data is structured in your question? You can probably use [Tabula](http://tabula.technology/) depending on how the data is structured. – Fruitspunchsamurai Jan 23 '17 at 17:27
-
For example, I want to extract the company name: Siemens AG in the 1st picture, OMRON Corporation in the 2nd picture, and TOKAI RIKA in the 3rd and last pictures: [https://www.dropbox.com/s/mc39qt6cizzd7rc/cpt1.JPG?dl=0], [https://www.dropbox.com/s/4fq7l23c6vqcpcr/cpt2.JPG?dl=0], [https://www.dropbox.com/s/cmcnkf7z9l0747o/cpt3.JPG?dl=0] and [https://www.dropbox.com/s/n5sazg8imrwiocg/cpt4.JPG?dl=0] – abderr080 Jan 23 '17 at 18:10
-
Your tags are all over the place. [tag:python]: why? You don't mention any programming language in your question. [tag:text-extraction]: why? You don't seem to have a problem with *extracting* the text. [tag:pdf]: why? Okay, your sources are PDF files – but your question is not *about* PDF, or the problems you have with it. – Jongware Jan 24 '17 at 10:09
1 Answer
0
If you want to do this in Python, PyPDF2 seems to be the way to go. You should be able to read in the PDFs and extract the text you need from them. Automate the Boring Stuff has examples of using PyPDF2.
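As a rough illustration of the reading step, here is a minimal sketch. It assumes a recent PyPDF2 release that provides `PdfReader` and `page.extract_text()`. The helper names `pages_to_text` and `read_pdf_text` are made up for this example, as is any file path passed in. Scanned pages usually return no text this way and need OCR first.

```python
def pages_to_text(pages):
    """Join the text of page-like objects that expose extract_text()."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        # Scanned (image-only) pages often yield None or an empty string.
        text = page.extract_text() or ""
        chunks.append(f"--- page {number} ---\n{text.strip()}")
    return "\n".join(chunks)


def read_pdf_text(path):
    """Return the text of every page of the PDF at `path` as one string."""
    # Third-party dependency, imported lazily so the helper above
    # can be used (and tested) without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return pages_to_text(reader.pages)
```

Keeping the page markers in the output makes it easier to trace an extracted value back to the page it came from.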

Fruitspunchsamurai
-
I am using PyPDF2 in combination with OCR, because I have scanned PDFs, to get the text from the PDF files. My concern is how to extract information such as company names, frequencies, module names, etc. from this text. That information is wrapped in different contexts and phrases. My PDFs also contain tables that I cannot get well formatted after conversion to text. – abderr080 Jan 23 '17 at 21:00
-
Is there some underlying structure to the context and phrases? If you yourself cannot discern any underlying structure in the data, I'm not sure that you can write something that will. Is there a way you can search using regex for company names and other things? – Fruitspunchsamurai Jan 23 '17 at 21:15
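To make the regex idea concrete, here is a minimal sketch using only Python's standard `re` module. The patterns are assumptions, not something derived from the actual documents: company names are guessed from a trailing legal-form suffix (AG, GmbH, Corporation, ...) and frequencies from a number followed by a unit. `SAMPLE` is invented text standing in for OCR output, and `extract_fields` is a hypothetical helper name.

```python
import re

# Assumed rule: one to four capitalised words followed by a legal-form suffix.
COMPANY_RE = re.compile(
    r"\b((?:[A-Z][\w&.-]*[ \t]+){1,4}"
    r"(?:AG|GmbH|Corporation|Inc\.|Co\.,[ \t]*Ltd\.))(?!\w)"
)
# Assumed rule: a decimal number followed by a frequency unit.
FREQUENCY_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(Hz|kHz|MHz|GHz)\b")

# Invented sample text, standing in for the OCR output of one page.
SAMPLE = (
    "Applicant: Siemens AG, Munich\n"
    "Manufacturer: OMRON Corporation\n"
    "Operating frequency: 125 kHz and 433.92 MHz\n"
)


def extract_fields(text):
    """Collect candidate company names and frequencies from plain text."""
    companies = [m.group(1) for m in COMPANY_RE.finditer(text)]
    frequencies = [f"{value} {unit}" for value, unit in FREQUENCY_RE.findall(text)]
    return {"companies": companies, "frequencies": frequencies}


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

A name without a recognisable suffix, such as TOKAI RIKA on its own, would be missed by this pattern. That kind of gap is where a lookup list of known companies or a trained named-entity model may become necessary.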
-
Thanks for your response. I will now look into how the data is structured. I also think Tabula may be a good help for analysing the tables. – abderr080 Jan 24 '17 at 09:01