Problem statement: I have been given multiple PDF files (more than 10) and need to extract values based on keywords. The keywords, e.g. 'Metric Tonnes', 'Volume', 'Climate Change', stay the same for all PDF files.

For example, when I process a file such as 'A.pdf' manually, I open it, search for the keyword 'Metric tonnes' with CTRL+F, and get multiple hits in the file. I then go through the hits to find the value for that keyword, e.g. 333,333. I repeat this process for the rest of the keywords.

How do I extract the values for these keywords from a PDF file using spaCy in Python? I have done my research and so far haven't found anything substantial apart from here. That approach, however, returns the matching keywords and not their values. I would like to extract the values that belong to the specified keywords. How should I approach this problem?

EDIT: One approach I can think of is to extract the text from the PDF files using the pdfminer or PyPDF2 library, clean the text, create a list of keywords, and search for each keyword throughout the document. If a keyword is found, save the match in a list or dataframe; if not, discard the text. However, I am not sure how efficient that would be.
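The steps in the EDIT could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for real reports. It assumes that pdfminer.six is installed and that a value is the first number appearing within a few dozen characters after the keyword. The helper names (`pdf_to_text`, `clean`, `find_values`, `scan_folder`), the `window` size, and the number format are all hypothetical choices made for this example.

```python
import re
from pathlib import Path

# Keywords taken from the question; extend as needed.
KEYWORDS = ["Metric Tonnes", "Volume", "Climate Change"]

# Assumed number format: 333,333 or 1,200.5 or a plain 42.
NUMBER = r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?"


def pdf_to_text(path):
    """Extract raw text from one PDF. Assumes pdfminer.six is installed."""
    from pdfminer.high_level import extract_text  # imported lazily
    return extract_text(str(path))


def clean(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def find_values(text, keywords=KEYWORDS, window=60):
    """Return one row per keyword hit that is followed by a number.

    A hit counts only if a number starts within `window` non-digit
    characters after the keyword (an assumption, tune it per corpus).
    """
    rows = []
    for kw in keywords:
        pattern = re.compile(
            re.escape(kw) + r"\D{0,%d}?(%s)" % (window, NUMBER),
            re.IGNORECASE,
        )
        for m in pattern.finditer(text):
            rows.append({"keyword": kw, "value": m.group(1)})
    return rows


def scan_folder(folder):
    """Run the extraction over every *.pdf file in `folder`."""
    rows = []
    for path in sorted(Path(folder).glob("*.pdf")):
        for row in find_values(clean(pdf_to_text(path))):
            rows.append({"file": path.name, **row})
    return rows


if __name__ == "__main__":
    sample = "Total emissions in Metric Tonnes: 333,333. The Volume was 1,200.5 litres."
    print(find_values(clean(sample)))
```

The list of dictionaries returned by `scan_folder` can be passed to `pandas.DataFrame(rows)` if a dataframe is preferred. Note that this naive pattern picks up any digit after the keyword, so a phrase like "Metric Tonnes of CO2" would yield "2"; real documents will likely need a stricter pattern.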

technophile_3
  • I don't think NLP is needed for your purpose. You could just use some module to extract text from the PDF (I suggest [PDF Plumber](https://github.com/jsvine/pdfplumber), see as an example [this post](https://stackoverflow.com/questions/66900539/how-to-stop-pdfplumber-from-reading-the-header-of-every-pages/66902615#66902615) on how to use it) and then use some kind of regex to get the values you need. – SilentCloud Oct 21 '21 at 07:57
  • @SilentCloud so, I was thinking in the right direction. please read my EDIT – technophile_3 Oct 21 '21 at 08:28
  • Yes, you could save the extracted values in a list and then dump them into a file. As for efficiency, it depends on how many files you have, how long they are, and your time constraints. – SilentCloud Oct 21 '21 at 08:52
  • To be honest, there are more than 7k files. @SilentCloud – technophile_3 Oct 21 '21 at 08:55
  • Uhm ok. Anyway, in my experience regexes are much faster than NLP, because you don't have to convert the text into tensors and run all the model operations. I guess the only way to know is to try! – SilentCloud Oct 21 '21 at 08:58
  • @KJ: As for the example, there are multiple instances of a keyword, say 'scope1', in the PDF file, and not all of them have values. I am interested only in the instances that appear in running text, not in a pie chart or table. The value could be stored in a simple sentence like 'the scope of so and so is 333,333,100', and from this example we would extract the value. How to do that is my question. – technophile_3 Oct 21 '21 at 13:34
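
For the case described in the last comment, where only some instances of a keyword sit in a sentence with a value, one possible sketch is to split the text into sentences and keep only those in which a number follows the keyword. It uses the standard library only. The function name `values_in_sentences`, the naive sentence splitter, and the number format are assumptions for illustration; text that comes from charts or tables may still slip through if the PDF extractor emits it as sentence-like lines.

```python
import re

# Assumed number format: 333,333,100 or 1,200.5 or a plain 42.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def values_in_sentences(text, keyword):
    """Return (sentence, value) pairs for sentences where a number
    appears after `keyword`. Sentences without a number are skipped."""
    hits = []
    # Naive split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        pos = sentence.lower().find(keyword.lower())
        if pos == -1:
            continue  # keyword not in this sentence
        # Search only after the keyword, so digits inside the keyword
        # itself (e.g. the 1 in 'scope1') are not mistaken for a value.
        match = NUMBER.search(sentence, pos + len(keyword))
        if match:
            hits.append((sentence.strip(), match.group()))
    return hits


if __name__ == "__main__":
    text = (
        "Scope1 is shown in the chart below. "
        "The scope1 of so and so is 333,333,100. "
        "Unrelated figure 42."
    )
    for sentence, value in values_in_sentences(text, "scope1"):
        print(value, "<-", sentence)
```

Keeping the matched sentence next to the value makes it easier to spot-check results by hand across a large batch of files.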
