Problem statement: I have been given multiple PDF files (more than 10) and need to extract values based on keywords. The keywords, e.g. 'Metric Tonnes', 'Volume', 'Climate Change', are the same for every PDF file.
For example, when I process a file such as 'A.pdf' manually, I open it, search for the keyword 'Metric tonnes' using CTRL+F, and get multiple matches in the file. I then go through the matches and pick out the value for this keyword, e.g. 333,333. I repeat this process for the rest of the keywords.
How do I extract the values for these keywords from a PDF file using spaCy in Python? I have done my research and so far haven't found anything substantial except from here. This, however, returns the matching keywords and not their values. I would like to extract the values for the specified keywords. How do I approach this problem? Please help.
EDIT: One way I can think of is to extract the text from the PDF files using the pdfminer or PyPDF2 library and clean it, then create a list of keywords and search for each one throughout the document. If a keyword is found, save the match in a list or DataFrame; if not, discard the text. However, I am not sure how efficient that would be.
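A minimal sketch of the approach described in the EDIT, using plain regular expressions instead of spaCy. It rests on several assumptions: the value is the first number within a short window after the keyword, numbers look like 333,333 or 1,200.5, and pdfminer.six is installed for the text extraction step. The names `extract_values`, `pdf_to_text`, and `window` are made up for illustration.

```python
import re

# Keywords from the question; matching is case-insensitive.
KEYWORDS = ["Metric Tonnes", "Volume", "Climate Change"]

# Assumed number format: 333,333 or 1,200.5 or 42.
NUMBER = r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?"


def extract_values(text, keywords=KEYWORDS, window=40):
    """Return {keyword: [first number found within `window` non-digit
    characters after each occurrence of the keyword]}."""
    results = {}
    for kw in keywords:
        # Allow any whitespace between the words of a keyword, because
        # PDF text extraction often inserts line breaks.
        kw_pattern = r"\s+".join(re.escape(word) for word in kw.split())
        pattern = re.compile(
            rf"{kw_pattern}\D{{0,{window}}}?({NUMBER})", re.IGNORECASE
        )
        results[kw] = [m.group(1) for m in pattern.finditer(text)]
    return results


def pdf_to_text(path):
    """Extract raw text from one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so extract_values can be used without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(path)
```

For several files, one could call `extract_values(pdf_to_text(path))` for each path and collect the resulting dictionaries in a list or a DataFrame. Whether the first number after the keyword is really the wanted value depends on the layout of the PDFs, so the `window` size and the number pattern would likely need tuning per document type.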