
I am following the Extractive QA tutorial from Haystack's website and am trying to convert a PDF to text. The blog post is here: https://www.deepset.ai/blog/automating-information-extraction-with-question-answering

I installed haystack with pip, but I get the error below. I also tried !pip install haystack.nodes, but that doesn't work either.

Note: I am using Google Colab for this.

Here is my detailed code and error:

!pip -q install haystack haystack.nodes
path = '/content/drive/MyDrive/Colab Notebooks/NLP/Information Extraction QA with Haystack (Adidas Financial corpus)'
from haystack.nodes import PDFToTextConverter

pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=['en'])

converted = pdf_converter.convert(file_path = path, meta = { 'company': 'Company_1', 'processed': False })

ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-7-61021fb3b7b8> in <cell line: 1>()
----> 1 from haystack.nodes import PDFToTextConverter
      2 
      3 pdf_converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=['en'])
      4 
      5 converted = pdf_converter.convert(file_path = path, meta = { 'company': 'Company_1', 'processed': False })
ewokx
Panda Bear

1 Answer


To install Haystack, you need to run pip install farm-haystack. The PyPI package is called farm-haystack, not just haystack, as Stefano mentioned. A good starting point is the Haystack tutorials, which you can run as Python notebooks on Google Colab, for example this tutorial using the PDFToTextConverter.
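If it helps, here is a minimal sketch for checking, before the import, whether the right distribution is installed and whether haystack.nodes can be found at all. The helper names is_distribution_installed and is_module_importable are made up for this illustration and are not part of Haystack; the sketch only uses the standard library. In Colab, the install cell itself would be !pip install farm-haystack, which is shown here only as a comment.

```python
# In a Colab cell, install the package first (shell command, not Python):
#   !pip install farm-haystack

import importlib.util
from importlib import metadata


def is_distribution_installed(dist_name: str) -> bool:
    """Return True if a pip distribution with this name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


def is_module_importable(module_name: str) -> bool:
    """Return True if Python can locate the module.

    Note: for a dotted name such as `haystack.nodes`, find_spec imports
    the parent package, so a missing parent raises an ImportError.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Hypothetical helpers above; names checked are the ones from this thread.
if not is_distribution_installed("farm-haystack"):
    print("farm-haystack is not installed; run: pip install farm-haystack")

if is_module_importable("haystack.nodes"):
    print("haystack.nodes found; the PDFToTextConverter import should work")
else:
    print("haystack.nodes not found; the import will raise ModuleNotFoundError")
```

After installing in Colab, you may need to restart the runtime before the import picks up the new package.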

Julian Risch