Haystack PDFToTextConverter: getText() got an unexpected keyword argument 'textpage'

Question

I worked through the Haystack beginner tutorial, and it works fine. Now I am trying to use a local PDF on my PC instead of the Game of Thrones Wikipedia articles, and I always get an error.

This is the code:

from haystack.nodes import PDFToTextConverter
from pathlib import Path


def haystack():
    converter = PDFToTextConverter(
        remove_numeric_tables=True,
        valid_languages=["de"]
    )

    docs = converter.convert(file_path=Path("C:/Users/Franzi/Documents/myPDF.pdf"), meta=None)


if __name__ == '__main__':
    haystack()

Traceback (most recent call last):
  File "C:\Users\Franzi\PycharmProjects\pythonProject2\main.py", line 15, in <module>
    haystack()
  File "C:\Users\Franzi\PycharmProjects\pythonProject2\main.py", line 11, in haystack
    docs = converter.convert(file_path=Path("C:/Users/Franzi/Documents/myPDF.pdf"), meta=None)
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\site-packages\haystack\nodes\file_converter\pdf.py", line 171, in convert
    pages = self._read_pdf(
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\site-packages\haystack\nodes\file_converter\pdf.py", line 301, in _read_pdf
    for page in results:
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\process.py", line 484, in _chain_from_iterable_of_lists
    for element in iterable:
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "C:\Users\Franzi\AppData\Local\Programs\Python\Python38\lib\concurrent\futures\_base.py", line 388, in __get_result
    raise self._exception
TypeError: getText() got an unexpected keyword argument 'textpage'

I am using Python 3.8 and PyCharm 2023.2. I have tried different PDFs and also tried

from haystack.utils import convert_files_to_docs
convert_files_to_docs()

but it gives me the same error. Any ideas what I am doing wrong here?

Hey... Your code works fine for me. I installed Haystack using `pip install farm-haystack[preprocessing,file-conversion,pdf]`. I am running version 1.19 on Ubuntu (while you are on Windows). Please provide more details about your version. For further assistance and discussion, you can join the Haystack Discord community: https://discord.gg/haystack — Stefano Fiorucci - anakin87, Jul 27 '23 at 10:48
I used Anaconda Prompt to install Haystack the same way (as described in the tutorial) using `pip install farm-haystack[preprocessing,file-conversion,pdf]`, and I am using version 1.17.1 on Windows 11 — geoidiot, Jul 27 '23 at 10:56
Is the package PyMuPDF correctly installed? To verify, run the command `pip freeze | grep PyMuPDF` in the terminal. — Stefano Fiorucci - anakin87, Jul 27 '23 at 13:34
Yes, it is installed: `pip freeze | findstr PyMuPDF` prints `PyMuPDF==1.22.5` — geoidiot, Jul 28 '23 at 07:44
Very strange... I would suggest reinstalling Haystack in a clean virtual environment. In any case, for more personalized support, you can join the Haystack Discord community... — Stefano Fiorucci - anakin87, Aug 03 '23 at 13:45
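
The version check from the comments can also be run from inside Python, which reports on the interpreter that actually executes the script rather than on whichever shell `pip` happens to be on the PATH. This may matter here: the traceback points to a Python 3.8 installation under `AppData\Local\Programs\Python\Python38`, while the install was done from Anaconda Prompt, so the two could be different environments (an assumption, not something confirmed in the thread). The following is a minimal sketch using only the standard-library `importlib.metadata` module; the helper names `installed_version` and `environment_report` are hypothetical and not part of Haystack or PyMuPDF.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names=("farm-haystack", "PyMuPDF")):
    """Collect the interpreter path and the versions of the given distributions.

    The default names are the pip distribution names mentioned in the thread.
    """
    report = {"python": sys.executable}
    for name in dist_names:
        report[name] = installed_version(name)
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value if value is not None else 'not installed'}")
```

Running this from the same PyCharm run configuration as `main.py` shows which interpreter is in use and which package versions it sees. If the PyMuPDF version printed there differs from the `1.22.5` reported in Anaconda Prompt, the two commands are looking at different environments, which would be one possible explanation for the `textpage` error.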
