I can't import pdftotext on my new M1 Mac. The steps I took are:

  1. Install Python 3.10

  2. Install the command line developer tools

  3. Run `pip3 install pdftotext` from the terminal

  4. Open IDLE and type `import pdftotext`

  5. I get this error:

    Traceback (most recent call last):
      File "<pyshell#9>", line 1, in <module>
        import pdftotext
    ImportError: dlopen(/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pdftotext.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_ZN7poppler24set_debug_error_functionEPFvRKNSt3__112basic_stringIcNS0_11char_traitsIcEENS0_9allocatorIcEEEEPvES9'
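For anyone trying to reproduce this, a few facts about the environment may help narrow it down. The sketch below is only a diagnostic aid, not a fix: it reports the interpreter's architecture and where the compiled extension lives, without importing it (so it does not hit the `dlopen` error). The helper name `describe_environment` is made up for this example, and the idea that an architecture or library mismatch is involved is an assumption, not something the traceback confirms.

```python
import importlib.util
import platform
import sys


def describe_environment(module_name="pdftotext"):
    """Collect facts that help diagnose a failed native-extension import."""
    # find_spec locates a top-level module on sys.path without executing it,
    # so it does not trigger the dlopen error seen on import.
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    return {
        "python_version": platform.python_version(),
        "machine": platform.machine(),  # e.g. "arm64" or "x86_64"
        "executable": sys.executable,
        "module_path": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Posting this output together with the traceback makes it easier for others to tell which Python installation and which build of the extension are involved.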

I have already spent a few hours searching for this error message.

Any suggestions?

PS: I have tried several other pdf -> text packages, but they don't read the full pdf. For some weird reason, the pdfs I need to read are really complex and many packages don't read them fully. pdftotext does. So what I need is help to make this pdftotext work.
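To make "don't read the full pdf" concrete, one option is to compare how many characters each package returns per page. This is a minimal sketch that assumes you already have the per-page text from two extractors as lists of strings; the function name `compare_page_lengths` and the sample data are hypothetical, for illustration only.

```python
from itertools import zip_longest


def compare_page_lengths(pages_a, pages_b):
    """Return a list of (page_number, chars_a, chars_b) tuples.

    pages_a and pages_b are lists of per-page strings from two extractors.
    A page missing from one list counts as 0 characters.
    """
    rows = []
    paired = zip_longest(pages_a, pages_b, fillvalue="")
    for number, (a, b) in enumerate(paired, start=1):
        rows.append((number, len(a.strip()), len(b.strip())))
    return rows


if __name__ == "__main__":
    # Made-up per-page output from two libraries, for illustration only.
    full = ["Page one text.", "Page two text with a table."]
    partial = ["Page one text."]
    for number, chars_a, chars_b in compare_page_lengths(full, partial):
        print(f"page {number}: {chars_a} vs {chars_b} characters")
```

Pages where one count is much lower than the other are the ones worth inspecting by hand.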

Antonio
  • My guess is that this is a problem with the native code portion of the library. Have you checked the site for `pdftotext` to see whether it says the library should work on Apple silicon? You might want to find a forum specific to the package and post this question there. – CryptoFool Mar 06 '22 at 18:01
  • Thanks for the suggestion: I have just posted a new issue in the package site https://pypi.org/project/pdftotext/ – Antonio Mar 06 '22 at 18:36

1 Answer

I don't think pdftotext is a good library. Use PyPDF2 instead; I find it better. Here is an example:

    import PyPDF2

    # PdfReader, .pages and extract_text() are the current PyPDF2 names;
    # older releases use PdfFileReader, getPage() and extractText().

    # Open the PDF in binary read mode ('rb').
    with open('1.pdf', 'rb') as pdffileobj:
        # Create a reader for the open file object.
        pdfreader = PyPDF2.PdfReader(pdffileobj)

        # Page indexes start at 0, so the last page is len(pdfreader.pages) - 1.
        # Extract the text of every page and join it into one string.
        text = "\n".join(page.extract_text() or "" for page in pdfreader.pages)

    # Append the extracted text to a .txt file.
    # Put r before the path so backslashes in a Windows path are not treated
    # as escapes, and replace the path below with your own file location.
    with open(r"C:\Users\SIDDHI\AppData\Local\Programs\Python\Python38\1.txt", "a", encoding="utf-8") as file1:
        file1.write(text)
hmody3000
  • Thanks a lot. However, PyPDF2 does not read all the text in the PDF. It misses a lot of text. That is why I chose pdftotext. It would help if you know how to make pdftotext work in a Mac M1. – Antonio Mar 06 '22 at 17:45
  • Recommending a tool or library is out of scope for SO. This also doesn't answer the question that was asked. – CryptoFool Mar 06 '22 at 17:56