I built a very straightforward Python app: a file selection box opens, the user chooses a PDF file, and the text from the PDF is exported to a CSV.
I packaged this as a .exe from within a virtualenv in which I installed only the libraries I'm importing (plus PyMuPDF), yet the package is still 1.4GB.
The script:
import textract
import csv
import codecs
import fitz  # PyMuPDF
import re
import easygui
from io import open
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Ask the user to pick a PDF and extract its embedded text with PyMuPDF.
filename = easygui.fileopenbox()
pdfFileObj = fitz.open(filename)
text = ""
for page in pdfFileObj:
    text += page.getText()
re.sub(r'\W+', '', text)  # result is not assigned, so text is unchanged

# No embedded text (e.g. a scanned PDF): fall back to OCR via textract.
if text == "":
    # textract.process returns bytes, so decode before tokenizing.
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

# Tokenize and drop English stop words and punctuation.
tokens = word_tokenize(text)
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]

# Write one keyword per row.
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for keyword in keywords:
        writer.writerow([keyword])
Some context:
Within my venv, the entire lib folder is about 400MB. So how do I find out what is being added to the .exe that's making it 1.4GB?
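One rough way to start answering this is to measure what takes up space on disk. The following is only a minimal sketch using the standard library. The helper names `dir_sizes` and `largest` are made up for illustration. It also assumes the packaging tool can produce an unpacked output folder rather than a single file, which may not hold for every tool. Running it against both the venv's lib folder and that output folder, then comparing the two lists, should show which entries only appear (or grow) in the packaged build.

```python
import os
from collections import defaultdict


def dir_sizes(root):
    """Return {top-level entry name: total bytes} for everything under root."""
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Attribute each file to the first path component below root;
        # files directly in root are listed under their own name.
        top = None if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinked files so their targets are not counted
            totals[top if top is not None else name] += os.path.getsize(path)
    return dict(totals)


def largest(root, n=15):
    """Return the n largest top-level entries under root as (name, bytes) pairs."""
    sizes = dir_sizes(root)
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]
```

For example, `for name, size in largest(some_folder): print(f"{size / 1e6:10.1f} MB  {name}")` prints the biggest packages first, where `some_folder` is a placeholder for whichever directory is being inspected.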