
I built a straightforward Python app: a file selection box opens, the user chooses a PDF file, and the text from the PDF is exported to a CSV.

I packaged this as an .exe from within a virtualenv where I installed only the libraries I'm importing (plus PyMuPDF), yet the package is still 1.4 GB.

The script:

import textract
import csv
import codecs
import fitz
import re
import easygui
from io import open


from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

filename = easygui.fileopenbox()

pdfFileObj = fitz.open(filename)

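text = ""

# Collect the embedded text of every page.
for page in pdfFileObj:
    text += page.getText()  # named get_text() in newer PyMuPDF releases

# No embedded text (e.g. a scanned PDF): fall back to OCR.
if not text:
    # textract.process returns bytes, so decode before tokenizing.
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')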

tokens = word_tokenize(text)

punctuations = ['(',')',';',':','[',']',',']

stop_words = stopwords.words('english')

keywords = [word for word in tokens if word not in stop_words and word not in punctuations]

# Write one keyword per row.
with open('ar.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for keyword in keywords:
        writer.writerow([keyword])

Some context:

Within my venv, the entire lib folder is only about 400 MB. So how do I find out what is being added to the .exe to make it 1.4 GB?
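One rough way to start looking, sketched here under assumptions: if the packager can produce an unpacked output folder instead of a single file, the biggest top-level entries in that folder can be listed and compared against the venv's lib folder. The helper names `dir_size` and `largest_entries` are made up for this illustration, and the folder path in the trailing comment is hypothetical.

```python
import os
from pathlib import Path


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def largest_entries(folder, top=10):
    """Return (size_bytes, name) pairs for the biggest immediate children of folder."""
    sizes = []
    for entry in Path(folder).iterdir():
        size = dir_size(entry) if entry.is_dir() else entry.stat().st_size
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:top]


# Hypothetical usage; replace the path with the real build or lib folder:
# for size, name in largest_entries("path/to/build/output"):
#     print(f"{size / 1_048_576:10.1f} MB  {name}")
```

Running the same listing on the venv's lib folder and on the unpacked build should show which entries exist only in the build, or are much larger there.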

  • Is there any error or problem here? – Timothy Chen Dec 14 '20 at 00:52
  • Try to follow this question https://stackoverflow.com/questions/47692213/reducing-size-of-pyinstaller-exe – mtdot Dec 14 '20 at 00:55
  • @mtdot Yes that's the post I've read previously. I created a new virtualenv, so I'm not sure why it would ever get to be 1.4gb. These packages should all be lightweight. – DecentExperience Dec 14 '20 at 01:11
  • You could try compiling the exe while importing the packages one at a time, to figure out which one is the culprit for the increased file size – oskros Jan 05 '21 at 12:44
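The one-package-at-a-time suggestion in the last comment could be set up roughly as follows. This is only a sketch: `write_probe_scripts` and the `probe_*.py` file names are invented for illustration, and the module list is simply the third-party imports from the script in the question.

```python
from pathlib import Path

# Third-party imports taken from the script in the question; stdlib modules are left out.
CANDIDATES = ("textract", "fitz", "easygui", "nltk")


def write_probe_scripts(out_dir, modules=CANDIDATES):
    """Create one tiny script per module, each importing only that module."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for mod in modules:
        script = out / f"probe_{mod}.py"
        script.write_text(
            f"import {mod}\nprint('{mod} imported')\n", encoding="utf-8"
        )
        paths.append(script)
    return paths
```

Each probe script can then be packaged with the same tool and settings as the real app, and the resulting sizes compared to see which import pulls in the most.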
