
My Python script converts PDF files to images for use with PyTesseract:

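import io

import pytesseract
from bs4 import BeautifulSoup
from PIL import Image
from wand.image import Image as wi  # assumed alias; the snippet uses `wi` without showing its import

def images(inputFile):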
    pdfFile = wi(filename = inputFile, resolution=600)

    formato = 'png'
    image = pdfFile.convert(formato)

    pag = 0
    dfs = []

    for img in image.sequence:
        pag += 1
        img.rotate(90)

        # HOCR
        with img[1100:4190, 1150:3080] as cropped: #[left:right, top:bottom]
            imgPage = wi(image = cropped)
            imageBlob = imgPage.make_blob(formato)
            horas = gerarHocr(imageBlob)

def gerarHocr(imageBlob):
    image = Image.open(io.BytesIO(imageBlob))
    markup = pytesseract.image_to_pdf_or_hocr(image, lang='por', extension='hocr')
    soup = BeautifulSoup(markup, features='html.parser')

    spans = soup.find_all('span', {'class' : 'ocrx_word'})

    listHoras = []

    for sp in spans:
        # The hOCR title typically reads "bbox x0 y0 x1 y1; x_wconf NN"; fields 1-4 are the box coordinates
        bbox = sp.get('title').split()[1:5]
        # horaMarcada is defined elsewhere (not shown here)
        hora = horaMarcada(*bbox, sp.get_text().split()[0])
        listHoras.append(hora)

    return listHoras

images('foo.pdf')

After it runs, a large number of Magick-* files are left in the temp folder; nothing deletes them automatically.

I tried several ways to stop Wand from leaving these files behind.

In policy.xml, I changed the disk limit from <!-- <policy domain="resource" name="disk" value="16EB"/> --> to name="disk" value="1GiB":

<policymap>
  <!-- <policy domain="resource" name="temporary-path" value="/tmp"/> -->
  <!-- <policy domain="resource" name="memory" value="2GiB"/> -->
  <!-- <policy domain="resource" name="map" value="4GiB"/> -->
  <!-- <policy domain="resource" name="width" value="10MP"/> -->
  <!-- <policy domain="resource" name="height" value="10MP"/> -->
  <!-- <policy domain="resource" name="area" value="1GB"/> -->
  <!-- <policy domain="resource" name="disk" value="1GiB"/> -->
  <!-- <policy domain="resource" name="file" value="768"/> -->
  <!-- <policy domain="resource" name="thread" value="4"/> -->
  <!-- <policy domain="resource" name="throttle" value="0"/> -->
  <!-- <policy domain="resource" name="time" value="3600"/> -->
  <!-- <policy domain="system" name="precision" value="6"/> -->
  <!-- <policy domain="coder" rights="none" pattern="MVG" /> -->
  <!-- <policy domain="delegate" rights="none" pattern="HTTPS" /> -->
  <!-- <policy domain="path" rights="none" pattern="@*" /> -->
  <policy domain="cache" name="shared-secret" value="passphrase" stealth="true"/>
</policymap>

But that didn't work.
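
One thing worth checking here: in the file as pasted, every resource line, including the edited disk entry, is still wrapped in <!-- ... -->, so it may not be in effect at all. The following is a minimal sketch, using only the standard library, that lists the entries an XML parser actually sees. The `active_policies` helper and the abbreviated `POLICY_XML` string are illustrative assumptions, not part of Wand or ImageMagick.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the policy file shown above (illustrative only).
POLICY_XML = """<policymap>
  <!-- <policy domain="resource" name="temporary-path" value="/tmp"/> -->
  <!-- <policy domain="resource" name="disk" value="1GiB"/> -->
  <policy domain="cache" name="shared-secret" value="passphrase" stealth="true"/>
</policymap>"""


def active_policies(xml_text):
    """Return (domain, name, value) for each policy element that is not commented out."""
    # The default ElementTree parser discards XML comments, so commented-out
    # <policy> lines never show up in the parsed tree.
    root = ET.fromstring(xml_text)
    return [
        (policy.get("domain"), policy.get("name"), policy.get("value"))
        for policy in root.iter("policy")
    ]


print(active_policies(POLICY_XML))
```

If the disk or temporary-path entries are missing from the result, removing the surrounding <!-- and --> from those lines is probably the first thing to try.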

Marcelo Gazzola
  • Stack Overflow is for _programming_ questions. Presumably you ran some Python code to get to where you are? What code did you run? Can you provide a [mcve]? – ChrisGPT was on strike Dec 01 '19 at 19:31
  • Check it now. My problem is not with my code; it is a problem with Wand – Marcelo Gazzola Dec 01 '19 at 20:09
  • Be very careful assuming there's nothing wrong with your code. But even if that's not where the problem is we need to know what you're doing if we're going to be able to help. Please read [ask] for more details. Thanks for updating. – ChrisGPT was on strike Dec 01 '19 at 21:40
  • All the "magick-..." files are created when your tmp directory does not have enough space to finish the processing. ImageMagick will not automatically delete those. It will only delete them if the processing is successful. Either increase your tmp directory space or set the tmp directory to one that has enough space. Be sure you have the correct permissions on the tmp directory so that ImageMagick can delete them. – fmw42 Dec 01 '19 at 23:06
  • fmw42, if I change the temp folder in policy.xml, will these files be deleted automatically afterwards? – Marcelo Gazzola Dec 02 '19 at 22:04
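
The suggestion in the comments (use a temp directory with enough space, and clear out what was left behind) could be sketched roughly as follows. This uses only the standard library and assumes the installed ImageMagick honors the MAGICK_TEMPORARY_PATH environment variable, as its resource documentation describes. The helper names `use_magick_temp_dir` and `remove_stale_magick_files`, and the one-hour age threshold, are hypothetical choices for illustration.

```python
import glob
import os
import tempfile
import time


def use_magick_temp_dir(path):
    """Create `path` and point ImageMagick at it; call this before importing wand."""
    os.makedirs(path, exist_ok=True)
    # Assumption: ImageMagick reads this variable when it initializes.
    os.environ["MAGICK_TEMPORARY_PATH"] = path
    return path


def remove_stale_magick_files(directory=None, max_age_seconds=3600):
    """Delete leftover magick-* files in `directory` older than `max_age_seconds`."""
    directory = directory or tempfile.gettempdir()
    cutoff = time.time() - max_age_seconds
    removed = []
    # Match both "magick-..." and "Magick-..." spellings.
    for path in glob.glob(os.path.join(directory, "[Mm]agick-*")):
        try:
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
        except OSError:
            # The file may still be in use or not ours to delete; skip it.
            continue
    return removed
```

Calling `remove_stale_magick_files()` after a run only touches files older than the threshold, so files belonging to a conversion that is still in progress should be left alone.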
