
I would like to extract some general properties of pdf files. So far this has worked very well, except I've encountered a weird new error today while trying out a new input file.

For the parsing, I'm using pdfminer.six. This is what the code to extract title, author, subject, etc. looks like:

pdf_data = {
    'Number of words': len(words),
    'Number of paragraphs': len(paragraphs),
    'Number of pages': len(pages)
}

if len(doc.info) > 0:
    pdf_data_keys = ["Title", "Author", "Subject"]
    encodings = ["latin", "windows-1252"]
    for key in pdf_data_keys:
        properValue = ""
        value = doc.info[0].get(key)
        if value is not None:
            for enc in encodings:
                try:
                    properValue = value.decode(encoding=enc)
                    break
                except Exception:
                    continue
        if properValue != "":
            pdf_data[key] = properValue

Now, after running the code, I don't get any errors, but this is what shows up as the title of the new file:

'Title': 'Ú\x9f\x00ä\x00?\x00_\x00ë\x00È\x00/\x00Å\x00Á\x00\x80\x00(\x00ë\x00ä\x00ñ\x00\x80\x00ï\x00?\x00Ê\x00%\x00À\x00\x80\x00è\x00ê\x00+\x00\x80\x00í\x00ä\x00ñ\x00è\x00ë\x00\x80\x00á\x00è\x00ã\x00\x80\x00@\x00\x80\x00á\x00è\x00ã\x00\x91\x00\x91\x00\x90\x00\x80\x00@\x00\x80\x00<\x00í\x00\x90\x00\x93\x00\x99\x00\x16\x00\x94\x00\x99\x00\x94\x00\x95\x00\x96\x00\x16\x00\x80\x00@\x00\x80\x00ü\x00Â\x00Á\x00Ê\x00Ë\x00Ñ\x00Ä\x00Ç\x00È\x00\x80\x00@\x00\x80\x00Â\x00?\x00Á\x00Ê\x00Ë\x00Á\x00\x05\x00Ë\x00È\x00Í\x00È\x00È\x00Å\x00/\x00Ê\x00È\x00\x06\x00À\x00Á'}

I've since tried every encoding Python supports, expanding the short list above into this:

encodings = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 'cp500', 'cp720', 'cp737',
                 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 'cp858', 'cp860', 'cp861', 'cp862', 'cp863',
                 'cp864', 'cp865', 'cp866', 'cp869', 'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006',
                 'cp1026', 'cp1125', 'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255',
                 'cp1256', 'cp1257', 'cp1258', 'cp65001', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
                 'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
                 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 'iso8859_2', 'iso8859_3', 'iso8859_4',
                 'iso8859_5', 'iso8859_6', 'iso8859_7', 'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11',
                 'iso8859_13', 'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
                 'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 'mac_turkish',
                 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 'utf_32', 'utf_32_be', 'utf_32_le',
                 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 'utf_8', 'utf_8_sig', "idna", "mbcs", "oem", "palmos",
                 "punycode", "raw_unicode_escape", "rot_13", "undefined", "unicode_escape", "unicode_internal",
                 "base64_codec", "bz2_codec", "hex_codec", "quopri_codec", "string_escape", "uu_codec",
                 "zlib_codec"]

Yet this doesn't help either.

I'm an absolute beginner when it comes to Python, and I would really appreciate any help with this.

Have a great day!

Lena

1 Answer


The string is UTF-16BE encoded. You can check for this by looking for the byte order mark (BOM) at the start of the string: \xFE\xFF

The text string type is specified in PDF 32000-1:2008, section 7.9.2.2. A text string can be encoded either in PDFDocEncoding (similar to ISO Latin 1) or in UTF-16BE. In PDF 2.0 it can also be a UTF-8 encoded string.
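A minimal sketch of that dispatch (the function name is mine, and plain latin-1 is used here as a rough stand-in for PDFDocEncoding, which differs from it in a handful of code points):

```python
def decode_pdf_text_string(raw: bytes) -> str:
    """Decode a PDF text string along the lines of PDF 32000-1:2008, 7.9.2.2 (sketch)."""
    if raw.startswith(b"\xfe\xff"):
        # UTF-16BE, signalled by the byte order mark FE FF
        return raw[2:].decode("utf-16-be")
    if raw.startswith(b"\xef\xbb\xbf"):
        # UTF-8 with BOM (allowed since PDF 2.0)
        return raw[3:].decode("utf-8")
    # No BOM: treat as PDFDocEncoding; latin-1 is only an approximation
    return raw.decode("latin-1")

print(decode_pdf_text_string(b"\xfe\xff\x00H\x00i"))  # Hi
```

Note that brute-forcing a list of codecs cannot settle this question: lenient codecs such as latin-1 decode any byte sequence without raising, so a try/except loop like the one in the question always "succeeds" on a wrong codec before an exception can steer it to the right one.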

Jan Slabon
    Hey Jan, thank you for your response. As you can see above, I included all the UTF encodings in the loop. Sadly, that didn't fix the issue :/ – Lena Nov 30 '17 at 08:56