Running into issues opening/encoding a text file in Python

Question

Here is the raw text:

Issue / Problem Encountered                         Solution / Lessons

• Sample result on the print out was reported with a

“sample not seen” message indication

• Symbol character (*, ?) next to the sample value

result

• Impact :

– Wrong result / NC generation

– Downtime and delay in Lot disposition

Check for print out errors like:

•    If an error is displayed for example:

“sample not seen” refer to SOP- 013499 and repeat sample.

•    Sample result should not have an

interrogation mark before the sample value.

• Impact to the area:

– Minimize possible OOS results – Minimize NC

– Reduce cost for OOS

Investigation

Always ensure to verify that the print out report:

Does not has the message “sample not seen” and the symbol

“sample not

seen” message

Sample result should not have an interrogation mark before the sample value

characters on the sample value result

Now, I've used the following code to process the data:

import io
from os import listdir
from os.path import isfile, join

for ix, f in enumerate(listdir(directory_learning_group)):
    path = join(directory_learning_group, f)
    if isfile(path) and "OPL" in f:
        try:
            dataset_learning_group_OPL.loc[ix, "ID"] = f.split('_')[0]
            dataset_learning_group_OPL.loc[ix, "Filename"] = f

            # Open the file as UTF-8, silently dropping undecodable bytes
            with io.open(path, encoding='utf8', errors='ignore') as fd:
                # Read the whole file as text
                ret = fd.read()
            dataset_learning_group_OPL.loc[ix, "Text"] = ret
        except Exception:
            # Print the name of any file that could not be processed
            print(f)
dataset_learning_group_OPL = dataset_learning_group_OPL.reset_index(drop=True)

And end up with the following result:

'A\x00M\x00L\x00 \x006\x00 \x00P\x00U\x00R\x00 \x00O\x00n\x00e\x00-\x00P\x00o\x00i\x00n\x00t\x00 \x00L\x00e\x00s\x00s\x00o\x00n\x00:\x00 \x00I\x00n\x00c\x00o\x00r\x00r\x00e\x00c\x00t\x00 \x00E\x00n\x00d\x00o\x00t\x00o\x00x\x00i\x00n\x00 \x00r\x00e\x00s\x00u\x00l\x00t\x00 \x00o\x00n\x00 \x00t\x00h\x00e\x00 \x00p\x00r\x00i\x00n\x00t\x00 \x00o\x00u\x00t\x00 \x00r\x00e\x00p\x00o\x00r\x00t\

I'm having trouble understanding what exactly is happening here. This .txt file doesn't look any different from the other files, which I am able to read in without issues.

Even when I try to decode/encode it, it doesn't help at all.

Any help/guidance would be much appreciated.

Is it [`UTF-16` perhaps](https://stackoverflow.com/q/4735566/4650297)? — BruceWayne, Oct 17 '18 at 02:49
that actually worked. Do you know how I can check if a file needs to be encoded utf-8 or utf-16? — madsthaks, Oct 17 '18 at 03:10
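
One possible way to check, sketched under assumptions: look at the first few bytes of the file for a byte-order mark (BOM). The helper name `sniff_encoding` and its `default` parameter are hypothetical, not part of any library. This only helps when the file actually starts with a BOM. A UTF-16 file saved without one falls through to the default, so treat the result as a guess.

```python
import codecs


def sniff_encoding(path, default="utf-8"):
    """Guess a text encoding from the file's byte-order mark, if present.

    Hypothetical helper for illustration; returns `default` when no BOM is found.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)

    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the BOM while decoding.
        return "utf-8-sig"
    return default
```

The returned name could then be passed as the `encoding` argument when opening the file in text mode.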

score 0 · Answer 1 · answered Oct 17 '18 at 02:54

You should probably post your whole code in the question. Anyhow, I tested a raw text file containing what you posted, and the following code works on Python 3.x:

with open('10020_OPL Endotoxin testing.txt', 'rb') as f:
    # Binary mode returns raw bytes, so no decoding is attempted here
    lines = f.readlines()
    print(lines)
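
For context on the `\x00` characters in the question's output, here is a small sketch of what is probably going on, assuming the file is UTF-16 as suggested in the comments. The sample string and the temporary file are made up for illustration. UTF-16 stores each ASCII character as two bytes, one of which is a NUL byte, and decoding those bytes as UTF-8 keeps every NUL as a character.

```python
import io
import os
import tempfile

# Hypothetical sample text; the real file contents are not reproduced here.
sample = "AML 6 PUR"

# UTF-16 (little-endian) stores each ASCII character as two bytes,
# the second of which is a NUL byte.
raw = sample.encode("utf-16-le")

# Decoding those bytes as UTF-8 keeps every NUL as a '\x00' character,
# which is the pattern shown in the question.
garbled = raw.decode("utf-8")
print(repr(garbled))

# Round trip through a temporary file, this time decoding as UTF-16.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-16") as fh:
    fh.write(sample)  # the "utf-16" codec writes a BOM first
with io.open(path, encoding="utf-16") as fh:
    text = fh.read()
print(repr(text))
```

In the question's loop, the equivalent change would be passing `encoding='utf-16'` instead of `'utf8'` for the affected files. Dropping `errors='ignore'` is also worth considering, since it hides exactly the decoding failure that would have pointed at the wrong encoding.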
