Python fixing mojibake using Ftfy Issue

Question

A few of the text files I'm importing contain mojibake, so I'm trying to fix them with the ftfy library before feeding them to spaCy (NLP). The relevant code snippet:

import spacy
import classy_classification
import pandas as pd
import ftfy


with open ('SID - Unknown.txt', "r", encoding="utf8") as k:
    Unknown = k.read().splitlines()

data = {}
data["Unknown"] = Unknown

# NLP model
spacy.util.fix_random_seed(0)
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("text_categorizer", 
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "cat_type": "multi-label",
        "device": "gpu"
    }
)

print(ftfy.fix_text(Unknown))

I get the error:

AttributeError: 'list' object has no attribute 'find'

When I searched for this error, many threads suggested using index() instead of find() for lists. But in this case, find is called inside ftfy.fix_text. How can I get past this error? I want the data to stay a list, since that's how I feed it into the spaCy model.

Thank you

score 1 · Accepted Answer · answered Dec 08 '22 at 08:40

Welcome to Stack Overflow!

As you noticed, the error happens inside ftfy.fix_text. When something goes wrong in a function we haven't written ourselves, the next thing to check is: what are we passing into that function?

In your case, the input is Unknown, which is created like this:

with open ('SID - Unknown.txt', "r", encoding="utf8") as k:
    Unknown = k.read().splitlines()

And this is where things go wrong: Unknown is a list of strings, but the ftfy.fix_text function expects a single string (you can find some examples here).
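The error message points at the same mismatch: find is a string method, and lists don't have it. A small sketch that reproduces the message without ftfy (the variable names are made up for illustration, and it only assumes that fix_text calls find on whatever it receives, which is what the traceback suggests):

```python
# What splitlines() returns: a list of strings.
lines = ["first line", "second line"]
# What a string-based function expects: one str.
text = "first line\nsecond line"

print(hasattr(text, "find"))   # str objects have a find method
print(hasattr(lines, "find"))  # list objects do not

# Calling find on the list raises the same kind of error as in the question.
try:
    lines.find("\n")
except AttributeError as exc:
    message = str(exc)
    print(message)
```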

So there are two ways to solve your problem:

Concatenate all of the lines into one string, separating them with a space (or any other separator you like):

single_string = ' '.join(Unknown)
print(ftfy.fix_text(single_string))

Call ftfy.fix_text on each line separately and print the result:

for line in Unknown:
    print(ftfy.fix_text(line))
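Since the question asks for the result to stay a list, a third possibility is to collect the fixed lines in a new list with a list comprehension. This is a minimal sketch, not tested against your files: the short list stands in for the file contents, the name Unknown_fixed is made up for illustration, and the guarded import is only there so the sketch also runs where ftfy is not installed.

```python
# Guarded import so this sketch also runs where ftfy is not installed;
# in that case the stand-in below returns each line unchanged.
try:
    from ftfy import fix_text
except ImportError:
    def fix_text(text):
        return text

# Stand-in for the list produced by k.read().splitlines() in the question.
Unknown = ["first line", "second line"]

# Fix each line separately and collect the results in a new list.
Unknown_fixed = [fix_text(line) for line in Unknown]

# Still a list of strings, one item per input line, ready for the config dict.
data = {"Unknown": Unknown_fixed}
print(data)
```

Assigning inside a plain for loop (Unknown2 = ftfy.fix_text(line)) overwrites the variable on every pass, so only the last line survives. The comprehension keeps one fixed string per line.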

Hope this helps!

Thank you so much. The second solution (for line in Unknown: print(ftfy.fix_text(line))) worked great for me. Is there a way to assign the result to a variable? I tried for line in Unknown: Unknown2 = ftfy.fix_text(line), but it didn't work. I essentially want the fixed lines assigned to the data list so I can feed them into spaCy. I have multiple files like Unknown, and each line in each file should be treated as a separate item, so spaCy can match against similar lines when I feed in the test dataset. — Sang, Dec 08 '22 at 20:28
You're welcome! Don't hesitate to upvote and accept the answer if it helped you; this helps maintain a high-quality website. Here on Stack Overflow, the idea is to ask one question per post. If you have another question (like you do now), please make a new post and you'll get an answer quite quickly! :) — Koedlt, Dec 08 '22 at 21:04
