
I am writing a single file that combines all the files inside a folder, and I want the resulting text file to be UTF-8 encoded. My code is as follows:

import os
import codecs
import re

def file_concatenation(path):
    with codecs.open('C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt', 'w', encoding='utf8') as outfile:
        for root, dirs, files in os.walk(path):
            for dir_name in dirs:
                for fname in os.listdir(root + "/" + dir_name):
                    with open(root + "/" + dir_name + "/" + fname) as infile:
                        for line in infile:
                            # keep letters only, then collapse runs of whitespace
                            new_line = re.sub(r'[^a-zA-Z]', ' ', line)
                            outfile.write(re.sub(r"\s\s+", " ", new_line.lstrip()))

file_concatenation('C:/Users/JAYASHREE/Documents/NLP/bbc-fulltext/bbc')

When I use chardetect to check the encoding, it reports ASCII with confidence 1.0:

C:\Users\JAYASHREE>chardetect "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt"
C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt: ascii with confidence 1.0

Kindly help me resolve the issue. Thanks.

Jayashree
  • Maybe you have no non-ASCII characters. Try adding a character that is not in ASCII to your file. – litelite Aug 08 '17 at 17:11
  • @litelite My corpus consists of only words separated by a single space. – Jayashree Aug 08 '17 at 17:13
  • @Jayashree what difference does that make? ASCII is a proper subset of Unicode. – Jared Smith Aug 08 '17 at 17:14
  • @Jayashree Try to add a non-ASCII char and see the difference. Since UTF-8 was made to be compatible with ASCII, any ASCII string is also a valid UTF-8 string, so if you have no non-ASCII characters, chardetect has no way of knowing which one it is and defaults to ASCII, since it's the simpler of the two (see the sketch after these comments). Here's a non-ASCII character if you need one for your test --> é – litelite Aug 08 '17 at 17:15
  • How about encoding the line before you write it to a file? `outfile.write(re.sub("\s\s+", " ", new_line.lstrip()).encode('utf-8'))` – sophros Aug 08 '17 at 17:18
  • @JaredSmith I am getting a UnicodeError while decoding with utf-8 – Jayashree Aug 08 '17 at 17:30
  • @Jayashree then post the error and the text that caused it and ask about that instead of assuming you know what's causing the problem. As litelite and I have said, ASCII is valid unicode (although not vice-versa). – Jared Smith Aug 08 '17 at 17:32
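To illustrate litelite's point, here is a minimal sketch (assuming the chardet package behind chardetect can be imported directly; the sample strings are made up):

import chardet

# Pure-ASCII bytes are also valid UTF-8, so the detector
# reports the simpler of the two possibilities: ASCII.
ascii_bytes = "words separated by a single space".encode("utf-8")
print(chardet.detect(ascii_bytes))
# e.g. {'encoding': 'ascii', 'confidence': 1.0, ...}

# A single non-ASCII character ('é') produces a multi-byte UTF-8
# sequence, which the detector can distinguish from plain ASCII.
utf8_bytes = "words with an accent: é".encode("utf-8")
print(chardet.detect(utf8_bytes))
# e.g. {'encoding': 'utf-8', ...}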

1 Answer


Use encoding='utf-8-sig' to force a BOM at the start of the file. It should get picked up by chardetect.
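
A minimal sketch of that suggestion, mirroring the question's codecs.open call (the output filename here is just a placeholder):

import codecs

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) when writing,
# so chardetect can recognise the file as UTF-8 even if every
# character in it happens to be ASCII.
with codecs.open('text-corpus.txt', 'w', encoding='utf-8-sig') as outfile:
    outfile.write('only ascii words here')

Running chardetect on the result should then report UTF-8-SIG (the exact label depends on the chardet version) instead of ascii.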

user2722968