
I am writing a single file that combines all the files inside a folder, and I want the resulting text file to be UTF-8 encoded. My code is as follows:

import os
import codecs
import re

def file_concatenation(path):
    with codecs.open('C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt', 'w', encoding='utf8') as outfile:
        for root, dirs, files in os.walk(path):
            for dir_name in dirs:
                for fname in os.listdir(root + "/" + dir_name):
                    with open(root + "/" + dir_name + "/" + fname) as infile:
                        for line in infile:
                            # keep letters only, then collapse runs of whitespace
                            new_line = re.sub(r'[^a-zA-Z]', ' ', line)
                            outfile.write(re.sub(r"\s\s+", " ", new_line.lstrip()))

file_concatenation('C:/Users/JAYASHREE/Documents/NLP/bbc-fulltext/bbc')

When I use chardetect to check the encoding, it reports ASCII with confidence 1.0:

C:\Users\JAYASHREE>chardetect "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt"
C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt: ascii with confidence 1.0

Kindly help me resolve the issue. Thanks.

Jayashree
  • Maybe you have no non-ASCII characters. Try adding a character that is not in ASCII to your file. – litelite Aug 08 '17 at 17:11
  • @litelite My corpus consists of only words separated by a single space. – Jayashree Aug 08 '17 at 17:13
  • @Jayashree what difference does that make? ASCII is a proper subset of Unicode. – Jared Smith Aug 08 '17 at 17:14
  • @Jayashree Try to add a non-ASCII char and see the difference. Since UTF-8 was made to be compatible with ASCII, any ASCII string is also a valid UTF-8 string, so if you have no non-ASCII characters, chardetect has no way of knowing which one it is and defaults to ASCII, since it's the simpler of the two (see the sketch after these comments). Here's a non-ASCII character if you need one for your test --> é – litelite Aug 08 '17 at 17:15
  • How about encoding the line before you write it to a file? `outfile.write(re.sub("\s\s+", " ", new_line.lstrip()).encode('utf-8'))` – sophros Aug 08 '17 at 17:18
  • @JaredSmith I am getting a UnicodeError while decoding with utf-8 – Jayashree Aug 08 '17 at 17:30
  • @Jayashree then post the error and the text that caused it and ask about that instead of assuming you know what's causing the problem. As litelite and I have said, ASCII is valid unicode (although not vice-versa). – Jared Smith Aug 08 '17 at 17:32
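To illustrate litelite's point, here is a minimal sketch (assuming the chardet package behind chardetect can be imported directly; the sample strings are made up):

import chardet

# Pure-ASCII bytes are also valid UTF-8, so the detector
# reports the simpler of the two possibilities: ASCII.
ascii_bytes = "words separated by a single space".encode("utf-8")
print(chardet.detect(ascii_bytes))
# e.g. {'encoding': 'ascii', 'confidence': 1.0, ...}

# A single non-ASCII character ('é') produces a multi-byte UTF-8
# sequence, which the detector can distinguish from plain ASCII.
utf8_bytes = "words with an accent: é".encode("utf-8")
print(chardet.detect(utf8_bytes))
# e.g. {'encoding': 'utf-8', ...}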

1 Answer


Use encoding='utf-8-sig' to force a BOM at the start of the file. It should get picked up by chardetect.
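
A minimal sketch of that suggestion, mirroring the question's codecs.open call (the output filename here is just a placeholder):

import codecs

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) when writing,
# so chardetect can recognise the file as UTF-8 even if every
# character in it happens to be ASCII.
with codecs.open('text-corpus.txt', 'w', encoding='utf-8-sig') as outfile:
    outfile.write('only ascii words here')

Running chardetect on the result should then report UTF-8-SIG (the exact label depends on the chardet version) instead of ascii.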

user2722968