I am new to programming and have a question.

I am trying to edit some .vtt files by removing certain substrings from the text while keeping the file structure intact. To do this, I copied the .vtt files into a folder and changed their extension to .txt. Now I run this simple code:

import os

file_index = 0
all_text = []
path = "/Users/username/Documents/programming/IMS/Translate/files/"
new_path = "/Users/username/Documents/programming/IMS/Translate/new_files/"

for filename in os.listdir(path):
    full_path = os.path.join(path, filename)
    if os.path.isfile(full_path):  # check the joined path; a bare filename is resolved against the current working directory
        with open(full_path, 'r') as file:  # open in read-only mode
            for line in file.read().split("\n"):  # read the whole file and split it into lines
                line = " ".join(line.split())  # collapse runs of whitespace
                start_index = line.find("[")  # index of the first "[", or -1 if there is none
                last_index = start_index + 11  # the substring to remove is 11 characters long
                if start_index != -1:
                    # keep everything before the substring and everything after it
                    line = line[:start_index] + line[last_index:]
                all_text.append(line)

I get this error message:

> File "srt-files-strip.py", line 11, in <module>
>     for line in file.read().split("\n"): #read lines and split
> File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
>     (result, consumed) = self._buffer_decode(data, self.errors, final)
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I have searched through different forums and tried changing to encoding="utf16", but to no avail. The strange thing is that it worked earlier. Then I wrote a program to rename my files automatically, and after that it threw this error. I have cleared all files in the folder and copied the original ones in again, but I can't get it to work. I would really appreciate your help, as I have no idea where to look. Thanks.
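
One way to narrow this down is to find out which directory entry actually fails to decode, and at which byte. The sketch below is only an illustration: `find_undecodable` is a hypothetical helper, not part of the script above. It reads every regular file in a folder as raw bytes and reports the first offset that is not valid in the given encoding. It assumes the folder might contain entries other than the copied transcripts (for example hidden files created by the operating system); that is a guess, not something established in this question.

```python
import os


def find_undecodable(folder, encoding="utf-8"):
    """Return (filename, offset, byte value) for each file that fails to decode.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    problems = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        if not os.path.isfile(full_path):
            continue  # skip sub-directories and other non-file entries
        with open(full_path, "rb") as fh:  # binary mode: no decoding happens here
            data = fh.read()
        try:
            data.decode(encoding)
        except UnicodeDecodeError as err:
            # err.start is the offset of the first byte that could not be decoded
            problems.append((name, err.start, data[err.start]))
    return problems
```

Calling `find_undecodable(path)` with the `path` variable from the script would list every file that triggers the error, including any that were not expected to be in the folder.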

think-maths
Ina N.
  • You have to *know* what the file encoding is, not just guess. If it's a binary file it will have arbitrary byte values mixed with the text and you won't be able to treat the whole file the same. – Mark Ransom Jan 30 '21 at 18:25
  • @MarkRansom 1) The vtt file was generated by a website that I used to transcribe videos. 2) I opened the vtt file in Notepad, then saved it as .txt. The encoding is UTF-8, according to the preferences of the editor. 3) That still doesn't explain why it all ran OK when I first used the program, with exactly the same vtt files. – Ina N. Feb 01 '21 at 06:17
  • The encoding reported by Notepad is a guess, not a guarantee - unless you re-saved the file using Notepad. As for why it worked before and not now, I can only suspect file corruption. The error message isn't wrong, 0x80 isn't a valid start byte in UTF-8. – Mark Ransom Feb 02 '21 at 04:03
  • I have compared the original .vtt file to the .txt file with UltraCompare: no difference. I have saved the .vtt file using UltraEdit with explicit UTF-8 encoding: still the same error message about the invalid start byte. I will reconstruct the original files and try again; maybe they are corrupt, as you suggest. If that doesn't work, I will probably go back to doing it manually. It may be faster after all ... – Ina N. Feb 03 '21 at 07:56
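
The point about 0x80 in the comments above can be checked directly. The sketch below uses a hypothetical helper, `decode_with_candidates`, and made-up sample bytes. cp1252 appears only as an example of a single-byte encoding in which 0x80 is a valid character (the euro sign); it is not a claim about what the files in this thread actually contain.

```python
def decode_with_candidates(data, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding; map its name to the decoded text, or None on failure.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# 0x80 has the bit pattern 10000000, which UTF-8 reserves for continuation
# bytes, so it can never be the first byte of a character.
sample = b"price: \x80 5"  # made-up bytes, not taken from the real files
print(decode_with_candidates(sample))
```

A successful decode under some candidate does not prove that candidate is the real encoding; it only shows the bytes are legal in it, which is why the encoding has to be known rather than guessed.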

0 Answers