The following code should walk through a directory, pick up the XML files, and process them (i.e. prefix the HTML classes stored in XML elements; however, this is not important for the question). The code works as long as there are no subdirectories inside "./input-dir", but as soon as there are subdirectories, the following error is raised:
Traceback (most recent call last):
  File "/Users/ab/Code/SHCprefixer-2022/shc-prefixer_upwork.py", line 22, in <module>
    content = file.readlines();
  File "/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 566: invalid start byte
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import os
import lxml
import re

input_path = "./input-dir";
output_path = "./output-dir";

# List every entry (files and subdirectories) directly inside the input directory
ls = os.listdir(input_path);
print(ls);

# Load the class names to prefix, one per line
with open("classes.txt", "r") as cls:
    clss = cls.readlines()
for i in range(len(clss)):
    clss[i] = clss[i].strip()
print(clss);

# Process each entry of the input directory
for d in range(len(ls)):
    with open(f"{input_path}/{ls[d]}", "r") as file:
        content = file.readlines();
    content = "".join(content)
    bs_content = BeautifulSoup(content, "lxml")
    str_bs_content = str(bs_content)
    # Strip the wrapper that the lxml parser adds around the document
    str_bs_content = str_bs_content.replace("""<?xml version="1.0" encoding="UTF-8"?><html><body>""", "");
    str_bs_content = str_bs_content.replace("</body></html>", "");
    # Prefix every known class name
    for j in range(len(clss)):
        str_bs_content = str_bs_content.replace(clss[j], f"prefix-{clss[j]}")
    with open(f"{output_path}/{ls[d]}", "w") as f:
        f.write(str_bs_content)
The error is probably related to the os.listdir() call. As suggested in the question "IsADirectoryError: [Errno 21] Is a directory: " It is a file", I should use os.walk() instead, but I wasn't able to implement it. It would be great if someone could help.
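
For reference, here is a minimal sketch of what the os.walk() approach might look like. It assumes that only files ending in .xml should be processed and that the subdirectory layout should be mirrored in the output directory. The names iter_xml_files, process_tree and transform are hypothetical and not part of the code above; transform stands in for the BeautifulSoup and prefixing steps. Note that the traceback shows a UnicodeDecodeError rather than an IsADirectoryError, so the failing entry may be a non-UTF-8 file (for example a macOS .DS_Store) rather than a directory. Filtering by extension would skip both kinds of entry.

```python
import os


def iter_xml_files(root):
    """Yield paths of .xml files under root (relative to root), including subdirectories."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order deterministic
        for name in sorted(filenames):
            # Skip anything that is not an XML file (directories never appear in
            # filenames, and binary files such as .DS_Store are filtered out here)
            if name.lower().endswith(".xml"):
                yield os.path.relpath(os.path.join(dirpath, name), root)


def process_tree(input_path, output_path, transform):
    """Apply transform (str -> str) to every XML file, mirroring the directory layout."""
    written = []
    for rel_path in iter_xml_files(input_path):
        src = os.path.join(input_path, rel_path)
        dst = os.path.join(output_path, rel_path)
        # Create the matching subdirectory in the output tree if it does not exist yet
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        # Assumption: the XML files are UTF-8 encoded
        with open(src, "r", encoding="utf-8") as f:
            content = f.read()
        with open(dst, "w", encoding="utf-8") as f:
            f.write(transform(content))
        written.append(rel_path)
    return written
```

It could then be called as process_tree("./input-dir", "./output-dir", prefix_classes), where prefix_classes is a hypothetical function that wraps the parsing and replace steps from the loop body above and returns the modified string.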