Processing files with listdir() breaks when directory contains subdirectories

Question

The following code should walk through a directory, pick up the XML files, and process them (it prefixes HTML classes stored in XML elements, but that is not important for the question). The code works as long as there are no subdirectories inside "./input-dir", but as soon as there are subdirectories, it fails with this error:

Traceback (most recent call last):
  File "/Users/ab/Code/SHCprefixer-2022/shc-prefixer_upwork.py", line 22, in <module>
    content = file.readlines();
  File "/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 566: invalid start byte

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import os
import lxml
import re

input_path = "./input-dir";
output_path = "./output-dir";

ls = os.listdir(input_path);
print(ls);

with open("classes.txt", "r") as cls:
    clss = cls.readlines()
    for i in range(len(clss)):
        clss[i] = clss[i].strip()
print(clss);

for d in range(len(ls)):

    with open(f"{input_path}/{ls[d]}", "r") as file:
        content = file.readlines();
        content = "".join(content)
        bs_content = BeautifulSoup(content, "lxml")
        str_bs_content = str(bs_content)
        str_bs_content = str_bs_content.replace("""<?xml version="1.0" encoding="UTF-8"?><html><body>""", "");
        str_bs_content = str_bs_content.replace("</body></html>", "");
        for j in range(len(clss)):
            str_bs_content = str_bs_content.replace(clss[j], f"prefix-{clss[j]}")
    with open(f"{output_path}/{ls[d]}", "w") as f:
        f.write(str_bs_content)

The error is probably related to the listdir() call. As indicated in "IsADirectoryError: [Errno 21] Is a directory: " It is a file", I should use os.walk(), but I wasn't able to implement it. It would be great if someone could help.

Yes! And it would be best if only XML files are processed to avoid issues. — Madamadam, Dec 08 '22 at 23:01
You don't really know they are XML files without trying, unless you want to assume that it's just everything with a ".xml" extension. Is that okay? — tdelaney, Dec 08 '22 at 23:04

score 1 · Answer 1 · answered Dec 08 '22 at 23:01

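It looks like you need to filter the directories out of the listing. You can use os.path.isfile() for that check. Because os.listdir() returns bare names, join each name with input_path first; otherwise the check runs against the current working directory. With a list comprehension you can get the filtered list in one line:

ls = [f for f in os.listdir(input_path) if os.path.isfile(os.path.join(input_path, f))]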
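
To check this kind of filter in isolation, here is a minimal sketch that runs against a throwaway temporary directory. The helper name list_xml_files and the demo file names are hypothetical, and the ".xml" extension test is an added assumption based on the asker's comment that only XML files should be processed. Note that it only covers the top level of the directory and does not descend into subdirectories.

```python
import os
import tempfile


def list_xml_files(directory):
    """Return the sorted names of regular files in `directory` ending in .xml."""
    return sorted(
        name
        for name in os.listdir(directory)
        # os.listdir() yields bare names, so join with the directory before testing.
        if os.path.isfile(os.path.join(directory, name))
        and name.endswith(".xml")
    )


# Demo on a temporary directory containing two XML files, a text file,
# and a subdirectory (all names are made up for illustration).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "subdir"))
    for name in ("a.xml", "b.xml", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write("<root/>")
    found = list_xml_files(tmp)
    print(found)
```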

tdelaney · Accepted Answer · 2022-12-09T00:45:51.940

You need to test whether the returned file system name is a file. You also want to search the entire subtree. Instead of listdir you could use os.walk, but I think the newer pathlib module better suits your needs. Its .glob method, when used with "**", searches the subtree and filters for a known file extension at the same time.

from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET
import lxml
import re
from pathlib import Path

input_path = Path("./input-dir")
output_path = Path("./output-dir")

ls = [p for p in input_path.glob("**/*.xml") if p.is_file()]
print(", ".join(str(p) for p in ls))

with open("classes.txt", "r") as cls:
    clss = cls.readlines()
    for i in range(len(clss)):
        clss[i] = clss[i].strip()
print(clss)

for infile in ls:
    with infile.open() as file:
        # Parse with the lxml HTML parser, which wraps the content in <html><body>.
        bs_content = BeautifulSoup(file.read(), "lxml")
        str_bs_content = str(bs_content)
        # Strip the wrapper that the parser added around the original markup.
        str_bs_content = str_bs_content.replace("""<?xml version="1.0" encoding="UTF-8"?><html><body>""", "")
        str_bs_content = str_bs_content.replace("</body></html>", "")
        for name in clss:
            str_bs_content = str_bs_content.replace(name, f"prefix-{name}")
    # Mirror the input file's relative path under the output directory.
    outfile = output_path / infile.relative_to(input_path)
    outfile.parent.mkdir(parents=True, exist_ok=True)
    with outfile.open("w") as f:
        f.write(str_bs_content)
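
For comparison, the os.walk alternative mentioned above could look roughly like the following. This is only a sketch under assumptions: the helper name find_xml_files and the demo file names are hypothetical, and it covers just the file discovery step, not the BeautifulSoup processing.

```python
import os
import tempfile


def find_xml_files(root):
    """Return the paths of all .xml files under `root`, relative to `root`."""
    matches = []
    # os.walk() visits `root` and every directory below it.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".xml"):
                full = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full, root))
    return sorted(matches)


# Demo on a temporary tree with one nested XML file (made-up names).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "studium-d"))
    for rel in ("top.xml", os.path.join("studium-d", "scales.xml"), "readme.txt"):
        with open(os.path.join(tmp, rel), "w") as fh:
            fh.write("<root/>")
    walked = find_xml_files(tmp)
    print(walked)
```

Since os.walk only reports entries from filenames that are not directories, no separate is-file check is needed here.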

An error is thrown because the output directory structure doesn't exist; it has to be created (only the top folder exists at the start): ```File "/Users/ab/Code/SHCprefixer-2022/shc-prefixer_stacko2.py", line 28, in with outfile.open("w") as f: File "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/pathlib.py", line 1119, in open return self._accessor.open(self, mode, buffering, encoding, errors, FileNotFoundError: [Errno 2] No such file or directory: 'output-dir/studium-d/scales.xml'``` — Madamadam, Dec 08 '22 at 23:39
@Madamadam - I updated to create directories. Hopefully that fixes it. — tdelaney, Dec 09 '22 at 01:07
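
The directory creation discussed in these comments can be tried on its own. The sketch below assumes a temporary directory standing in for the real input and output folders, and the helper name mirrored_output is hypothetical. It maps an input file to the same relative path under the output root and creates any missing parent folders before writing.

```python
import tempfile
from pathlib import Path


def mirrored_output(infile, input_root, output_root):
    """Map a file under input_root to the same relative path under output_root,
    creating any missing parent directories along the way."""
    outfile = output_root / infile.relative_to(input_root)
    # parents=True builds intermediate folders; exist_ok=True tolerates reruns.
    outfile.parent.mkdir(parents=True, exist_ok=True)
    return outfile


with tempfile.TemporaryDirectory() as tmp:
    input_root = Path(tmp) / "input-dir"
    output_root = Path(tmp) / "output-dir"
    infile = input_root / "studium-d" / "scales.xml"
    infile.parent.mkdir(parents=True)
    infile.write_text("<root/>")

    outfile = mirrored_output(infile, input_root, output_root)
    outfile.write_text(infile.read_text())
    copied = outfile.read_text()
    relative = outfile.relative_to(output_root).as_posix()
```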
