Looking at the documentation, there doesn't seem to be any way to filter directly in load_files (or, rather, you can whitelist categories, but you can't whitelist files within the categories, or blacklist at either level).
You might want to consider filing a feature request with the scikit-learn project. Alternatively, you might consider it a bug that hidden files are loaded (hidden as defined appropriately for the platform, which on OS X and other POSIX systems should include files whose names start with a dot), and file a bug report on that.
Meanwhile, there is a load_content flag that you can set:
load_content : boolean, optional (default=True)
Whether to load or not the content of the different files. If true a ‘data’ attribute containing the text information is present in the data structure returned. If not, a filenames attribute gives the path to the files.
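Pass False, and it will just find the filenames for you, which you can then filter however you want, then load manually. Note that the filenames attribute holds full paths, so test the last path component rather than the whole string, e.g., filenames = (filename for filename in ret.filenames if not os.path.basename(filename).startswith('.')).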
This seems like the best solution available with the given tools.
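A minimal sketch of that approach, assuming scikit-learn is installed and that container_path points at the usual one-folder-per-category layout. The helper names is_hidden and load_visible are hypothetical, not part of scikit-learn; the scikit-learn import sits inside the function only so the filter can be used on its own.

```python
import os


def is_hidden(path):
    """Return True if the last path component starts with a dot."""
    return os.path.basename(path).startswith('.')


def load_visible(container_path):
    """Load raw file contents and labels, skipping hidden files (hypothetical helper)."""
    # Imported here so is_hidden() stays usable without scikit-learn.
    from sklearn.datasets import load_files

    # load_content=False returns filenames and targets without reading the files.
    bunch = load_files(container_path, load_content=False)

    data, target = [], []
    for path, label in zip(bunch.filenames, bunch.target):
        if is_hidden(path):
            continue
        # Read bytes; decode later with whatever encoding the corpus uses.
        with open(path, 'rb') as f:
            data.append(f.read())
        target.append(label)
    return data, target, bunch.target_names
```

Filtering filenames and targets together in one loop keeps each document paired with its category label, which matters because load_files shuffles by default.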
On the other hand, given how simple load_files actually is (especially if you don't use the extra features like categories or shuffle), it might be simpler to just not use it, and instead use os.walk or just os.listdir. In this case, given that the files are exactly 2 levels deep, rather than at an arbitrary depth, the latter is probably simpler:
import os

def getfilenames(category):
    # List one category directory, skipping hidden files such as .DS_Store.
    return [filename for filename in os.listdir(category)
            if not filename.startswith('.')]

categoryfiles = [getfilenames(os.path.join(rootpath, category))
                 for category in os.listdir(rootpath)]
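If the files were nested at an arbitrary depth instead, the os.walk alternative mentioned above might look like the sketch below. The function name visible_files is an assumption made for illustration; it also skips hidden directories, which goes slightly beyond what the question asks for.

```python
import os


def visible_files(rootpath):
    """Yield paths of non-hidden files under rootpath, skipping hidden directories."""
    for dirpath, dirnames, filenames in os.walk(rootpath):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        for filename in filenames:
            if not filename.startswith('.'):
                yield os.path.join(dirpath, filename)
```

Because it is a generator, wrap the call in list(...) if you need to iterate over the result more than once.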