How to ignore some files when using sklearn's load_files?

Question

macOS generates a .DS_Store file in my training data directory, and load_files tries to load it and raises an exception like

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 1116

How can I filter out the .DS_Store file, other than by deleting it?

@miku: He's presumably relying on [`sklearn.datasets.load_files`](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files) to do the iteration, as his question implies. — abarnert, Jan 01 '13 at 07:15

score 3 · Accepted Answer · edited Jun 20 '20 at 09:12

Looking at the documentation, there doesn't seem to be any way to filter directly in load_files (or, rather, you can whitelist categories, but you can't whitelist files within the categories, or blacklist at either level).

You might want to consider filing a feature request to the scikit-learn project. Alternatively, you might consider it a bug that hidden files (as defined appropriately for the platform—but on OS X and other POSIX systems that should include files whose names start with .) are loaded, and file a bug report on that.

Meanwhile, there is a load_content flag that you can set:

load_content : boolean, optional (default=True)

Whether to load or not the content of the different files. If true a ‘data’ attribute containing the text information is present in the data structure returned. If not, a filenames attribute gives the path to the files.

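Pass False, and it will just find the filenames for you, which you can then filter however you want, then load manually. Note that the returned filenames are paths that include the directory, so test the base name rather than the whole path (e.g., filenames = [filename for filename in ret.filenames if not os.path.basename(filename).startswith('.')]).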
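A minimal sketch of that load_content=False route. It assumes the files are UTF-8 text and that the entries in ret.filenames and ret.target line up one-to-one; the helper names drop_hidden, read_texts and load_visible_files are made up for illustration and are not part of scikit-learn.

```python
import os


def drop_hidden(filenames, targets):
    """Keep only (path, target) pairs whose base name does not start with '.'."""
    kept = [(path, target) for path, target in zip(filenames, targets)
            if not os.path.basename(path).startswith('.')]
    return [path for path, _ in kept], [target for _, target in kept]


def read_texts(filenames, encoding='utf-8'):
    """Read and decode each file (assumption: the data set is plain text)."""
    texts = []
    for path in filenames:
        with open(path, 'rb') as handle:
            texts.append(handle.read().decode(encoding))
    return texts


def load_visible_files(container_path, encoding='utf-8'):
    """Hypothetical wrapper: let load_files find the paths, then load manually."""
    # Imported here so the two helpers above work without scikit-learn.
    from sklearn.datasets import load_files

    bunch = load_files(container_path, load_content=False)
    filenames, targets = drop_hidden(bunch.filenames, bunch.target)
    return read_texts(filenames, encoding), targets, bunch.target_names
```

Something like texts, targets, target_names = load_visible_files('train') could then stand in for the original call, where 'train' is a placeholder for your own data directory.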

This seems like the best solution available with the given tools.

On the other hand, given how simple load_files actually is—especially if you don't use the extra features like categories or shuffle—it might be simpler to just not use it, and instead use os.walk or just os.listdir. In this case, given that the files are exactly 2 levels deep, rather than at an arbitrary depth, the latter is probably simpler:

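import os

def getfilenames(category):
    # Skip hidden files such as .DS_Store.
    return [filename for filename in os.listdir(category)
            if not filename.startswith('.')]
# A .DS_Store directly under rootpath is a file, not a category, so skip it too.
categoryfiles = [getfilenames(os.path.join(rootpath, category))
                 for category in os.listdir(rootpath)
                 if not category.startswith('.')]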

Thank you so much, I have changed the code of load_files based on your code. — user1687717, Jan 01 '13 at 08:23
If we set the load_content flag to False, then how do we load the files manually? Can you answer my question here: http://stackoverflow.com/questions/17788431/scikit-learn-how-to-know-documents-in-the-cluster — Ashish Negi, Jul 23 '13 at 04:36

miku · Answer 2 · 2013-01-01T07:36:25.873

0

A quick glance at the source of load_files reveals that your only option is to delete the .DS_Store files:

documents = [join(folder_path, d)
    for d in sorted(listdir(folder_path))]

(If you want to get serious about the .DS_Store pollution, here's a serious kernel extension: https://github.com/binaryage/asepsis.)

edited Jan 01 '13 at 07:36

answered Jan 01 '13 at 07:22


So passing `load_content=False` and then loading the files iteratively isn't an option? – abarnert Jan 01 '13 at 07:24
@abarnert; I haven't worked with this toolkit, so I can't really answer that. – miku Jan 01 '13 at 07:29

Neither have I, but I did read the single short page of documentation. – abarnert Jan 01 '13 at 07:31

score 0 · Answer 3 · answered Jul 28 '15 at 09:30

I have modified sklearn's load_files to accept an additional parameter, 'ignore_files', which takes a list of files to be ignored. You can use this definition of load_files instead of the sklearn one. It returns the same result as load_files, since I am only filtering out the files that need to be ignored.

Usage:

load_files(dir_path,ignore_files=".DS_Store")
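The gist itself is not reproduced here, so the following is only a rough sketch of what such a loader might look like, not the code from the gist. It assumes the usual layout of one sub-directory per category, uses only the standard library, and differs from sklearn's load_files in that it returns a plain dict and does not shuffle. The name load_files_ignoring is hypothetical.

```python
import os


def load_files_ignoring(container_path, ignore_files=(), encoding=None):
    """Load category/file layouts, skipping any file whose name is in ignore_files."""
    # Accept a single name as well as a list of names.
    if isinstance(ignore_files, str):
        ignore_files = [ignore_files]
    ignored = set(ignore_files)

    # Each sub-directory of container_path is treated as one category.
    target_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name)))

    data, target, filenames = [], [], []
    for label, category in enumerate(target_names):
        folder = os.path.join(container_path, category)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if name in ignored or not os.path.isfile(path):
                continue
            with open(path, 'rb') as handle:
                content = handle.read()
            # Leave the content as bytes unless an encoding is given.
            data.append(content.decode(encoding) if encoding else content)
            target.append(label)
            filenames.append(path)

    return {'data': data, 'target': target,
            'target_names': target_names, 'filenames': filenames}
```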

Source on gist
