1

macOS generates a .DS_Store file in my training data set directory, and load_files loads it and raises an exception like:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 1116

How can I filter out the .DS_Store file without deleting it?

Vsevolod Dyomkin
user1687717
  • Can you show us how you iterate over the files? – miku Jan 01 '13 at 07:13
  • @miku: He's presumably relying on [`sklearn.datasets.load_files`](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files) to do the iteration, as his question implies. – abarnert Jan 01 '13 at 07:15

3 Answers

3

Looking at the documentation, there doesn't seem to be any way to filter directly in load_files (or, rather, you can whitelist categories, but you can't whitelist files within the categories, or blacklist at either level).

You might want to consider filing a feature request to the scikit-learn project. Alternatively, you might consider it a bug that hidden files (as defined appropriately for the platform—but on OS X and other POSIX systems that should include files whose names start with .) are loaded, and file a bug report on that.

Meanwhile, there is a load_content flag that you can set:

load_content : boolean, optional (default=True)

Whether to load or not the content of the different files. If true a ‘data’ attribute containing the text information is present in the data structure returned. If not, a filenames attribute gives the path to the files.

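Pass False, and it will just find the filenames for you, which you can then filter however you want, then load manually. Note that ret.filenames holds full paths, so test the basename rather than the whole path (e.g., filenames = (filename for filename in ret.filenames if not os.path.basename(filename).startswith('.'))).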

This seems like the best solution available with the given tools.
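A minimal sketch of that approach, assuming Python 3 and UTF-8 text files. The helper names drop_hidden and load_visible_files are made up for illustration and are not part of scikit-learn; only load_files, its load_content flag, and the filenames, target and target_names attributes of the result come from the library.

```python
import os


def drop_hidden(filenames, targets):
    """Keep only (path, target) pairs whose basename does not start with '.'."""
    kept = [(path, target) for path, target in zip(filenames, targets)
            if not os.path.basename(path).startswith(".")]
    paths = [path for path, _ in kept]
    labels = [target for _, target in kept]
    return paths, labels


def load_visible_files(container_path, encoding="utf-8"):
    """Load every non-hidden file under container_path as decoded text."""
    # Imported here so drop_hidden can be reused without scikit-learn installed.
    from sklearn.datasets import load_files

    # load_content=False only collects paths and labels; nothing is decoded yet.
    bunch = load_files(container_path, load_content=False)
    paths, labels = drop_hidden(bunch.filenames, bunch.target)

    data = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            data.append(handle.read())
    return data, labels, bunch.target_names
```

The labels are filtered together with the paths so that data and labels stay aligned after the hidden files are dropped.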

On the other hand, given how simple load_files actually is—especially if you don't use the extra features like categories or shuffle—it might be simpler to just not use it, and instead use os.walk or just os.listdir. In this case, given that the files are exactly 2 levels deep, rather than at an arbitrary depth, the latter is probably simpler:

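import os

def getfilenames(category):
    # List one category directory, skipping hidden files such as .DS_Store.
    return [filename for filename in os.listdir(category)
            if not filename.startswith('.')]

# One list of filenames per category directory under rootpath.
categoryfiles = [getfilenames(os.path.join(rootpath, category))
                 for category in os.listdir(rootpath)
                 if not category.startswith('.')]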
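If the files were at an arbitrary depth instead, the os.walk route mentioned above could look roughly like this. It is only a sketch: walk_visible_files is a hypothetical name, and "hidden" is assumed to mean "name starts with a dot", which matches .DS_Store on OS X.

```python
import os


def walk_visible_files(rootpath):
    """Yield paths of all non-hidden files below rootpath, at any depth."""
    for dirpath, dirnames, filenames in os.walk(rootpath):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for filename in sorted(filenames):
            if not filename.startswith("."):
                yield os.path.join(dirpath, filename)
```

Assigning to dirnames[:] rather than rebinding the name matters here, because os.walk only honors in-place changes to that list when walking top-down.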
Community
abarnert
  • Thank you so much. I have changed the code of load_files based on your code. – user1687717 Jan 01 '13 at 08:23
  • If we set the load_content flag to False, how do we then load the files manually? Can you answer my question here: http://stackoverflow.com/questions/17788431/scikit-learn-how-to-know-documents-in-the-cluster – Ashish Negi Jul 23 '13 at 04:36
0

A quick glance at the source of load_files reveals that your only option is to delete the .DS_Store files:

documents = [join(folder_path, d)
    for d in sorted(listdir(folder_path))]

(If you want to get serious about the .DS_Store pollution, here's a serious kernel extension: https://github.com/binaryage/asepsis).

miku
0

I have modified sklearn's load_files to accept an additional parameter, 'ignore_files', which takes a list of files to be ignored. You can use this definition of load_files instead of the sklearn one. It returns the same result as load_files, since it only filters out the files that need to be ignored.

Usage:

load_files(dir_path,ignore_files=".DS_Store")

Source on gist
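The gist itself is not reproduced here. Purely as an illustration of how an ignore list could behave, here is a standard-library sketch for a two-level category/file layout. Everything in it is an assumption rather than the gist's code: the name load_text_files, the return shape, the UTF-8 default, and the choice to accept either a single name or a list for ignore_files.

```python
import os


def load_text_files(container_path, ignore_files=(".DS_Store",), encoding="utf-8"):
    """Read a two-level category/file tree, skipping names in ignore_files."""
    # Accept a single name as well as a collection of names.
    if isinstance(ignore_files, str):
        ignore_files = [ignore_files]
    ignored = set(ignore_files)

    # Each subdirectory of container_path is treated as one category.
    target_names = sorted(
        name for name in os.listdir(container_path)
        if os.path.isdir(os.path.join(container_path, name))
    )

    data, target = [], []
    for label, category in enumerate(target_names):
        folder = os.path.join(container_path, category)
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if name in ignored or not os.path.isfile(path):
                continue
            with open(path, encoding=encoding) as handle:
                data.append(handle.read())
            target.append(label)
    return data, target, target_names
```

Matching on exact file names keeps the behavior predictable; a prefix test such as name.startswith('.') would be the alternative if all hidden files should be skipped.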

Jeevan