
As part of my job, we compile a comprehensive list of all files a user has on their drive. The users have to decide per file whether to archive it or not (indicated by Y or N). As a service to these users, we fill this in for them manually.

We export the file list to a long Excel sheet, which displays each file as a full path, for example X:\4. Economics\10. xxxxxxxx\04. xxxxxxxxx\04. xxxxxxxxxx\filexyz.pdf

I'd argue that we can easily automate this, as standard naming conventions make it easy to decide which files to keep and which to delete. For example, a file with the string "CAB" in the filename should be kept. However, I have no idea how or where to start. Can someone point me in the right direction?

1 Answer


I would suggest the following general steps:

  1. Get the raw data

You can read the Excel file into a pandas dataframe in Python. Ideally, you will have a raw dataframe that looks something like this

     Filename                           Keep
0    X:\4. Economics ...\filexyz.pdf    0
1    X:\4. Economics ...\fileabc.pdf    1
2    X:\3. Finance   ...\filetef.pdf    1
3    X:\3. Finance   ...\file123.pdf    0
4    G:\2. Philosophy ..\file285.pdf    0
                   ....
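As a rough sketch of this step: the snippet below assumes the export has two columns named `Filename` and `Keep`, with `Keep` holding the Y/N values from the question. Those column names and the helper `load_file_list` are assumptions for illustration, and the sample rows are just the truncated paths from the table above.

```python
import pandas as pd


def load_file_list(excel_path):
    """Read the exported list from Excel (hypothetical helper).

    Assumes the sheet has 'Filename' and 'Keep' columns; reading .xlsx
    files with pandas also needs an engine such as openpyxl installed.
    """
    return pd.read_excel(excel_path)


# Stand-in for the real export, using the rows from the table above.
raw = pd.DataFrame(
    {
        "Filename": [
            r"X:\4. Economics ...\filexyz.pdf",
            r"X:\4. Economics ...\fileabc.pdf",
            r"X:\3. Finance   ...\filetef.pdf",
            r"X:\3. Finance   ...\file123.pdf",
            r"G:\2. Philosophy ..\file285.pdf",
        ],
        "Keep": ["N", "Y", "Y", "N", "N"],
    }
)

# Turn the Y/N answers into 1/0 so a classifier can use them as labels.
raw["Keep"] = raw["Keep"].map({"Y": 1, "N": 0})
print(raw)
```

In practice you would call `load_file_list` on your own export instead of building the dataframe by hand.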
  2. Preprocess/clean

This part is more up to you. For example, you could remove all special characters and standalone numbers, such as the numeric folder prefixes. That would leave strings like the following

     Filename                     Keep
0    "X Economics filexyz pdf"    0
1    "X Economics fileabc pdf"    1
2    "X Finance filetef pdf"      1
3    "X Finance file123 pdf"      0
4    "G Philosophy file285 pdf"   0
                ....
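One possible way to do this cleaning, using only the standard library: the hypothetical helper `clean_path` below replaces every run of non-alphanumeric characters with a space and then drops tokens that consist only of digits. That is just one choice among many; it keeps names like `file123` intact, as in the table above.

```python
import re


def clean_path(path):
    """Reduce a file path to space-separated words (hypothetical helper)."""
    # Replace separators, dots, colons and other symbols with spaces.
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", path)
    # Drop tokens that are purely numeric, such as the "4" in "4. Economics".
    words = [token for token in spaced.split() if not token.isdigit()]
    return " ".join(words)


print(clean_path(r"X:\4. Economics ...\filexyz.pdf"))  # X Economics filexyz pdf
print(clean_path(r"X:\3. Finance   ...\file123.pdf"))  # X Finance file123 pdf
```

With a pandas dataframe you could apply it to a whole column, for example `df["Filename"].map(clean_path)`, assuming the column is called `Filename`.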
  3. Vectorize your strings

For an algorithm to understand your text data, you typically vectorize it. This means you turn the strings into numbers that the algorithm can process. An easy way to do this is with TF-IDF in scikit-learn. After this, your dataframe might look something like this

     Filename                               Keep
0    [0.6461,  0.3816 ...  0.01,  0.38]     0
1    [0.,      0.4816 ...  0.25,  0.31]     1
2    [0.61,    0.1663 ...  0.11,  0.35]     1
                       ....
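A minimal sketch of the vectorizing step with scikit-learn's `TfidfVectorizer`, run on the cleaned strings from step 2. The default settings are used here as an assumption; they lowercase the text and ignore single-character tokens, so the drive letters `X` and `G` do not become features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned strings from the preprocessing step.
cleaned = [
    "X Economics filexyz pdf",
    "X Economics fileabc pdf",
    "X Finance filetef pdf",
    "X Finance file123 pdf",
    "G Philosophy file285 pdf",
]

vectorizer = TfidfVectorizer()
# Each row of X is the TF-IDF vector for one file path (a sparse matrix).
X = vectorizer.fit_transform(cleaned)

print(X.shape)                         # (number of files, number of distinct words)
print(sorted(vectorizer.vocabulary_))  # the words that became features
```

The matrix is usually kept separate from the dataframe and passed straight to the classifier, rather than stored back into a column.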
  4. Train a classifier

Now that you have nice numbers for the algorithms to work with, you can train a classifier with scikit-learn. Simply search for "scikit learn classification example" and you will find plenty.

Once you have a trained classifier, you can compare its predictions against the known labels on test data that it has not seen before. That way you get a feeling for its accuracy.
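To make the train-and-evaluate step concrete, here is a small sketch. Logistic regression is just one classifier picked for illustration, and the texts and labels are made-up toy data (files with "CAB" in the name are marked as keep, following the rule in the question); your real labels would come from the `Keep` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned file names; 1 = keep, 0 = do not keep.
texts = [
    "X Economics CAB budget pdf", "X Economics CAB minutes pdf",
    "X Finance CAB report pdf", "X Finance CAB forecast pdf",
    "G Philosophy CAB notes pdf", "G Philosophy CAB summary pdf",
    "X Economics draft budget pdf", "X Economics old minutes pdf",
    "X Finance temp report pdf", "X Finance scratch forecast pdf",
    "G Philosophy copy notes pdf", "G Philosophy backup summary pdf",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a quarter of the files so the model is scored on unseen data.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline vectorizes the strings and then fits the classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy on held-out files: {accuracy:.2f}")
```

With a dataset this small the score means very little; on a real export with thousands of rows the held-out accuracy gives a more honest picture.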

Hopefully that is enough to get you started!

Hakim K