
I have a problem statement and need to know whether it can be solved with machine learning. It goes like this:

I have a system in which a user can upload documents. Let's say we have a file named xxxZxxx.xxx.

The user navigates several levels into the system's folder structure and places the file at, say, A/B/C/D/Z/xxxZxxx.xxx

We need to make a system that reads the file name and suggests the path where it is to be placed.

In this case the file name contains the last part of the path (Z), which is a Business Object directory, but that is not always so. We have on the order of 10^5 such paths and documents.

New paths, i.e. business objects, may be added over time, which makes this a multi-class classification problem with approximately 10^5 classes, and that number keeps growing.

Is this solvable?

I tried using a bag of characters (inspired by bag of words) as the feature vector, but that failed.
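
For illustration, here is a minimal sketch of what a bag-of-characters feature looks like, next to character n-grams. The n-gram variant is only an assumption about something that could be tried, not something already tested on this data, and the file names are hypothetical. It shows one property of the bag of characters that may matter here: it discards character order, so two different names can map to the same vector.

```python
from collections import Counter

def bag_of_chars(name):
    """Count each character in the lower-cased file name; order is discarded."""
    return Counter(name.lower())

def char_ngrams(name, n=3):
    """Count overlapping character n-grams, which keep local ordering."""
    s = name.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

# Hypothetical file names: same characters, different order.
a, b = "invoice_ab.pdf", "invoice_ba.pdf"
print(bag_of_chars(a) == bag_of_chars(b))  # True: the two names are indistinguishable
print(char_ngrams(a) == char_ngrams(b))    # False: trigrams tell them apart
```

Whether this loss of ordering is the actual reason the approach failed is not established; it is just one thing worth checking.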

Any comments on an approach that could be followed for this? Let me know if any other information is needed and I will edit the question or change the tags.

divyenduz

1 Answer


To decide whether this is truly an ML problem, please answer the following:

1) Why can't you just read the file name and derive the child folder where the file needs to be placed? Is it because, as you said, the user may not provide the name of the child folder as part of the file name? Or is it because there might be many directories with the name that the user provided?

2) ML problems typically involve patterns that are statistical in nature and hard to identify by eye or with simple rules such as a regex. Here, can't you easily find the appropriate folder using a regular expression search?

Abhimanu Kumar
  • Hi, I have gone through the options you mentioned. The user may or may not include the child folder name in the file name, although there will be only one such folder. It is not a problem that can be handled with a regex. We are trying to find a pattern in people's naming conventions, if that makes it clearer. – divyenduz Sep 05 '14 at 12:46
  • So you are saying that the user may give a name that matches a directory already present, or it may be a different name altogether. And you don't know the user's naming convention, hence you don't know what pattern to look for. If the above is true and you want to turn it into an ML problem, then it has to be supervised learning. Do you have previous user data in which you already know which user-given file name corresponds to which directory structure? If you do, then we can think further about framing this as an ML problem; otherwise I think there is little hope of treating it as one. – Abhimanu Kumar Sep 05 '14 at 18:54
  • The reason it is very hard to frame this as an unsupervised learning problem is that you are expecting (or there exists) a 100% correct response for every file name the user throws at you. So you have a distinct label (directory structure) for every data point (file name). – Abhimanu Kumar Sep 05 '14 at 18:55
  • Yes, I have data for more than 6x10^5 records. – divyenduz Sep 07 '14 at 17:16
  • In that case, I would recommend the simplest approach first: make each directory configuration a single class and train a one-vs-all classifier for each class. Try this for the top 10 to 15 classes in your dataset and see how well it performs. After that you can apply more sophisticated techniques, such as grouping classes. – Abhimanu Kumar Sep 08 '14 at 22:22
  • We are currently using Random Forest on 2% of the data, i.e. around 10000 records with 1500 classes, and we get 50% accuracy. Can you suggest something for a problem with this many classes? – divyenduz Sep 09 '14 at 11:16
  • Random Forest is a good approach, but hierarchical classification is a better one in this case. See this paper: http://research.microsoft.com/en-us/um/people/sdumais/sigir00.pdf (you can keep using your decision trees instead of the SVM used in the paper). Given that you have a hierarchical directory structure, it is very likely that you will find a class hierarchy. – Abhimanu Kumar Sep 11 '14 at 18:30
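
The hierarchical idea from the last comment can be sketched roughly as follows: instead of one classifier over every full path, a small local decision is made at each folder level, choosing only among that folder's children. This is a toy illustration under stated assumptions, not the method from the linked paper and not a replacement for the Random Forest: the local "classifier" here is just cosine similarity over character trigram counts, and the class name `HierarchicalClassifier`, the helper functions, and the sample file names and paths are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def trigrams(name):
    """Character trigram counts of a lower-cased file name (assumed feature choice)."""
    s = name.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * q[k] for k, v in p.items() if k in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

class HierarchicalClassifier:
    """Greedy top-down classifier: one local decision per folder level."""

    def __init__(self):
        # profiles[parent_path][child] -> trigram counts of all files below that child
        self.profiles = defaultdict(lambda: defaultdict(Counter))

    def fit(self, samples):
        for filename, path in samples:
            parts = path.split("/")
            feats = trigrams(filename)
            for depth, child in enumerate(parts):
                parent = tuple(parts[:depth])
                self.profiles[parent][child].update(feats)
        return self

    def predict(self, filename):
        feats = trigrams(filename)
        node = ()
        # Descend until the current folder has no known children.
        while node in self.profiles:
            children = self.profiles[node]
            best = max(children, key=lambda c: cosine(feats, children[c]))
            node = node + (best,)
        return "/".join(node)

# Hypothetical training data: (file name, directory path).
samples = [
    ("report_alpha_2014.pdf", "A/B/alpha"),
    ("alpha_summary.doc", "A/B/alpha"),
    ("beta_invoice.pdf", "A/C/beta"),
    ("invoice_beta_q3.xls", "A/C/beta"),
]
clf = HierarchicalClassifier().fit(samples)
print(clf.predict("alpha_notes.txt"))
print(clf.predict("beta_contract.pdf"))
```

Each local decision only has to separate a handful of sibling folders rather than all 10^5 leaf classes, and a newly added business object only changes the profiles along its own path. The greedy descent has a known weakness: a wrong choice near the root cannot be corrected further down, so in practice the per-level model would likely need to be something stronger, such as the decision trees already in use.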