How to classify URLs? what are URLs features? How to select and Extract features from URL

Question

I have just started working on a classification problem. It's a two-class problem: my trained machine learning model will have to decide whether to allow a URL or block it.

My question is very specific:

How to classify URLs? Should I use normal text analysis methods?
What are a URL's features?
How to select and extract features from a URL?

I have a dataset of URLs. I want to train my model to classify a URL as adult or non-adult content. Basically, the model is for filtering: I want to block objectionable webpages using the URL, without downloading the page contents or other features such as the metadata in webpages. So this is a two-class problem. My question is: how can we classify webpages using just URL features? The problem I am having is: what are the best feature extraction methods I can use? – Nasir, Oct 23 '14 at 00:35
Also, are there any API libraries that have a built-in function for this purpose? I am new to machine learning, so please correct me where I am wrong. I will be using Python. – Nasir, Oct 23 '14 at 00:35

score 8 · Accepted Answer · answered Oct 21 '14 at 00:06


I assume you do not have access to the content of the URL, so you can only extract features from the URL string itself. Otherwise, it makes more sense to use the content of the URL.

Here are some features I would try. See this paper for more ideas:

All URL components. For example, this page has the URL below:

http://stackoverflow.com/questions/26456904/how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features

Tokens that occur in different parts of a URL should carry different value for the classification. In this case, the last part, after tokenization, contributes great features for this page (e.g., classify, urls, select, extract, features).

 * stackoverflow
 * com
 * questions
 * 26456904
 * how to classify urls what are urls features how to select and extract features
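The tokenization above could be sketched in Python roughly as follows. This is only a minimal sketch using the standard library: the function name `url_tokens`, the split on non-alphanumeric characters, and the handling of URLs without a scheme are assumptions for illustration, not something the answer prescribes.

```python
import re
from urllib.parse import urlparse

def url_tokens(url):
    """Split a URL into lowercase tokens, grouped by the part they came from."""
    if "://" not in url:
        # Assumption: treat bare hosts such as "www.google.com" as http URLs,
        # because urlparse would otherwise put the whole string into the path.
        url = "http://" + url
    parsed = urlparse(url)
    # Host tokens: "stackoverflow.com" -> ["stackoverflow", "com"]
    host = [t for t in (parsed.hostname or "").split(".") if t]
    # Path tokens: split on any run of non-alphanumeric characters.
    path = [t for t in re.split(r"[^a-z0-9]+", parsed.path.lower()) if t]
    return {"host": host, "path": path}

url = ("http://stackoverflow.com/questions/26456904/"
       "how-to-classify-urls-what-are-urls-features-how-to-select-and-extract-features")
tokens = url_tokens(url)
print(tokens["host"])      # ['stackoverflow', 'com']
print(tokens["path"][:5])  # ['questions', '26456904', 'how', 'to', 'classify']
```

Keeping host and path tokens separate makes it possible to treat the same word differently depending on where in the URL it appears.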

The length of the URL.

n-grams (2-grams as examples below):

 * stackoverflow-com
 * com-questions
 * questions-26456904
 * 26456904-how
 * how-to
 * ....
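As a rough illustration of the length and n-gram features, here is a small Python sketch. The helper name `token_ngrams`, the simple regex tokenization, and the shortened example URL are assumptions made for this example.

```python
import re

def token_ngrams(tokens, n=2):
    """Join each run of n consecutive tokens with '-', e.g. 'how-to'."""
    return ["-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Shortened version of this page's URL, used only to keep the example small.
url = "http://stackoverflow.com/questions/26456904/how-to-classify-urls"

# Drop the scheme, then split on any run of non-alphanumeric characters.
rest = url.lower().split("://", 1)[-1]
tokens = [t for t in re.split(r"[^a-z0-9]+", rest) if t]

features = {"url_length": len(url)}  # length of the raw URL string
bigrams = token_ngrams(tokens, n=2)
print(bigrams[:5])
# ['stackoverflow-com', 'com-questions', 'questions-26456904', '26456904-how', 'how-to']
```

Changing `n` gives longer n-grams. A URL with fewer than `n` tokens simply yields an empty list.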


greeness


greeness, you explained it nicely. I read some papers that managed to classify webpages using just URL features. I am a bit confused about extracting features from simple URLs. For example, www.google.com does not have enough features. If I decide to extract 6 features from every URL in the dataset when training the algorithm, what will happen when a simple URL comes along? – Nasir Oct 22 '14 at 23:05
Most of the features you are using would be sparse. Instead of 6 features, you probably mean 6 types of features or 6 feature families. In the `google.com` example, the only useful feature is the token "google", which should have a strong connection to a label like "search engine". That connection should be learned from your labeled dataset. Therefore you don't need to worry about the **insufficient features** in this example. – greeness Oct 23 '14 at 00:45
Thanks, greeness. Is it like I will tell my estimator/classifier that tokens at the start of an example have more weight than tokens at the end of lengthy examples? – Nasir Oct 23 '14 at 20:26
It's better to let your machine learning model figure that out. – greeness Oct 24 '14 at 03:37
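To make the point about sparse feature families concrete, here is a minimal Python sketch. The function name `url_features` and the `host=` / `path=` prefixes are hypothetical choices for illustration. Each URL becomes a small dictionary of named features, so a short URL like www.google.com just has fewer entries, and prefixing each token with the part it came from gives the model a chance to learn different weights by position.

```python
import re
from collections import Counter
from urllib.parse import urlparse

def url_features(url):
    """Return a sparse {feature_name: count} dict for one URL."""
    if "://" not in url:
        url = "http://" + url  # assumption: bare hosts are treated as http URLs
    parsed = urlparse(url)
    host = [t for t in (parsed.hostname or "").split(".") if t]
    path = [t for t in re.split(r"[^a-z0-9]+", parsed.path.lower()) if t]
    feats = Counter()
    for t in host:
        feats["host=" + t] += 1  # feature family 1: host tokens
    for t in path:
        feats["path=" + t] += 1  # feature family 2: path tokens
    return dict(feats)

print(url_features("www.google.com"))
# {'host=www': 1, 'host=google': 1, 'host=com': 1}
print(url_features("http://example.com/questions/how-to-classify-urls"))
```

Dictionaries like these could then be turned into a sparse matrix for a classifier, for example with scikit-learn's `DictVectorizer`, though that choice of library is an assumption and not something stated in this thread.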
