Python: classify text into the categories

Question

I have part of a training set:

url  category
ebay.com/sch/Mens-Clothing-/1059/i.html?_from=R40&LH_BIN=1&Bottoms%2520Size%2520%2528Men%2527s%2529=33&Size%2520Type=Regular&_nkw=Джинсы&_dcat=11483&Inseam=33&rt=nc&_trksid=p2045573.m1684 Онлайн-магазин
google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=%D0%BA%D0%BA%D1%83%D0%BF%D0%BE%D0%BD%D1%8B%20aliexpress%202016  Search
google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#newwindow=1&q=%D0%BA%D1%83%D0%BF%D0%BE%D0%BD%D1%8B+aliexpress+2016    Search
google.ru/search?q=авито&oq=авито&aqs=chrome..69i57j0l5.1608j0j7&sourceid=chrome&es_sm=122&ie=UTF-8 Search
irecommend.ru/content/kogda-somnenii-byt-ne-mozhet-tolko-klear-blyu-pomozhet    Форумы и отзывы
ebay.com/sch/Mens-Clothing-/1059/i.html?_from=R40&LH_BIN=1&Bottoms%2520Size%2520%2528Men%2527s%2529=33&Size%2520Type=Regular&_dcat=11483&Inseam=33&_nkw=Джинсы&_sop=15  Онлайн-магазин
ebay.com/sch/Mens-Clothing-/1059/i.html?_from=R40&LH_BIN=1&Bottoms%2520Size%2520%2528Men%2527s%2529=33&Size%2520Type=Regular&_dcat=11483&Inseam=33&_nkw=Джинсы&_sop=15  Онлайн-магазин
irecommend.ru/content/gramotnyi-razvod-na-dengi-bolshe-ne-kuplyu-vret   Форумы и отзывы
google.ru/search?q=яндекс&oq=яндекс&aqs=chrome..69i57j69i61l3j69i59l2.1383j0j1&sourceid=chrome&es_sm=93&ie=UTF-8    Search
google.ru/search?q=авито&oq=авито&aqs=chrome..69i57j69i59j69i60.1095j0j1&sourceid=chrome&es_sm=93&ie=UTF-8  Search
otzovik.com/review_1399716.html#debug   Форумы и отзывы
svyaznoy.ru Онлайн-магазин
mvideo.ru/smartfony-sotovye-telefony/apple-iphone-2927  Онлайн-магазин
mvideo.ru/promo/rassrochka-0-0-12-mark24197850/f/category=iphone-914?sort=priceLow&_=1453896710474&categoryId=10    Онлайн-магазин
svyaznoy.ru/catalog/phone/224/tag/windows-phone Онлайн-магазин
google.it/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=%D0%B5%D0%B2%D1%80%D0%BE%D1%81%D0%B5%D1%82%D1%8C    Search
vk.com   Social network

It maps each URL to a category. I also have a test set, and I need to assign a category to every URL in it.

url    
vk.com/topic-102849764_32295213
stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set
google.ru/search?q=learning+sample&oq=learning+sample&aqs=chrome..69i57.4063j0j1&sourceid=chrome&ie=UTF-8#newwindow=1&q=machine+learning+test+and+learn
facebook.com
locals.ru
tvzvezda.ru/news/vstrane_i_mire/content/201609261038-k6n1.htm

I don't know which algorithm to use for this task. I want the approach that gives the highest accuracy. I also think having multiple categories makes the problem harder.

First I tried parsing the HTML title tag, because I don't think I can determine the category from the URL alone.

score 2 · Accepted Answer · edited May 23 '17 at 10:34

Basically, you want to classify strings into categories, so you need a classifier. Rather than picking just one, test several and choose the most accurate.

First, though, think about the features of each URL. I expect you will not achieve great accuracy if you simply feed in the URL as a string and use it as the only feature.

Instead, preprocess each URL to extract features. Which features are relevant or useful depends strongly on the domain. A feature could be:

simple features

the first word up to the dot, e.g. facebook for "facebook.com"
the length of the whole string

complex features

keyword counts: define keywords for each cluster (for an "online-shopping" cluster, say [promo, buy, shop, sell, price]), then use the number of those keywords that occur in the string as a feature, one count per cluster
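
A minimal sketch of the keyword-count feature described above. The category names and keyword lists in `CATEGORY_KEYWORDS` are illustrative assumptions, not taken from the question's data; you would pick your own per cluster.

```python
# Hypothetical keyword lists; both the cluster names and the words are made up
# for illustration and should be replaced with ones that fit your categories.
CATEGORY_KEYWORDS = {
    "online_shop": ["promo", "buy", "shop", "sell", "price"],
    "search": ["search", "webhp"],
    "reviews": ["review", "recommend", "otzov"],
}


def keyword_counts(url, keywords_by_category=CATEGORY_KEYWORDS):
    """Return {category: number of that category's keywords found in url}."""
    lowered = url.lower()  # match case-insensitively, e.g. "priceLow"
    return {
        category: sum(1 for word in words if word in lowered)
        for category, words in keywords_by_category.items()
    }


counts = keyword_counts(
    "mvideo.ru/promo/rassrochka-0-0-12-mark24197850/f/category=iphone-914?sort=priceLow"
)
print(counts)
```

Each count becomes one numeric column in the feature matrix, so a URL is described by as many keyword features as there are clusters.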

So start with feature engineering, then compare the performance of several classifiers.
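
One way to sketch that second step, assuming scikit-learn is installed. The toy sample below uses shortened URLs in the style of the question, and the labels `shop`, `search` and `reviews` are English placeholders, not the question's actual categories. The four features in `url_features` are an arbitrary illustrative choice, and scores from nine samples mean nothing; the point is only the shape of the comparison.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def url_features(url):
    """Turn one URL into a small numeric feature vector (illustrative choice)."""
    return [len(url), url.count("/"), url.count("="), url.count("-")]


# Toy sample: shortened URLs modelled on the question, three per class.
samples = [
    ("svyaznoy.ru/catalog/phone/224/tag/windows-phone", "shop"),
    ("mvideo.ru/smartfony-sotovye-telefony/apple-iphone-2927", "shop"),
    ("ebay.com/sch/Mens-Clothing-/1059/i.html?_from=R40&LH_BIN=1", "shop"),
    ("google.ru/search?q=avito&oq=avito&sourceid=chrome&ie=UTF-8", "search"),
    ("google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8", "search"),
    ("google.it/webhp?sourceid=chrome-instant&ion=1&espv=2", "search"),
    ("irecommend.ru/content/gramotnyi-razvod-na-dengi-bolshe-ne-kuplyu-vret", "reviews"),
    ("irecommend.ru/content/kogda-somnenii-byt-ne-mozhet", "reviews"),
    ("otzovik.com/review_1399716.html", "reviews"),
]
X = [url_features(url) for url, _ in samples]
y = [label for _, label in samples]

# Candidate classifiers; scaling is added where the model is distance- or
# gradient-based, and skipped for the tree ensemble.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "nearest_neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=3)
    ),
}

# Mean accuracy over 3 stratified folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in models.items()
}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

With real data you would use the full training set, more folds, and richer features before trusting the ranking.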

Additional input:

Similar question on SO (regarding URL features)

Text feature extraction

Fast Webpage Classification Using URL Features

EDIT: An example

url = "irecommend.ru/content/kogda-somnenii-byt-ne-mozhet-tolko-klear-blyu-pomozhet"

f1 = len(url)              # length of the whole string: 76
f2 = url.split("/", 1)[0]  # base (domain): "irecommend.ru"
f3 = url.count("/")        # number of "/" separators (segments): 2

More solutions, from here, by Eiyrioü von Kauyf:

import string

def count(chars, allowed):
    """Count the characters of chars that are members of allowed."""
    return sum(1 for c in chars if c in allowed)

f4 = count(url, set(string.punctuation))    # count_punctuation
f5 = count(url, set(string.ascii_letters))  # count_ascii

Yet all of these are very simple features that do not capture the semantic content of the URL. Depending on the depth/sophistication of your target variables (clusters), you might need n-gram based features, such as those described here.
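
To make "n-gram based features" concrete, here is a small sketch that counts overlapping character n-grams using only the standard library. The helper name `char_ngrams` and the choice of n = 3 are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the overlapping character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


grams = char_ngrams("svyaznoy.ru/catalog/phone", n=3)
print(grams.most_common(5))
```

Each distinct n-gram then becomes one column of the feature matrix. If you use scikit-learn, `CountVectorizer(analyzer="char", ngram_range=(3, 3))` builds a comparable matrix for a whole list of URLs.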

answered Sep 28 '16 at 11:06

Nikolas Rieble

Which is better: classifying the URLs, or getting the `title` from the HTML page and classifying that? – Petr Petrov Sep 28 '16 at 11:51
@PetrPetrov You should choose the algorithm. The easiest to understand are linear regression (requires parameter tuning and scaling, sometimes data sampling), RandomForests (need no scaling and work reasonably with default hyperparameters), and Nearest Neighbors (requires tuning the number of neighbors, a very simple integer parameter). There are several other well-known algorithms. Choose the best one (the one with the best error metrics), tune its hyperparameters, and get a result with the required precision (in your case, the percentage of classes predicted correctly, or some more complex metric). – sergzach Sep 28 '16 at 13:01
To find the best parameters, use GridSearchCV. – sergzach Sep 28 '16 at 13:06
@sergzach, when I try to use linear regression I get the error `ValueError: could not convert string to float: 'eldorado.ru/cat/detail/71131483?show=response#customTabAnchor'`, and I get the same error when I try it with a sentence (the page title). How should I use a sentence to get the category? – Petr Petrov Sep 28 '16 at 13:25
@PetrPetrov Linear regression usually works with numeric data, not strings. Also, your task is not regression, because you want to classify, so use a classifier. The input should be categories rather than strings: convert your strings to numbers, where each number is the index of the string in an array of unique strings. In other words, convert your string parameters into categorical ones (in fact integers). The library does not work with strings directly; it needs integers or floats. – sergzach Sep 28 '16 at 13:41
@sergzach, okay, I understand that. Should I convert only the category (y_train) to numeric, or should I also convert the URLs and page titles (X_train)? – Petr Petrov Sep 28 '16 at 13:45
@PetrPetrov You should convert url, category and y into **real categories** (as sklearn understands them, e.g. integer values). You may also need to preprocess your URLs, because they contain parameters; perhaps parts of them should be removed. For example, removing the parameters gives you more repeated URLs, which you can then categorize. – sergzach Sep 28 '16 at 13:50
@PetrPetrov Excuse me, maybe *y* does not need to be converted into categories. Save all your data in CSV format, then read it with pandas. There are many tutorials on working with pandas/sklearn. – sergzach Sep 28 '16 at 14:03
To feed these algorithms, you need a feature matrix. Right now you only have one string (the URL) for each observation. It would be much better to extract relevant features instead of feeding in the whole URL. – Nikolas Rieble Sep 28 '16 at 14:25
I do not know whether classifying the title from the HTML page or the URL works better. I guess both work poorly if you feed in the string without extracting features. – Nikolas Rieble Sep 28 '16 at 14:25
@NikolasRieble can you tell me which function I should use? Please give me an example; I can't understand your advice. – Petr Petrov Sep 28 '16 at 20:46
I added an example, but text classification (string classification) is not easy, and you might need to do more research than just looking for a function. – Nikolas Rieble Sep 29 '16 at 07:53
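
Two of the preprocessing steps suggested in the comments, dropping URL parameters and turning string categories into integers, can be sketched with the standard library alone. The helper names `strip_parameters` and `encode_labels` are hypothetical, and the sketch assumes URLs without a scheme, as in the question's data.

```python
from urllib.parse import urlsplit


def strip_parameters(url):
    """Drop the query string and fragment, keeping only host and path."""
    # The sample URLs have no "http://" prefix, so add "//" to make
    # urlsplit treat the first component as the host.
    parts = urlsplit("//" + url)
    return parts.netloc + parts.path


def encode_labels(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return [index[label] for label in labels], index


clean = strip_parameters("otzovik.com/review_1399716.html#debug")
codes, mapping = encode_labels(["Search", "Онлайн-магазин", "Search"])
print(clean, codes, mapping)
```

Keeping the `mapping` dictionary lets you translate predicted integers back into category names afterwards.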
