Extract titles and paragraphs from HTML by element style

Question

What is the best data-mining strategy for extracting titles and paragraphs from an HTML file based on element style (fontSize, fontWeight, …)? I have already extracted the text and the fontSize attribute and put them in a CSV file. Now I need to know how to classify (or cluster?) this data so that it gives me, for example, all the elements that have a fontSize of 20px with a tolerance of ±5px. Those elements will be transformed into h1 tags, and so on.

EDIT: I am able to cluster the fontSizes into as many clusters as I want using the SimpleKMeans clustering algorithm with the Manhattan distance function in Weka. However, I get a precise value for each cluster, for example: the font size 10px occurs 100 times, 20px occurs 200 times, etc. I need a range for each cluster instead of a single specific value, so that all the values are covered.

Why would you rely on clustering for this? It's not reliable enough for this. — Has QUIT--Anony-Mousse, Mar 08 '17 at 19:35
I am looking for any advice from you guys. What do you suggest? — Mehdi Benmoha, Mar 08 '17 at 23:45

score 0 · Answer 1 · answered Mar 09 '17 at 07:22

First of all, this would be a comment, but I'm new and can't write comments yet.

I am able to cluster the fontSizes into as many clusters as I want using the SimpleKMeans clustering algorithm with the Manhattan distance function in Weka. However, I get a precise value for each cluster, for example: the font size 10px occurs 100 times, 20px occurs 200 times, etc. I need a range for each cluster instead of a single specific value, so that all the values are covered.

You can specify the number of clusters with an option named something like "numClusters". That way you can force Weka to build as many clusters as you want, which means each cluster has to cover a range rather than a single value whenever you have more distinct values than clusters.
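To turn the cluster centres into explicit ranges, one option is to cut the axis at the midpoint between neighbouring centres, since with a single numeric attribute each value is assigned to its nearest centre. The following is only a minimal sketch under that assumption; the class name ClusterRanges, the method rangesFor and the half-open [low, high) convention are made up for illustration and are not part of Weka.

```java
import java.util.Arrays;

// Hypothetical helper, not a Weka class: derives font-size ranges from
// one-dimensional cluster centres (for example the centres reported by k-means).
final class ClusterRanges {

    // Returns one {low, high} pair per centre, ordered by ascending centre.
    // Each boundary is the midpoint between two neighbouring centres; the
    // outermost ranges are open-ended. Treating a range as [low, high) is an
    // assumed convention for values that fall exactly on a boundary.
    static double[][] rangesFor(double[] centres) {
        double[] sorted = centres.clone();
        Arrays.sort(sorted);
        double[][] ranges = new double[sorted.length][2];
        for (int i = 0; i < sorted.length; i++) {
            ranges[i][0] = (i == 0)
                    ? Double.NEGATIVE_INFINITY
                    : (sorted[i - 1] + sorted[i]) / 2.0;
            ranges[i][1] = (i == sorted.length - 1)
                    ? Double.POSITIVE_INFINITY
                    : (sorted[i] + sorted[i + 1]) / 2.0;
        }
        return ranges;
    }
}
```

For example, centres of 10, 14 and 20 would give the ranges below 12, from 12 to 17, and from 17 upwards.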

But here is my question: why don't you use a simple loop to iterate over your data and specify manually what you want? Something like:

if (fontSize < 10) {
    /* handle font sizes below 10 */
} else if (fontSize < 20) {
    /* handle font sizes from 10 up to, but not including, 20 */
}

That appears to be far more reliable and easier. Even if you have more attributes, just define the attribute ranges for every cluster by hand and check, for each record in your data, whether it fits into one of your clusters.
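A hand-written rule for the "20px with a tolerance of ±5px" case from the question could look roughly like the sketch below. The h1 target of 20px and the 5px tolerance come from the question; the 10px target for p, the class name FontSizeTagger and the method tagFor are assumptions made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps a font size to a tag using hand-defined targets.
final class FontSizeTagger {

    // Tolerance around each target size, as stated in the question.
    static final double TOLERANCE_PX = 5.0;

    // Target font size per tag, in insertion order.
    static final Map<String, Double> TARGET_SIZES = new LinkedHashMap<>();
    static {
        TARGET_SIZES.put("h1", 20.0); // from the question
        TARGET_SIZES.put("p", 10.0);  // assumed value, adjust to your data
    }

    // Returns the tag whose target size is closest to fontSizePx, provided the
    // difference is within the tolerance; empty if no target is close enough.
    // On an exact tie, the tag listed first wins.
    static Optional<String> tagFor(double fontSizePx) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> entry : TARGET_SIZES.entrySet()) {
            double distance = Math.abs(fontSizePx - entry.getValue());
            if (distance <= TOLERANCE_PX && distance < bestDistance) {
                best = entry.getKey();
                bestDistance = distance;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

You would call FontSizeTagger.tagFor(size) for every row of the CSV file and keep the rows for which a tag is returned. Picking the nearest target instead of the first matching range avoids surprises when two tolerance bands overlap.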

I would only recommend something like Weka for this task if you have an overwhelming number of attributes or clusters, or if you do not understand the data very well. Your task doesn't look that way.

score 0 · Answer 2 · answered Mar 09 '17 at 14:09

Try the machine-learning-based boilerpipe Java API. You can test the different models online.

Istvan Nagy
