What is the best data-mining strategy for extracting titles and paragraphs from an HTML file based on the elements' style (fontSize, fontWeight, …)? I have already extracted the text and the fontSize attribute of each element and put them in a CSV file. Now I need to know how to classify (or cluster?) this data so that it gives me, for example, all the elements that have a fontSize of 20px with a tolerance of ±5px. Those elements would then be transformed into h1 tags, and so on.
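To make the tolerance idea concrete, here is a minimal sketch of the simplest form of that selection: keeping every element whose font size lies within a fixed band around a target value. The CSV column names (`text`, `fontSize`), the sample rows, and the helper names `parse_px` and `elements_in_band` are assumptions made for illustration, not the layout of the real file.

```python
import csv
import html
import io

def parse_px(value):
    """Convert a CSS size such as '20px' or '18.5' to a float."""
    cleaned = value.strip().lower()
    if cleaned.endswith("px"):
        cleaned = cleaned[:-2]
    return float(cleaned)

def elements_in_band(rows, target, tolerance):
    """Return the rows whose fontSize is within target +/- tolerance (inclusive)."""
    return [
        row for row in rows
        if abs(parse_px(row["fontSize"]) - target) <= tolerance
    ]

# Hypothetical sample data standing in for the extracted CSV file.
sample_csv = """text,fontSize
Main title,22px
Section heading,14px
Body paragraph,12px
Another title,18px
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Everything between 15px and 25px is treated as an h1 candidate.
h1_candidates = elements_in_band(rows, target=20, tolerance=5)
for row in h1_candidates:
    print(f"<h1>{html.escape(row['text'])}</h1>")
```

A fixed band like this only works when the target sizes are known in advance, which is why clustering is attractive when they are not.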
EDIT: I am able to cluster the fontSizes into as many clusters as I want using the SimpleKMeans clustering algorithm with the Manhattan distance function in Weka. However, I get a single precise value for each cluster, for example: font-size 10px is matched 100 times, 20px 200 times, etc. What I need is a range for each cluster rather than a single value, so that every font size falls into one of the ranges.
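One possible way to get ranges out of such a result, sketched here under assumptions: with a single numeric attribute, each value is assigned to its nearest cluster center, so the boundary between two neighbouring clusters lies at the midpoint between their centers. The center values below are invented for illustration and are not actual Weka output, and `centers_to_ranges` and `range_index` are hypothetical helper names.

```python
import math

def centers_to_ranges(centers):
    """Return a (center, low, high) tuple per cluster, cutting at the
    midpoints between neighbouring centers. Ranges are half-open: [low, high)."""
    ordered = sorted(centers)
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    lows = [0.0] + cuts          # font sizes are assumed to be non-negative
    highs = cuts + [math.inf]    # the largest cluster has no upper bound
    return list(zip(ordered, lows, highs))

def range_index(size, ranges):
    """Return the index of the range that contains the given font size."""
    for index, (_, low, high) in enumerate(ranges):
        if low <= size < high:
            return index
    raise ValueError(f"font size {size} is outside every range")

# Hypothetical cluster centers, as a k-means run with k=3 might report them.
ranges = centers_to_ranges([10, 20, 32])

for center, low, high in ranges:
    print(f"center {center}px covers [{low}, {high})")

# A 17px element falls into the range of the 20px cluster (index 1).
print(range_index(17, ranges))
```

A size that sits exactly on a cut point goes to the larger cluster in this sketch. An alternative is to record the minimum and maximum size actually assigned to each cluster, which gives tighter ranges but leaves gaps between them.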