I have a huge HTML file with text, tables and images (with alt text). I have a full-text search function for this file only, but at the moment it relies on strict string comparison. I want to improve the function so that it returns the top 5 paragraphs (<p></p>), tables or images, ranked by relevance to a query.

A few problems I have now:

Example 1 (misspelling):

Query: "sta**kc**overflow"
Text: "....this is stackoverflow...." 

Example 2 (strict comparison):

Query: "full text searching"
Text:  "...full searching..."

I have researched existing Python libraries and found Elasticsearch and Whoosh, but it is hard to find an example in their documentation of full-text search over HTML. Do you have an example, or another library that you could suggest?

Sfinos
  • Are you looking for ways to compare strings, for instance [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance)? In that case I'm voting to close this question, because it has been asked so many times and we're not here to help you find libraries; we're here to solve actual programming problems. If you're just looking to index HTML data, then [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) is your answer. Either way, I'm not sure you understand what ElasticSearch is: it's a database/text-file search engine, and it only indexes data for frontends. – Torxed Jun 03 '14 at 15:57
  • Do you want to search inside all of the HTML (including tags, attributes, CDATA, ...) or only in the text part? I recommend BeautifulSoup for this kind of thing. The first option is very easy, and the second requires only a depth-first pass over the HTML tree. – Pierre Turpin Jun 03 '14 at 15:58
  • No, I am not looking for Levenshtein Distance. I am looking for a library that could search in an HTML file using Information Retrieval techniques to solve problems that string comparison has. – Sfinos Jun 03 '14 at 16:02

1 Answer

Try BeautifulSoup - very easy to install and get up to speed with, and well-respected in the Python community. Good documentation too:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/

There's even a

   soup.get_text()

function, amongst many others.
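
As a rough illustration of how extracted text could be ranked against a query, here is a minimal sketch that uses only the standard library: `html.parser` stands in for BeautifulSoup, and `difflib.SequenceMatcher` provides a simple fuzzy token match. This is not an information-retrieval engine like Elasticsearch or Whoosh. The names `BlockCollector`, `token_score` and `rank_blocks` are hypothetical, and the sketch assumes every `<p>` and `<table>` is explicitly closed.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collect the text of <p> and <table> elements and the alt text of <img>."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # (kind, text) pairs in document order
        self._kind = None  # tag of the block currently open, or None
        self._depth = 0    # nesting depth of that tag (for nested tables)
        self._parts = []   # text fragments of the open block

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.blocks.append(("img", alt))
        elif self._kind is None and tag in ("p", "table"):
            self._kind, self._depth, self._parts = tag, 1, []
        elif tag == self._kind:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == self._kind:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join(" ".join(self._parts).split())
                if text:
                    self.blocks.append((self._kind, text))
                self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._parts.append(data)


def token_score(query_token, text_tokens):
    """Best fuzzy similarity (0..1) between one query token and any text token."""
    return max(
        (SequenceMatcher(None, query_token, t).ratio() for t in text_tokens),
        default=0.0,
    )


def rank_blocks(html, query, top_n=5):
    """Return up to top_n (score, kind, text) tuples, best match first."""
    query_tokens = query.lower().split()
    if not query_tokens:
        return []
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    scored = []
    for kind, text in collector.blocks:
        text_tokens = text.lower().split()
        # Average the best per-token match so a missing or misspelled
        # word lowers the score instead of excluding the block.
        total = sum(token_score(q, text_tokens) for q in query_tokens)
        scored.append((total / len(query_tokens), kind, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    sample = "<p>this is stackoverflow</p><p>full searching</p>"
    for score, kind, text in rank_blocks(sample, "stakcoverflow"):
        print(f"{score:.2f} {kind}: {text}")
```

Averaging per-token similarity tolerates both the misspelling and the missing-word example from the question, but it compares every query token with every token of every block, so for a very large file a proper index is likely the better choice.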

Hektor
  • I know bs4 and I have worked with it, but as far as I know it doesn't solve the problems in my examples, whereas ElasticSearch and Whoosh do. – Sfinos Jun 03 '14 at 16:00
  • Then I think you'll need to write a couple of wrapper functions to get bs4 to do what you want. Regular expressions can be used as search criteria in bs4, but you'll need a wrapper function with conditionals to decide what is returned. Also, I'm not sure what you mean by images being sorted according to a query. – Hektor Jun 03 '14 at 16:04
  • I want to return paragraphs, tables and images ranked by their similarity to the search query. – Sfinos Jun 03 '14 at 16:08
  • In this case, I think lxml is better for your purpose. You can use XPath to search the text inside p, table and image tags, or to choose where a depth-first search starts. – Pierre Turpin Jun 03 '14 at 16:12