I'm designing a program for entity extraction from HTML pages. I have a sketch of a design, but I'm not happy with it, since it strongly couples my algorithm classes to the HTML parser I chose. I'd be happy to hear suggestions for a better design.

My design is as follows:

import java.net.URI;
import java.util.List;
import java.util.function.Predicate;

public interface HTMLSearcherInterface
{
    void readHTML(URI uri);
    List<SearchResultInterface> searchContent(Predicate<String> predicate);
}

public interface SearchResultInterface
{
    String getResultText();
    Node getResultNode();
}

I also have an EntityExtractor that holds an HTMLSearcherInterface and uses it to search the HTML file for keywords, around which it then looks for other details. This is why I need getResultNode() on the search results.

Mr.WorshipMe

1 Answer

While typing this question up, I thought of a solution:

Having to return the Node to the algorithm so that it could search around it indicated that my HTMLSearcherInterface was just not good enough: every interaction I have with the HTML should be mediated by it, and I should not find myself traversing the tree in my algorithm.

So I changed it to:

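import java.net.URI;
import java.util.function.Predicate;

public interface HTMLSearcherInterface
{
    void readHTML(URI uri);
    // Search methods return nothing; results are read back through getNextResult().
    void searchContent(Predicate<String> predicate);
    void searchInResultSiblings(SearchResultInterface result, Predicate<String> predicate);
    void searchInResultParents(SearchResultInterface result, int maxDistance, Predicate<String> predicate);
    void searchInResultChildren(SearchResultInterface result, int maxDistance, Predicate<String> predicate);
    SearchResultInterface getNextResult();
    boolean hasNextResult();
}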

public interface SearchResultInterface
{
    String getResultText();
    // Opaque identifier that replaces the parser-specific Node.
    long getResultId();
}

This solves the problem of strong coupling to the HTML parsing library. The downside is that the searcher is now stateful: running a search and then reading its results through getNextResult() is no longer thread-safe on a shared instance.
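To make the search-then-iterate protocol concrete, here is a minimal sketch of how an extractor might drive it. It uses a trimmed copy of the interfaces (no readHTML, siblings only), and InMemorySearcher is a hypothetical stand-in I made up for illustration: it treats the "document" as a flat list of text fragments that are all siblings, instead of wrapping a real parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Trimmed copies of the interfaces above (assumption: siblings only, no URI loading).
interface SearchResultInterface {
    String getResultText();
    long getResultId();
}

interface HTMLSearcherInterface {
    void searchContent(Predicate<String> predicate);
    void searchInResultSiblings(SearchResultInterface result, Predicate<String> predicate);
    SearchResultInterface getNextResult();
    boolean hasNextResult();
}

// Hypothetical stand-in for a parser-backed searcher: every fragment is a sibling of the others.
class InMemorySearcher implements HTMLSearcherInterface {
    private static final class Hit implements SearchResultInterface {
        private final long id;
        private final String text;
        Hit(long id, String text) { this.id = id; this.text = text; }
        public String getResultText() { return text; }
        public long getResultId() { return id; }
    }

    private final List<String> fragments;
    // Mutable cursor state: each search call replaces the pending results.
    private final Deque<SearchResultInterface> pending = new ArrayDeque<>();

    InMemorySearcher(List<String> fragments) { this.fragments = new ArrayList<>(fragments); }

    @Override
    public void searchContent(Predicate<String> predicate) {
        pending.clear();
        for (int i = 0; i < fragments.size(); i++) {
            if (predicate.test(fragments.get(i))) pending.add(new Hit(i, fragments.get(i)));
        }
    }

    @Override
    public void searchInResultSiblings(SearchResultInterface result, Predicate<String> predicate) {
        pending.clear();
        for (int i = 0; i < fragments.size(); i++) {
            // Skip the anchor itself; everything else counts as a sibling in this toy model.
            if (i != result.getResultId() && predicate.test(fragments.get(i))) {
                pending.add(new Hit(i, fragments.get(i)));
            }
        }
    }

    @Override public boolean hasNextResult() { return !pending.isEmpty(); }
    @Override public SearchResultInterface getNextResult() { return pending.remove(); }
}

// The extractor only sees the searcher interface, never a parser node.
class EntityExtractor {
    private final HTMLSearcherInterface searcher;
    EntityExtractor(HTMLSearcherInterface searcher) { this.searcher = searcher; }

    // Finds fragments containing the keyword, then collects sibling fragments that contain a digit.
    List<String> extractNear(String keyword) {
        searcher.searchContent(text -> text.contains(keyword));
        // Drain the keyword hits first: the next search call overwrites the pending results.
        List<SearchResultInterface> anchors = new ArrayList<>();
        while (searcher.hasNextResult()) anchors.add(searcher.getNextResult());

        List<String> details = new ArrayList<>();
        for (SearchResultInterface anchor : anchors) {
            searcher.searchInResultSiblings(anchor, text -> text.matches(".*\\d.*"));
            while (searcher.hasNextResult()) details.add(searcher.getNextResult().getResultText());
        }
        return details;
    }
}

class ExtractorDemo {
    public static void main(String[] args) {
        HTMLSearcherInterface searcher =
                new InMemorySearcher(Arrays.asList("Phone:", "555-0100", "About us"));
        System.out.println(new EntityExtractor(searcher).extractNear("Phone")); // prints [555-0100]
    }
}
```

Note the drain step in extractNear: because the pending results live inside the searcher, any later search call, from this thread or another one, replaces them. That shared cursor is exactly what makes a single instance unsafe to share.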

But if thread safety is important, each thread can use its own searcher instance, and if the document is really big (which I doubt happens in practice), the searchers could share the same parsed document, handed out by a factory.
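A rough sketch of that idea, under my own assumptions: ParsedDocument, DocumentSearcher and SearcherFactory are hypothetical names, the "document" is again just an immutable list of text fragments, and the searcher exposes its results as a list to keep the example short. With a real parser you would also have to check that its document object tolerates concurrent reads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical immutable stand-in for a parsed HTML document.
final class ParsedDocument {
    private final List<String> fragments;
    ParsedDocument(List<String> fragments) {
        this.fragments = Collections.unmodifiableList(new ArrayList<>(fragments));
    }
    List<String> fragments() { return fragments; }
}

// Each searcher owns its result list, so search state is never shared between threads.
final class DocumentSearcher {
    private final ParsedDocument document;
    private final List<String> results = new ArrayList<>();
    DocumentSearcher(ParsedDocument document) { this.document = document; }

    void searchContent(Predicate<String> predicate) {
        results.clear();
        for (String fragment : document.fragments()) {
            if (predicate.test(fragment)) results.add(fragment);
        }
    }

    // Returns a copy so callers cannot touch the searcher's internal state.
    List<String> results() { return new ArrayList<>(results); }
}

// Builds the shared document once and hands out independent searchers over it.
final class SearcherFactory {
    private final ParsedDocument shared;
    SearcherFactory(List<String> fragments) { this.shared = new ParsedDocument(fragments); }
    DocumentSearcher newSearcher() { return new DocumentSearcher(shared); }
}

class FactoryDemo {
    public static void main(String[] args) throws InterruptedException {
        SearcherFactory factory = new SearcherFactory(
                Arrays.asList("Phone: 555-0100", "Fax: 555-0199", "About us"));
        Runnable job = () -> {
            DocumentSearcher searcher = factory.newSearcher(); // one searcher per thread
            searcher.searchContent(text -> text.startsWith("Phone"));
            System.out.println(Thread.currentThread().getName() + " " + searcher.results());
        };
        Thread a = new Thread(job, "worker-a");
        Thread b = new Thread(job, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

The only object the threads have in common is the read-only ParsedDocument; the mutable result list stays private to each searcher.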

Mr.WorshipMe