
I use the sgml library of SWI-Prolog to extract information from a web page. I load the whole page with this call:

load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])

The system loads the page, but I get some warnings, for instance:

WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
WARNING:SGML2PL(sgml): entity "amp" does not exist

How can I eliminate these warnings?

Guy Coder
Joachim Low

1 Answer


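I use this syntax:

:- use_module(library(sgml)).

% get_html_file(+FileOrStream, -P): parse an HTML file or stream,
% unifying P with the single top-level element of the document.
get_html_file(FileOrStream, P) :-
        dtd(html, DTD),
        load_structure(FileOrStream, [P],
                       [ dtd(DTD),
                         dialect(sgml),
                         shorttag(false),
                         syntax_errors(quiet),   % suppress parser warnings
                         max_errors(-1)          % never abort on error count
                       ]).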

The option syntax_errors(quiet) should be enough to suppress the warnings.
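For comparison, here is a minimal sketch of the smallest change to the call from the question: the same option list with only syntax_errors(quiet) added and no explicit DTD. The predicate name load_page_quietly/2 is made up for illustration and is not part of library(sgml). I have not checked this against every SWI-Prolog version, so treat it as a starting point.

```prolog
:- use_module(library(sgml)).

% load_page_quietly(+File, -Content)
% Hypothetical helper: same options as the call in the question,
% plus syntax_errors(quiet) so the parser does not print warnings.
load_page_quietly(File, Content) :-
    load_structure(File, Content,
                   [ dialect(sgml),
                     shorttag(false),
                     syntax_errors(quiet),
                     max_errors(-1)
                   ]).

% Example query, assuming file.html exists in the working directory:
% ?- load_page_quietly('file.html', Content).
```

Note that this only silences the messages; the parser still recovers from the same problems in the same way, so the resulting structure should be the one you got before.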

I recall having a hard time parsing old pages with errors. Error handling can be complicated; a more tolerant tool such as TagSoup could help get the work done.

CapelliC
  • In dtd(html, DTD), what is the value of the variable DTD? Do I have to supply it? – Joachim Low Oct 04 '13 at 10:06
  • No, SWI-Prolog will fill in the structure on the fly while parsing. It's optional, a kind of recap. I sometimes used it to deepen the analysis. – CapelliC Oct 04 '13 at 10:09