I am using the SWI-Prolog library(http/http_open). According to the docs, "After [http_open(Url, Stream, [])] succeeds the data can be read from Stream." Thus, I thought maybe I could rig up a simple, declarative predicate to parse phrases from URLs by using phrase_from_stream/2 in library(pure_input):

phrase_from_url(Url, Phrase) :-
    http_open(Url, In, []),
    phrase_from_stream(Phrase, In),
    close(In).

But I suspect there is some nuance to the kinds of stream provided by http_open/3; I receive the following error:

ERROR: set_stream_position/2: stream `<stream>(0x7feebbf5c810)' does not exist (Device not configured)

(I have tested the same url against the example provided on the library(http/http_open) docs, which uses copy_stream_data/2 to pipe the output to user_output, and it works. So I know the url is not at fault.)

I have learned that I can download the data from the url into a string, code-list, or text file, and then use phrase/2 or phrase/3 on that. But I'm hoping someone can help inform me about...

  1. ...an elegant/standard solution to parsing data from a url with DCGs
  2. ...maybe some insight into why we cannot use phrase_from_stream/2 on some streams, as one might naively hope.
Shon

2 Answers

1

As it is at the moment, library(pure_input) does not support non-repositioning streams. This is the problem.
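A quick way to see whether a given stream reports itself as repositionable is the ISO stream property reposition(Bool). This is only a minimal sketch: `repositionable/1` is a hypothetical helper name, and it is an assumption that this property reflects exactly what library(pure_input) needs in your SWI-Prolog version.

```prolog
%% repositionable(+Stream)
%  True if Stream reports reposition(true), i.e. it claims to
%  support set_stream_position/2. (Hypothetical helper name.)
repositionable(Stream) :-
    stream_property(Stream, reposition(true)).
```

Calling `repositionable(In)` on the stream returned by http_open/3 before handing it to phrase_from_stream/2 should tell you whether you need to fall back to reading the data first.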

One solution is to read everything and then call phrase/2 on the result. This, of course, is not the "lazy reading" that library(pure_input) promises.

As for "parsing data from URL", keep in mind that SWI-Prolog has libraries for many things you find on the web: SGML/XML/HTML; JSON; RDF.

For picking out text from an html page, see for example this simple scraper: https://github.com/StanfordOSAcademySWIProlog/contentteam/blob/master/scrape.pl The relevant code is in scrape/3 and its helper predicates. It uses the SWI-Prolog SGML/XML parser and library(xpath).
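To give an idea of what the parser-plus-xpath approach looks like, here is a minimal sketch that collects the text of every `p` element on a page. It is not the code from the linked scraper; `paragraph_text/2` and `url_paragraphs/2` are hypothetical names made up for illustration.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
:- use_module(library(xpath)).

%% paragraph_text(+DOM, -Text)
%  On backtracking, Text is the whitespace-normalized text of
%  each <p> element found anywhere in DOM.
paragraph_text(DOM, Text) :-
    xpath(DOM, //p(normalize_space), Text).

%% url_paragraphs(+Url, -Texts)
%  Fetch Url, parse the body as HTML, and collect all paragraph texts.
%  The stream is closed even if parsing fails or raises.
url_paragraphs(Url, Texts) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        load_html(stream(In), DOM, []),
        close(In)),
    findall(T, paragraph_text(DOM, T), Texts).
```

Since load_html/3 reads the stream sequentially, it does not need the stream to be repositionable.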

In the meantime, if you want to use a DCG to parse from a non-repositioning stream, tough luck. library(pure_input) does not even work on the standard input. What you can do, depending on how your data is structured, is either use read_line_to_codes/3 (see the example), if your input is organized line-wise, or read_pending_input/3 if it is not, and read to a buffer.
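For the line-wise case, something along these lines could work. It is a sketch under a few assumptions: it uses read_line_to_codes/2 (the sibling of read_line_to_codes/3 that drops the line terminator) to keep things short, `parse_lines/3` is a hypothetical name, and `key_value//1` is a made-up grammar for lines of the form key=value.

```prolog
:- use_module(library(readutil)).
:- use_module(library(dcg/basics)).

:- meta_predicate parse_lines(+, 3, -).

%% parse_lines(+In, :LineRule, -Results)
%  Read In one line at a time and parse each line with the
%  non-terminal LineRule//1. Fails if any line does not parse.
parse_lines(In, LineRule, Results) :-
    read_line_to_codes(In, Line),
    (   Line == end_of_file
    ->  Results = []
    ;   phrase(call(LineRule, R), Line),
        Results = [R|Rs],
        parse_lines(In, LineRule, Rs)
    ).

% Made-up example grammar: a line of the form key=value.
key_value(Key-Value) -->
    string_without(`=`, KeyCodes), "=", remainder(ValueCodes),
    { atom_codes(Key, KeyCodes),
      atom_codes(Value, ValueCodes)
    }.
```

Only one line is held in memory at a time, so this gets you part of the way towards lazy reading as long as your grammar never needs to look across a line boundary.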

  • 1
    Perfect! So it turns out to be a quirk related to the magic enabling library(pure_input). Now that I know the term to search for, I see Jan has "Support non-repositioning streams, such as sockets and pipes." on the tbd. Many thanks! – Shon Feb 18 '15 at 06:03
  • 1
    @aBathologist the tbd has been there for a while. My only hope is that someone other than Jan or Ulrich Neumerkel needs it badly enough to actually do the work. –  Feb 18 '15 at 06:06
  • 1
    @aBathologist If you are scraping `<p>`s and such, parsing the html followed by xpath does the job quite well. See for example here: https://github.com/StanfordOSAcademySWIProlog/contentteam/blob/master/scrape.pl The relevant code is in `scrape/3` and its helper predicates.

    –  Feb 18 '15 at 06:11
1

As Boris pointed out, non-repositioning streams cannot be used with library(pure_input). read_stream_to_codes/2, followed by phrase/2, will give you a practical way to test your grammar against real data.
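A minimal sketch of that approach, reusing the predicate name from the question. `phrase_from_whole_stream/2` is a hypothetical helper, and the whole response body is read into memory before parsing starts, so this is eager rather than lazy.

```prolog
:- use_module(library(http/http_open)).
:- use_module(library(readutil)).

:- meta_predicate
    phrase_from_whole_stream(//, +),
    phrase_from_url(+, //).

%% phrase_from_whole_stream(:Phrase, +In)
%  Read everything from In into a code list, then parse it.
phrase_from_whole_stream(Phrase, In) :-
    read_stream_to_codes(In, Codes),
    phrase(Phrase, Codes).

%% phrase_from_url(+Url, :Phrase)
%  Eager replacement for the version in the question. The stream
%  is closed even if the grammar fails or raises an exception.
phrase_from_url(Url, Phrase) :-
    setup_call_cleanup(
        http_open(Url, In, []),
        phrase_from_whole_stream(Phrase, In),
        close(In)).
```

Wrapping the call in setup_call_cleanup/3 also avoids leaking the stream when the grammar fails, which the original close(In) at the end of the clause would not handle.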

But 'real world' HTML is very difficult to parse (even with the support of the built-in SGML parser), because of the poor error handling. So debugging a DCG can be a nightmare, even on well-behaved grammars.

CapelliC
  • 1
    I assumed that he must be parsing something else. But you are right with regards to "real world" html: the struggle is real. –  Feb 18 '15 at 06:05
  • Thanks, CapelliC. @Boris is correct that I am only grabbing text from urls at the moment (at the worst, scraping out the contents of `<p>`s and such). But this is good to know about 'real world' HTML handling. The poor error handling you speak of is apparently on SWI's part? So it would be better, I guess, to parse html in some other language, and then pass it back to my prolog once processed... Something to keep in mind. Thanks!

    – Shon Feb 18 '15 at 06:08
  • 1
    @aBathologist The way I understood CapelliC's answer, real world HTML is very difficult to parse because of the way it is usually used (structure _and_ formatting). Even using a proper, well-tested parser is difficult; if you try writing your own grammar, you are really going to have problems (this is compounded by the fact that DCGs are a bit difficult to debug...) –  Feb 18 '15 at 06:38
  • 1
    @aBathologist: at least, while debugging a DCG, I'd like to have codes lists displayed in readable format, to visualize the 'cursor'. I tried portray_text, set_prolog_flag(double_quotes, codes), and others, without luck... – CapelliC Feb 18 '15 at 09:21