A DCG based solution
I would like to add a DCG based solution to the already existing solutions.
Advantages of DCGs
There are a few major advantages of using DCGs for this task:
- You can easily test your parser interactively, without having to modify a separate file.
- A sufficiently general DCG can be used to parse as well as generate test data.
- Knowing this method may come in handy for more complex parsing tasks, which do not fit a predetermined format like CSV.
Preliminaries
The following code assumes the setting:
:- set_prolog_flag(double_quotes, chars).
I recommend this setting because it makes working with DCGs more readable.
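To make the effect of this flag concrete, here is a minimal sketch. The predicate name sample_chars/1 is hypothetical and exists only for illustration: with the flag set, a double-quoted literal is read as a list of one-character atoms, which is exactly the kind of list a DCG describes.

```prolog
:- set_prolog_flag(double_quotes, chars).

% sample_chars/1 is a hypothetical helper, used only for illustration.
% With double_quotes set to chars, the literal below is read as the
% list ['P','I','T',' ','2'].
sample_chars(Cs) :- Cs = "PIT 2".
```

Querying sample_chars(Cs) should therefore yield a list of five characters, including the space.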
Building block: token//1
We start with a short definition of what a token means:
token(T) -->
        alnum(L),
        token_(Ls),
        !, % single solution: longest match
        { atom_chars(T, [L|Ls]) }.
alnum(A) --> [A], { char_type(A, alnum) }.
token_([L|Ls]) --> alnum(L), token_(Ls).
token_([]) --> [].
Sample queries
Here are a few examples:
?- phrase(token(T), "GOLD").
T = 'GOLD'.
?- phrase(token(T), "2").
T = '2'.
?- phrase(token(T), "GOLD 2").
false.
The last example makes clear that whitespace cannot be part of a token.
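To see where the token ends in that last case, phrase/3 can be used instead of phrase/2, because it also reports the characters that remain after the match. A small sketch, assuming the definitions above are loaded and double_quotes is set to chars:

```prolog
?- phrase(token(T), "GOLD 2", Rest).
T = 'GOLD',
Rest = [' ', '2'].
```

The token stops at the space, and phrase/2 fails only because it requires the remainder to be empty.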
Whitespace
We regard as whitespace the following sequences:
spaces --> [].
spaces --> space, spaces.
space --> [S], { char_type(S, space) }.
Solution
Hence, a sequence of tokens separated by whitespace is:
tokens([]) --> [].
tokens([T|Ts]) --> token(T), spaces, tokens(Ts).
And that's it!
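Before moving on to files, the tokenizer can be tried interactively on a list of characters. This is only a sketch, assuming the definitions above are loaded and double_quotes is set to chars; the input text is made up for the example:

```prolog
?- phrase(tokens(Ts), "GOLD 3 2\nPIT 2 1\n").
Ts = ['GOLD', '3', '2', 'PIT', '2', '1'] .
```

Newlines are treated like any other whitespace, so line structure is not preserved in the token list.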
We can now transparently apply this DCG to files, using Ulrich Neumerkel's visionary library(pio) [1].
Here is wumpus.data:
$ cat wumpus.data
GOLD 3 2
WUMPUS 3 3
PIT 2 1
PIT 3 4
Using phrase_from_file/2 to apply the DCG to the file, we get:
?- phrase_from_file(tokens(Ts), 'wumpus.data').
Ts = ['GOLD', '3', '2', 'WUMPUS', '3', '3', 'PIT', '2', '1', 'PIT', '3', '4'] .
From such a list of tokens, it is easy to derive the necessary data, for example with another DCG:
data([]) --> [].
data([D|Ds]) --> data_(D), data(Ds).
data_(gold(X,Y)) --> ['GOLD'], coords(X, Y).
data_(wumpus(X,Y)) --> ['WUMPUS'], coords(X, Y).
data_(pit(X,Y)) --> ['PIT'], coords(X, Y).
coords(X, Y) --> atom_number(X), atom_number(Y).
atom_number(N) --> [A], { atom_number(A, N) }.
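This second grammar works on a list of atoms rather than characters, so it can be tested on its own with a hand-written token list. A quick sketch, assuming the definitions above are loaded:

```prolog
?- phrase(data(Ds), ['GOLD', '3', '2', 'PIT', '2', '1']).
Ds = [gold(3, 2), pit(2, 1)] .
```

Note that atom_number//1 converts each coordinate atom to a number, so a token such as 'x' in a coordinate position makes the parse fail.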
We can use these DCGs together to:
- tokenize a file or given list of characters
- parse the tokens to create structured data.
Sample query:
?- phrase_from_file(tokens(Ts), 'wumpus.data'),
   phrase(data(Ds), Ts).
Ts = ['GOLD', '3', '2', 'WUMPUS', '3', '3', 'PIT', '2', '1'|...],
Ds = [gold(3, 2), wumpus(3, 3), pit(2, 1), pit(3, 4)] .
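The two stages can also be wrapped in a single predicate. The following is a minimal sketch for text that is already in memory; chars_data/2 is a hypothetical name, and the code assumes tokens//1 and data//1 from above are loaded:

```prolog
% chars_data(+Chars, -Ds): hypothetical convenience predicate.
% Tokenizes a list of characters with tokens//1, then parses the
% resulting atoms into structured terms with data//1.
chars_data(Chars, Ds) :-
        phrase(tokens(Ts), Chars),
        phrase(data(Ds), Ts).
```

With double_quotes set to chars, a query such as chars_data("WUMPUS 3 3", Ds) should yield Ds = [wumpus(3, 3)].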
See dcg for more information about this versatile mechanism.
[1] Please note that SWI-Prolog ships with an outdated version of library(pio), which does not work with double_quotes set to chars. Use the version supplied by Ulrich directly if you want to try this with SWI-Prolog.