How to get the content or title of a Wikipedia page using Erlang?

Question

-module(wikipedia).
-export([main/0]).
-define(Url, "http://en.wikipedia.org/w/api.php?format=xml&action=parse&prop=sections&page=Chicago").
-define(Match, "^[A-Za-z]+[A-Za-z0-9]*$").

main() ->
    inets:start(),
    %% Start ssl application
    ssl:start(),
    {ok, {_Status, _Header, Body}} = httpc:request(?Url),
    T = re:run(Body, ?Match, [{capture, all_but_first, binary}]),
    io:format("~s~n",[T]).

I want to store the content of the Wikipedia page in "T" using the regular expression Match, and then fetch the title from it. But the code above prints nomatch. I don't understand how to fetch the title of a Wikipedia page using Erlang. Please help (I am new to Erlang). I want something like: https://stackoverflow.com/questions/13459598/how-to-get-titles-from-a-wikipedia-page

What line has the `nomatch` error? Can you include the stacktrace in your question? — Stratus3D, Jul 29 '17 at 19:33
Also, that page is xml, so I'd recommend using http://erlang.org/doc/apps/xmerl/xmerl_ug.html to parse the XML and extract the content you want. — Stratus3D, Jul 29 '17 at 19:35
Ah ok, so the `io:format/2` call is printing `nomatch`, which means that is the value of `T`. Which means the `re:run/3` call didn't find anything matching your regex. — Stratus3D, Jul 31 '17 at 00:58
That would make sense, since your regex doesn't allow for anything besides letters and numbers, but the XML is going to contain many other characters. What is that regex supposed to be doing? — Stratus3D, Jul 31 '17 at 00:59
My aim was to fetch "title" and "summary". I was testing whether the code can fetch anything at all (that is why I used that regex). Can you help me with this? It would be helpful. @Stratus3D — hithard, Aug 01 '17 at 07:20
If you want to see whether the command fetched anything, you do not need the regex. All the XML should be returned if you remove the `re:run/3` call and just print the body instead. — Stratus3D, Aug 01 '17 at 13:56
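
The point made in the comments can be reproduced offline. The sketch below is only an illustration: the module name `regex_demo`, its function names, and the trimmed-down XML in `sample_xml/0` are assumptions, not the real API response. It shows the original anchored pattern failing on an XML body, and a pattern that targets the `title` attribute succeeding. A regex over XML is fragile, so treat the second function as a quick check rather than a parser.

```erlang
-module(regex_demo).
-export([sample_xml/0, asker_pattern/1, title_by_regex/1]).

%% Hypothetical, shortened response body (an assumption for offline testing).
sample_xml() ->
    "<?xml version=\"1.0\"?>"
    "<api><parse title=\"Chicago\"><sections/></parse></api>".

%% The pattern from the question: the whole subject must consist of
%% letters and digits only, so a body starting with '<' cannot match.
asker_pattern(Body) ->
    re:run(Body, "^[A-Za-z]+[A-Za-z0-9]*$",
           [{capture, all_but_first, binary}]).

%% Capture whatever sits between the quotes of the first title="..." attribute.
title_by_regex(Body) ->
    case re:run(Body, "title=\"([^\"]*)\"",
                [{capture, all_but_first, list}]) of
        {match, [Title]} -> {ok, Title};
        nomatch -> error
    end.
```

Calling `regex_demo:asker_pattern(regex_demo:sample_xml())` returns `nomatch`, which is the value the question's `io:format/2` call ends up printing.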

score 2 · Answer 1 · answered Aug 02 '17 at 05:47

First, I think the title is already in your URL: "Chicago". If that is the case, just pattern match the URL to obtain the title. If not, I suggest using an XML parsing library such as xmerl:

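-module(parse_title).
-include_lib("xmerl/include/xmerl.hrl").

-export([main/0]).

main() ->
  %% httpc lives in inets; ssl is needed if the request ends up on HTTPS.
  inets:start(),
  ssl:start(),
  U = "http://en.wikipedia.org/w/api.php?format=xml&action=parse&prop=sections&page=Chicago",
  %% With default options, Body comes back as a string (a list of characters).
  {ok, {_, _, Body}} = httpc:request(U),
  %% Parse the XML; the second tuple element is any unparsed remainder.
  {Xml,_} = xmerl_scan:string(Body),
  %% Select the title attribute of <parse> and keep only its value.
  [Title|_] = [Value || #xmlAttribute{value = Value} <- xmerl_xpath:string("//api/parse/@title", Xml)],
  Title.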
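
Both suggestions in this answer can be tried without a network connection. The following is a minimal sketch, not a definitive implementation: the module name `wiki_title_sketch`, its function names, and the XML in `sample_xml/0` are assumptions made for illustration, and the real response carries more attributes. `title_from_url/1` reads the `page` query parameter with `uri_string` (which needs OTP 21 or later), while `title_from_xml/1` and `section_lines/1` reuse the xmerl XPath approach on a string body.

```erlang
-module(wiki_title_sketch).
-include_lib("xmerl/include/xmerl.hrl").
-export([sample_xml/0, title_from_url/1, title_from_xml/1, section_lines/1]).

%% Hypothetical, shortened response body (an assumption for offline testing).
sample_xml() ->
    "<?xml version=\"1.0\"?>"
    "<api><parse title=\"Chicago\"><sections>"
    "<s level=\"2\" line=\"History\" number=\"1\"/>"
    "<s level=\"2\" line=\"Geography\" number=\"2\"/>"
    "</sections></parse></api>".

%% Read the "page" query parameter straight from the request URL.
title_from_url(Url) ->
    case uri_string:parse(Url) of
        #{query := Query} ->
            case uri_string:dissect_query(Query) of
                Params when is_list(Params) ->
                    case lists:keyfind("page", 1, Params) of
                        {"page", Title} when is_list(Title) -> {ok, Title};
                        _ -> error
                    end;
                _ -> error
            end;
        _ -> error
    end.

%% Parse Body and return the values of all attributes selected by XPath.
attr_values(XPath, Body) ->
    {Xml, _Rest} = xmerl_scan:string(Body),
    [V || #xmlAttribute{value = V} <- xmerl_xpath:string(XPath, Xml)].

%% Value of the title attribute on <parse>, if present.
title_from_xml(Body) ->
    case attr_values("//api/parse/@title", Body) of
        [Title | _] -> {ok, Title};
        [] -> error
    end.

%% Values of the line attribute on every <s> (section) element.
section_lines(Body) ->
    attr_values("//api/parse/sections/s/@line", Body).
```

For example, `wiki_title_sketch:title_from_url/1` applied to the URL used above returns `{ok, "Chicago"}`. The title taken from the URL is whatever was requested, so reading the `title` attribute from the XML is the safer choice when the exact page title matters.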
