While using Import is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread) works fine with a small amount of post-processing. Take the code from there and execute it, augmenting it with this function:
Clear[findAndParseTables];
findAndParseTables[text_String] :=
  Module[{parsed = postProcess@parseText[text]},
    DeleteCases[
        Cases[parsed, _tableContainer, Infinity],
        _attribContainer | _spanContainer, Infinity] //. {
      (* flatten cell and row containers into plain lists *)
      (supContainer | tdContainer | trContainer | thContainer)[x___] :> {x},
      (* splice out inline formatting and link wrappers *)
      iContainer[x___] :> x,
      aContainer[x_] :> x,
      (* drop stray newlines and empty containers *)
      "\n" :> Sequence[],
      divContainer[] | ulContainer[] | liContainer[] | aContainer[] :> Sequence[]}];
Then this code gives you, I think, pretty much the complete data:
text = Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Text"];
myData = First@findAndParseTables[text];
Here is how the result looks:
In[92]:= Short[myData,5]
Out[92]//Short=
tableContainer[{{Country / Region},{Unemployment rate (%)},{Source / date of information}},
{{Afghanistan},{35.0},{2008,{3}}},{{Albania},{13.49},{2010 (Q4),{4}}},
{{Algeria},{10.0},{2010 (September),{5}}},<<188>>,{{West Bank},{17.2},{2010,{43}}},
{{Yemen},{35.0},{2009 (June),{128}}},{{Zambia},{16.0},{2005,{129}}},{{Zimbabwe},{97.0},{2009}}]
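From here, a further set of replacement rules (or a Cases call) gets the data into whatever form you need. For instance, assuming the row structure shown above, something like this should extract country -> unemployment-rate rules (a sketch; whether the rates come out as strings or numbers depends on the parser's output, so you may need a ToExpression on the rate):

rateRules = Cases[myData, {{country_String}, {rate_}, _} :> (country -> rate)];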
What I like about this approach (as opposed to, say, Import with the "XMLObject" element) is that, since I convert the web page into a Mathematica expression with minimal extra syntax (unlike, e.g., XML objects), it is usually quite easy to write a small set of replacement rules that does the right post-processing in each given case. A final disclaimer: my parser is not robust and surely contains a number of bugs, so be warned.