Retrieve the file. We must use getURL() because the URL scheme is https; otherwise we could have used doc <- htmlParse(url) directly.
library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse() and xmlValue()
url <- "https://stat.ethz.ch/pipermail/r-help/2009-January/date.html"
jan14 <- getURL(url, ssl.verifypeer = FALSE)
htmlParse() parses the text that we have just retrieved. It is equivalent to htmlTreeParse() with useInternalNodes = TRUE, but easier to type.
doc <- htmlParse(jan14, asText=TRUE)
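As a small offline check of what asText = TRUE does, the sketch below parses a literal HTML string instead of a downloaded page. The string and the variable names are made up for illustration, and the sketch assumes the XML package is installed. It also spells out the longer htmlTreeParse() call with useInternalNodes = TRUE for comparison.

```r
library(XML)

# A made-up HTML string standing in for the text returned by getURL()
txt <- "<html><body><p>hello</p></body></html>"

# asText = TRUE tells the parser that 'txt' is the content itself,
# not a file name or URL
d1 <- htmlParse(txt, asText = TRUE)

# The longer spelling, asking explicitly for internal (C-level) nodes
d2 <- htmlTreeParse(txt, asText = TRUE, useInternalNodes = TRUE)

# Both are internal documents, which is what XPath queries operate on
inherits(d1, "XMLInternalDocument")
inherits(d2, "XMLInternalDocument")
xmlName(xmlRoot(d1))
```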
We do not need a regular expression to parse the text file; that would be error-prone and difficult. Instead we use XPath to identify the text value of italicized items inside lists; this is where the author names appear in the HTML.
who <- sapply(doc["//li/i/text()"], xmlValue)
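To see what the XPath expression selects without downloading anything, here is a minimal sketch on a hand-written fragment. The fragment only imitates the layout described above (list items with the author name in italics); the subjects, links and names are invented, and the real archive page may differ in detail.

```r
library(XML)

# Hypothetical fragment: each message is a list item, author name in <i>
html <- '<html><body><ul>
<li><a href="001.html">[R] a question</a> <i>Alice Example
</i></li>
<li><a href="002.html">[R] a reply</a> <i>Bob Sample
</i></li>
<li><a href="003.html">[R] another reply</a> <i>Alice Example
</i></li>
</ul></body></html>'

toy <- htmlParse(html, asText = TRUE)

# Select the text nodes inside <i> elements that sit inside <li> elements,
# then extract each node's value as a character string
toy_who <- sapply(toy["//li/i/text()"], xmlValue)
toy_who
```

Each element of toy_who still carries the newline that precedes the closing tag in the fragment, which is why a trimming step is needed before counting.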
The result, who, is a character vector of contributor names; the only unwanted characters are white space characters (including newlines) at the end of each element. A regular expression matching one or more white space characters at the end of a string is [[:space:]]+$; we can use sub() to replace each occurrence with nothing ("").
The table() function counts how many times each author contributed. sort() orders these counts from least to most frequent contributor. tail() returns the last entries (6 by default; we specify 10).
tail(sort(table(sub("[[:space:]]+$", "", who))), 10)
The result is
> tail(sort(table(sub("[[:space:]]+$", "", who))), 10)
           Greg Snow Henrique Dallazuanna       hadley wickham
                  35                   36                   40
       Marc Schwartz    Wacek Kusnierczyk          jim holtman
                  48                   55                   80
      Duncan Murdoch    Prof Brian Ripley      David Winsemius
                  84                   84                   93
  Gabor Grothendieck
                 116
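The one-line pipeline can also be unrolled into separate steps. The sketch below does this in base R on a small invented vector; the names and counts are not from the mailing list and serve only to show what each function contributes.

```r
# Made-up names with the kind of trailing white space described above
toy_names <- c("Ann Lee\n", "Bo Chan \n", "Ann Lee\n",
               "Cy Diaz\n", "Ann Lee  ", "Bo Chan\n")

clean  <- sub("[[:space:]]+$", "", toy_names)  # strip trailing white space
counts <- table(clean)                         # messages per author
ranked <- sort(counts)                         # ascending by count
top2   <- tail(ranked, 2)                      # the two most frequent authors
top2                                           # Bo Chan 2, Ann Lee 3
```

Without the sub() step, "Bo Chan \n" and "Bo Chan\n" would be counted as two different authors.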