
We want to know the 10 most frequent posters to the R-help listserv for January 2014. I have used getURL() to retrieve the data from the secure ETHZ site.

library("RCurl")
library("XML")
jan14 <- getURL("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html",
                ssl.verifypeer = FALSE)
1) How can I parse jan14 using htmlTreeParse()?
2) How can I use regular expressions to pull out the author lines and delete unwanted characters from them?
oguz ismail

2 Answers


Or you could do it in a much more readable fashion with rvest (which, when this was written, ultimately relied on RCurl) and CSS selectors:

library(rvest)

# read_html() is the current name for what early rvest versions called html()
jan14 <- read_html("https://stat.ethz.ch/pipermail/r-help/2009-January/date.html")

authors <- jan14 %>% 
  html_nodes("li>i") %>% # CSS selector for <i> that is a direct child of <li>
  html_text() %>%        # get the text
  gsub("\\n", "", .)     # remove the newline for each author

tail(sort(table(authors)))

## authors
##  Wacek Kusnierczyk        jim holtman     Duncan Murdoch  Prof Brian Ripley 
##                 55                 80                 84                 84 
##    David Winsemius Gabor Grothendieck 
##                 93                116 
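To see what the `li>i` selector picks up without touching the network, here is a minimal sketch on a made-up fragment. The fragment, its hrefs, and the author names are assumptions for illustration only; they are shaped roughly like an archive index but are not copied from the real page.

```r
library(rvest)

# Hypothetical fragment standing in for the archive index; not real data
snippet <- '<ul>
  <li><a href="000001.html">[R] question one</a> <i>Alice Example
</i></li>
  <li><a href="000002.html">[R] question two</a> <i>Bob Example
</i></li>
  <li><a href="000003.html">[R] re: question one</a> <i>Alice Example
</i></li>
</ul>'

page <- read_html(snippet)   # a string starting with "<" is parsed as literal HTML

toy_authors <- page %>%
  html_nodes("li>i") %>%     # <i> elements that are direct children of <li>
  html_text() %>%            # text content, still ending in a newline
  gsub("\\n", "", .)         # drop the newline

table(toy_authors)
```

The link text inside `<a>` is ignored because the selector only matches `<i>` children, which is why no regular expression is needed to separate subjects from authors.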

And, we can even add some dplyr and ggplot for good measure:

library(dplyr)
library(ggplot2)

dat <- data.frame(table(authors)) %>% arrange(-Freq)

gg <- ggplot(dat[1:25,], aes(x=reorder(authors, Freq), y=Freq))
gg <- gg + geom_bar(stat="identity")
gg <- gg + scale_x_discrete(expand=c(0,0))
gg <- gg + scale_y_continuous(expand=c(0,0))
gg <- gg + labs(x=NULL, y="# Posts", title="Top 25 Posters to R-help (Jan 2009)")
gg <- gg + coord_flip()
gg <- gg + theme_bw()
gg <- gg + theme(panel.grid=element_blank())
gg <- gg + theme(panel.border=element_blank())
gg

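[Figure: horizontal bar chart, "Top 25 Posters to R-help (Jan 2009)", with posters on the vertical axis and # Posts on the horizontal axis]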

hrbrmstr

Retrieve the file. We must use getURL() because the URL scheme is https; for a plain http URL we could have used doc <- htmlParse(url) directly.

url <- "https://stat.ethz.ch/pipermail/r-help/2009-January/date.html"
jan14 <- getURL(url, ssl.verifypeer = FALSE)

htmlParse() parses the text that we have just retrieved. It is equivalent to htmlTreeParse() with useInternalNodes = TRUE, but easier to type.

doc <- htmlParse(jan14, asText=TRUE)
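Since the question asks specifically about htmlTreeParse(), here is a small sketch of the two calls side by side. The one-line HTML string is a hypothetical stand-in for the downloaded page, used only so the example runs offline.

```r
library(XML)

# Hypothetical document used only for illustration
toy_html <- "<html><body><ul><li><i>Alice Example</i></li></ul></body></html>"

# htmlParse() defaults to an internal (C-level) document ...
doc_a <- htmlParse(toy_html, asText = TRUE)

# ... and htmlTreeParse() produces the same kind of object when asked to
doc_b <- htmlTreeParse(toy_html, asText = TRUE, useInternalNodes = TRUE)

identical(class(doc_a), class(doc_b))
```

An internal document is what the XPath query in the next step needs, so with htmlTreeParse() the useInternalNodes = TRUE argument should not be left out.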

We do not need a regular expression to parse the text file; this would be error-prone and difficult. Instead we use XPath to identify the text value of italicized items inside lists; this is where the author names appear in the HTML.

who <- sapply(doc["//li/i/text()"], xmlValue)
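To check what the XPath expression `//li/i/text()` selects, it may help to try it on a tiny made-up fragment first. The markup and names below are assumptions for illustration; xpathSApply() is used here as a one-call alternative to the sapply() form above.

```r
library(XML)

# Hypothetical fragment; only the <i> text inside each <li> should be selected
snippet <- '<ul>
<li><a href="000001.html">[R] a subject</a> <i>Alice Example
</i></li>
<li><b>not an author</b> <i>Bob Example
</i></li>
</ul>'

toy_doc <- htmlParse(snippet, asText = TRUE)

# Run the XPath query and apply xmlValue() to every matching text node
toy_who <- xpathSApply(toy_doc, "//li/i/text()", xmlValue)
toy_who
```

The `<a>` and `<b>` contents are skipped because the path only descends through `li` and then `i`; the values still carry their trailing newline, which the next step removes.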

who is a character vector of contributor names; the only 'unwanted' characters are white space characters (including new lines) at the end of each element. A regular expression matching one or more white space characters at the end of a string is [[:space:]]+$; we can use sub() to replace that match in each element with nothing (""). The table() function creates a table that counts how many times each author contributed. sort() takes this result and orders the authors from least to most frequent contributor. tail() returns the last entries (6 by default; we specify 10).

tail(sort(table(sub("[[:space:]]+$", "", who))), 10)

The result is

> tail(sort(table(sub("[[:space:]]+$", "", who))), 10)

           Greg Snow Henrique Dallazuanna       hadley wickham 
                  35                   36                   40 
       Marc Schwartz    Wacek Kusnierczyk          jim holtman 
                  48                   55                   80 
      Duncan Murdoch    Prof Brian Ripley      David Winsemius 
                  84                   84                   93 
  Gabor Grothendieck 
                 116 
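The same sub(), table(), sort(), tail() chain can be followed step by step on a handful of made-up strings. This is only a sketch in base R; the names are hypothetical and chosen so the counts are easy to verify by eye.

```r
# Hypothetical author strings with trailing whitespace, for illustration only
toy_who <- c("Alice Example\n", "Bob Example \n", "Alice Example\n",
             "Carol Example\n", "Alice Example\n", "Bob Example\n")

cleaned <- sub("[[:space:]]+$", "", toy_who)  # strip trailing whitespace
counts  <- sort(table(cleaned))               # counts, ascending by frequency
tail(counts, 2)                               # the two most frequent posters
```

Note that "Bob Example \n" and "Bob Example\n" collapse into one name only because the pattern removes all trailing white space, not just the newline.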
Martin Morgan
  • @johnB I added explanation text. – Martin Morgan Dec 03 '14 at 01:55
  • Thank you so much. You explained it brilliantly. One last question: how could I list the 15 most frequent posters without displaying the frequency? Could you provide the code? @martin –  Dec 03 '14 at 02:18
  • @johnB maybe you can guess, based on the code above? try out your guess. – Martin Morgan Dec 03 '14 at 02:18
  • @martin The table() function gives us the counts, but since we need to list the most frequent posters without the frequencies, table() alone will not help. –  Dec 03 '14 at 02:22
  • @johnB Here is a named vector `x = c(a=1, b=2)`. Investigate the `names()` function and use it to extract the names of the elements; apply your insights to the problem. – Martin Morgan Dec 03 '14 at 02:25
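One way the names() hint might play out is sketched below. The toy counts are hypothetical; on the real data one would apply the same idea to the sorted table from the answer and ask for 15 entries instead of 2.

```r
x <- c(a = 1, b = 2)   # the named vector from the comment
names(x)               # "a" "b"

# Hypothetical posts standing in for the cleaned author vector
toy_counts <- sort(table(c("Alice", "Bob", "Alice", "Carol", "Alice", "Bob")))

# Take the names first, then keep the last two: names only, no frequencies
top_two <- tail(names(toy_counts), 2)
rev(top_two)           # most frequent first
```

A sorted table behaves like a named vector here, so names() returns the authors in the same least-to-most order as the counts.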