extract table from webpage using R

Question

I am trying to extract all the table from this page using R, for html_node i had passed "table". In console the output is weird. Data is available in webpage but in R console it shows NA. Please suggest me where i had made mistake.

library(xml2)
library(rvest)
url <- "https://www.iii.org/table-archive/21110" 
page <- read_html(url) #Creates an html document from URL
table <- html_table(page, fill = TRUE) #Parses tables into data frames
table

part of Output: X4 X5 X6

  1  Direct premiums written (1) Market share (2)  1
    2             Market share (2)             <NA> NA
    3                        10.6%             <NA> NA
    4                          6.0             <NA> NA
    5                          5.4             <NA> NA
    6                          5.4             <NA> NA
    7                          5.2             <NA> NA
    8                          4.5             <NA> NA
    9                          3.3             <NA> NA
    10                         3.2             <NA> NA
    11                         3.0             <NA> NA
    12                         2.2             <NA> NA
                                           X7          X8    X9 X10
    1  State Farm Mutual Automobile Insurance $51,063,111 10.6%   2
    2                                    <NA>        <NA>  <NA>  NA
    3                                    <NA>        <NA>  <NA>  NA
    4                                    <NA>        <NA>  <NA>  NA
    5                                    <NA>        <NA>  <NA>  NA
    6                                    <NA>        <NA>  <NA>  NA
    7                                    <NA>        <NA>  <NA>  NA
    8                                    <NA>        <NA>  <NA>  NA
    9                                    <NA>        <NA>  <NA>  NA
    10                                   <NA>        <NA>  <NA>  NA
    11                                   <NA>        <NA>  <NA>  NA
    12                                   <NA>        <NA>  <NA>  NA

Instead of code image add your code in the question – IMParasharG Mar 10 '19 at 20:59 — IMParasharG, Mar 10 '19 at 20:59

AC0519 · Accepted Answer · 2019-03-11T11:52:47.600

This will get all of the tables into a single data frame:

library(tidyverse)
library(rvest)


url <-  "https://www.iii.org/table-archive/21110"

df <- url %>% 
  read_html() %>% 
  html_nodes("table") %>% 
  html_table(fill = T) %>% 
  lapply(., function(x) setNames(x, c("Rank", "Company", "Direct_premiums_written", 
                                      "Market_share")))

tables <- data.frame()

for (i in seq(2,18,2)) {
    temp <- df[[i]] 
  tables <- bind_rows(tables, temp)
}

You can then subset this however you want. For example, lets extract the information from the third table that represents 2009:

table_2009 <- tables[21:30,] %>% 
          mutate(Year = 2009)

To add all the years at once:

years <- c(2017, 2008, 2009, 2010, 2011, 2013, 2014, 2015, 2016)
tables <- tables %>% 
          mutate(Year = rep(years, each = 10))

Hope this helps.

How can i set specific year to each row? as you above for 2009 year, I need add year for the "tables", Like for first 10 rows value of year is 2017 and for next 10 year value is 2008 like that for 90 rows — azhagan, Mar 11 '19 at 10:09

score 1 · Answer 2 · answered Mar 10 '19 at 21:32

There are a couple of issues with these tables.

First, I think you'll get better results if you specify the class of table. In this case, .tablesorter.

Second, you'll note that in some tables the second column header is Group, in other cases it is Group/company. This is what causes the NA. So you need to rename the columns to be consistent for all tables.

You can get a list of tables with renamed column headers like this:

tables <- page %>% 
  html_nodes("table.tablesorter") %>% 
  html_table() %>% 
  lapply(., function(x) setNames(x, c("rank", "group_company", 
                                      "direct_premiums_written", "market_share")))

Looking at the web page we see that the tables are for years 2017, 2008 to 2011 and 2013 to 2016. So we could add these years as names to the list then bind the tables together with a column for year:

library(dplyr)
tables <- setNames(tables, c(2017, 2008:2011, 2013:2016)) %>% 
  bind_rows(.id = "Year")

I am getting below error while i am trying to bind year to tables. Error in setNames(tables, c(2017, 2008:2011, 2013:2016)) : 'names' attribute [9] must be the same length as the vector [8] — azhagan, Mar 11 '19 at 09:44
That's telling you there are more names (9) than tables. The list should be length 9 using the code given and your example URL. — neilfws, Mar 11 '19 at 09:57

IRTFM · Answer 3 · 2019-03-10T21:45:09.900

There are multiple items in the list that you have named table. (Not a good practice: there's a function by that name.)

str(tbl)
List of 18
 $ :'data.frame':   12 obs. of  45 variables:
  ..$ X1 : chr [1:12] "Rank\nGroup/company\nDirect premiums written (1)\nMarket share (2)\n1\nState Farm Mutual Automobile Insurance\n"| __truncated__ "Rank" "1" "2" ...
  ..$ X2 : chr [1:12] "Rank" "Group/company" "State Farm Mutual Automobile Insurance" "Berkshire Hathaway Inc." ...
  ..$ X3 : chr [1:12] "Group/company" "Direct premiums written (1)" "$64,892,583" "38,408,251" ...
snippped rest of long output

Perhaps you only want the last one?

tbl[[18]]
   Rank                            Group/company
1     1   State Farm Mutual Automobile Insurance
2     2                  Berkshire Hathaway Inc.
3     3                           Liberty Mutual
4     4                           Allstate Corp.
5     5                        Progressive Corp.
6     6                 Travelers Companies Inc.
7     7                               Chubb Ltd.
8     8                  Nationwide Mutual Group
9     9 Farmers Insurance Group of Companies (3)
10   10                     USAA Insurance Group
   Direct premiums written (1) Market share (2)
1                  $62,189,311            10.2%
2                   33,300,439              5.4
3                   32,217,215              5.3
4                   30,875,771              5.0
5                   23,951,690              3.9
6                   23,918,048              3.9
7                   20,786,847              3.4
8                   19,756,093              3.2
9                   19,677,601              3.2
10                  18,273,675              3.0

Nope; going back to the page it's clear you want the first, but its structure appears to have been misinterpreted and the data has been arranged as "wide", with all the data residing in the first row. So some of the columns are being displayed and the rest of the data seems to be messed up; Just take columns 2:4:

tbl[[1]][ ,c('X2','X3','X4')]
                                         X2                          X3
1                                      Rank               Group/company
2                             Group/company Direct premiums written (1)
3    State Farm Mutual Automobile Insurance                 $64,892,583
4                   Berkshire Hathaway Inc.                  38,408,251
5                            Liberty Mutual                  33,831,726
6                            Allstate Corp.                  31,501,664
7                         Progressive Corp.                  27,862,882
8                  Travelers Companies Inc.                  24,875,076
9                                Chubb Ltd.                  21,266,737
10                     USAA Insurance Group                  20,151,368
11 Farmers Insurance Group of Companies (3)                  19,855,517
12                  Nationwide Mutual Group                  19,218,907
                            X4
1  Direct premiums written (1)
2             Market share (2)
3                        10.1%
4                          6.0
5                          5.3
6                          4.9
7                          4.3
8                          3.9
9                          3.3
10                         3.1
11                         3.1
12                         3.0

extract table from webpage using R

3 Answers3

Linked

Related