For this website: https://www.coinopsy.com/dead-coins/, I'm using R and the rvest package to scrape names, summaries, and similar information to build my own dataset. I've done this successfully with other websites, but this one is odd.

In previous projects I used SelectorGadget to find the CSS selectors, but here html_nodes and html_text return an empty character vector. I don't know whether that's because the website is structured in a totally different way.

An example of the HTML:

<td class="all sorting_1"><a class="coin_name" href="007coin">007Coin </a></td>

<a class="coin_name" href="007coin">007Coin </a>

library(rvest)

url <- "https://www.coinopsy.com/dead-coins/"

webpage <- read_html(url)

Item_html <- html_nodes(webpage,'.coin_name')

Item <- html_text(Item_html)

> Item

character(0)

Can someone help me out on this issue?

MAXWILL

2 Answers

If you disable JavaScript in the browser, you will see that the table content is not loaded. If you then inspect the HTML, you will see the data is stored in a script tag, presumably to be loaded into the table when JavaScript runs in the browser. JavaScript doesn't run with the method you are using, because read_html only fetches the static HTML, which is why html_nodes finds nothing. You can instead extract the JavaScript array of arrays from the response HTML with a regex and then parse it into a data frame. I am new to R, so I am still looking into how to do the parsing step in R; the R code here gets as far as the string, and I will update it if my research yields something. A full Python example is included at the end.

library(rvest)
library(stringr)
library(magrittr)

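url <- 'https://www.coinopsy.com/dead-coins/'
# Fetch the static HTML and flatten the body, including script contents, to one string
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
# Capture everything between "var table_data = " and the first semicolon
data <- str_match_all(r, 'var table_data = (.*?);')
data <- data[[1]][, 2]  # string representation of list of lists
# Remaining step: convert the string to an object
# Remaining step: convert the object to a data frame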

In Python, the ast library makes the conversion easy: ast.literal_eval parses the string into nested lists, and the result of the code below is the table you see on the page.

import requests
import re
import ast
import pandas as pd

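r = requests.get('https://www.coinopsy.com/dead-coins/')
# Capture the array assigned to table_data, up to the first semicolon
p = re.compile(r'var table_data = (.*?);')
data = p.findall(r.text)[0]
# Evaluate the string as a Python literal (a list of lists) without executing code
listings = ast.literal_eval(data)
df = pd.DataFrame(listings)
print(df)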
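To try the same regex-plus-literal_eval idea without a network call or pandas, here is a minimal sketch that runs against an inline string. SAMPLE_HTML and its field values are made-up placeholders (only the 007Coin and bigone-token identifiers appear in this thread), extract_table_data is a hypothetical helper name, and the column names are the ones listed in the R headers vector in the edit further down. It also assumes the array is written with quoted strings that Python can read as literals and that no field contains a semicolon.

```python
import ast
import re

# Made-up stand-in for the page source; the real page embeds a much larger array.
SAMPLE_HTML = """
<html><body>
<table id="coins"></table>
<script>
var table_data = [['1', '007Coin', 'placeholder summary, with a comma', 'start', 'end', 'founder', '007coin'], ['2', 'BigONE Token', 'placeholder summary', 'start', 'end', 'founder', 'bigone-token']];
</script>
</body></html>
"""

# Column names as listed in the R `headers` vector.
COLUMNS = ["Column To Drop", "Name", "Summary", "Project Start Date",
           "Project End Date", "Founder", "urlId"]


def extract_table_data(html):
    """Return the rows assigned to `var table_data` as a list of dicts."""
    # Non-greedy match: stops at the first semicolon after the assignment.
    match = re.search(r"var table_data = (.*?);", html, re.DOTALL)
    if match is None:
        raise ValueError("table_data assignment not found")
    rows = ast.literal_eval(match.group(1))  # list of lists of strings
    return [dict(zip(COLUMNS, row)) for row in rows]


records = extract_table_data(SAMPLE_HTML)
for record in records:
    print(record["Name"], record["urlId"])
```

Because literal_eval understands the quoting, a summary that contains a comma stays in one field. Each record maps a column name to a value, so the placeholder first column can be dropped by name afterwards.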

Edit:

So far I can't find an R library that does the conversion I mentioned. Below is an ugly way of splitting the string and combining the rows, and it feels inefficient. I would welcome suggestions for improvements (though that may be a question for Code Review later). I'm still looking at this, so I will update.

library(rvest)
library(stringr)
library(magrittr)

url = 'https://www.coinopsy.com/dead-coins/'
headers <- c("Column To Drop","Name","Summary","Project Start Date","Project End Date","Founder","urlId")
# https://www.coinopsy.com/dead-coins/bigone-token/  where bigone-token is urlId

r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
data <- str_match_all(r,'var table_data = (.*?);')
data <- data[[1]][,2]

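# Drop the outer brackets, then capture the contents of each inner [...] row
z <- substr(data, start = 2, stop = nchar(data)-1) %>% str_match_all(., "\\[(.*?)\\]")
z <- z[[1]][,2]

# Split each row on commas, keep fields 2 to 7, and strip quotes and whitespace.
# Caveat: a field that itself contains a comma will be split in two.
rows <- lapply(z, function(row) {
  fields <- strsplit(row, ",")[[1]][2:7]
  trimws(sub("'(.*?)'", "\\1", fields))
})
# Bind all rows at once instead of growing df inside a loop
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- headers[2:7]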
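One thing to watch with splitting on every comma is a field, such as a summary, that contains a comma itself. The sketch below compares the two approaches in Python on a single made-up row. ROW_TEXT, split_naive and split_quoted are hypothetical names, and the row assumes single-quoted fields like the ones the R regex strips.

```python
import ast

# Made-up row in the shape of one inner array; the third field contains a comma.
ROW_TEXT = "'1', '007Coin', 'placeholder summary, with a comma', 'start', 'end', 'founder', '007coin'"


def split_naive(row_text):
    """Split on every comma, similar to strsplit in the R loop, then strip quotes."""
    return [part.strip().strip("'") for part in row_text.split(",")]


def split_quoted(row_text):
    """Parse the row as a Python literal so commas inside quotes are preserved."""
    return list(ast.literal_eval("[" + row_text + "]"))


naive = split_naive(ROW_TEXT)
quoted = split_quoted(ROW_TEXT)
print(len(naive), naive[2])
print(len(quoted), quoted[2])
```

The naive split yields one extra field and shifts every later column, while the quote-aware parse keeps the seven fields aligned.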

QHarr
  • Is this along the lines of what you were after? – QHarr Jul 18 '19 at 13:00
  • Sorry for the late response. Your code worked perfectly, and I don't think it's an ugly way at all; it's pretty much what we can do right now. I have some questions about understanding your code and may bother you again once I take a deeper look at it. Best wishes. – MAXWILL Jul 23 '19 at 00:16

Maybe this will help someone. I had the same problem, and in my case the solution was to put the tag name at the start of the selector, followed by the "." and the class name. In your case you want to target the class named coin_name, but when you pass that class to html_nodes you don't specify the tag, which is the same thing I did. To solve it, I only had to add the tag, which in your case is the "a" tag, so it would look like this:

Item_html <- html_nodes(webpage,'a.coin_name')

That way the html_nodes function would not return an empty result. I know you already solved it, but I hope this helps someone.

Deither