I am working in RStudio (2023.03.0+386 "Cherry Blossom" Release) and trying to read a page with readLines() from an HTTP address that I know is correct.

The code is as follows:

con <- url("http://biostat.jhsph.edu/~jleek/contact.html")
htmlCode <- readLines(con)
close(con)

And the error I get is:

Error in readLines(con) : 
    cannot open the connection to 'https://biostat.jhsph.edu/~jleek/contact.html'
In addition: Warning message:
  In readLines(con) :
    URL 'https://biostat.jhsph.edu/~jleek/contact.html': status was 'SSL connect error'

Following is the sessionInfo() output:

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RMySQL_0.10.25 DBI_1.1.3      sqldf_0.4-11   RSQLite_2.3.1  gsubfn_0.7     proto_1.0.0    httpuv_1.6.9
[8] httr_1.4.5     readr_2.1.4

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10      rstudioapi_0.14  magrittr_2.0.3   hms_1.1.3        bit_4.0.5        R6_2.5.1
 [7] rlang_1.1.0      fastmap_1.1.1    fansi_1.0.4      blob_1.2.4       tcltk_4.2.3      tools_4.2.3
[13] utf8_1.2.3       cli_3.6.0        bit64_4.0.5      tibble_3.2.0     lifecycle_1.0.3  tzdb_0.3.0
[19] later_1.3.0      vctrs_0.6.0      promises_1.2.0.1 cachem_1.0.7     memoise_2.0.1    glue_1.6.2
[25] compiler_4.2.3   pillar_1.9.0     chron_2.3-60     pkgconfig_2.0.3
jay.sf
  • Please take a moment to read the [formatting help](https://stackoverflow.com/editing-help). Furthermore, the RStudio version is really only relevant when the issue is related to *RStudio*, which is unlikely to be the case here. The *R* version itself might be more relevant (though probably not in this case). – Konrad Rudolph Apr 22 '23 at 17:22
  • This is my first post, so thanks for the advice on formatting. R version is 4.2.3 (2023-03-15 ucrt). – Clifton Bell Apr 22 '23 at 17:25
  • Could you [edit](https://stackoverflow.com/posts/76080937/edit) your question by adding output of `sessionInfo()` from a fresh R session? – jay.sf Apr 22 '23 at 18:16
  • I added the sessionInfo() – Clifton Bell Apr 22 '23 at 18:31

1 Answer

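Your code works fine for me, but I'm running Linux, so I can't reproduce the error. The 'SSL connect error' status suggests the TLS handshake with the server is failing on your machine. Perhaps you need to install OpenSSL.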
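To narrow this down, it may help to check which libcurl and TLS library your R build reports. The following is only a rough sketch using base R; the helper name tls_diagnostics is made up for illustration, and the values it prints will differ between machines.

```r
# Hypothetical helper (not part of base R): collect the settings that
# influence how url() and download.file() open https:// addresses.
tls_diagnostics <- function() {
  v <- libcurlVersion()  # character string with attributes such as "ssl_version"
  list(
    libcurl_available = unname(capabilities("libcurl")),
    libcurl_version   = as.character(v),
    ssl_backend       = attr(v, "ssl_version"),
    url_method        = getOption("url.method", default = "default"),
    download_method   = getOption("download.file.method", default = "auto")
  )
}

# Print a compact overview of the collected settings.
str(tls_diagnostics())
```

If url_method and download_method differ, that might explain why one function can reach the site while the other cannot, though this is a guess rather than a confirmed diagnosis.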

You could try a different method in url():

con <- url("https://biostat.jhsph.edu/~jleek/contact.html", method='libcurl')
htmlCode <- readLines(con)
close(con)
head(htmlCode, 5)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
# [2] ""                                                                                                                 
# [3] "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">"                                        
# [4] ""                                                                                                                 
# [5] "<head>"    

or skip url() and pass the address to readLines() directly:

htmlCode <- readLines('https://biostat.jhsph.edu/~jleek/contact.html')
head(htmlCode, 1)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"

or, as a workaround, download the file first and then read it (note that download.file() also has a method argument):

tmp <- tempfile()
download.file('https://biostat.jhsph.edu/~jleek/contact.html', tmp)  
htmlCode <- readLines(tmp)
unlink(tmp)
head(htmlCode, 1)
# [1] "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
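If you want to keep the direct read where it works and only fall back to the download workaround when it fails, the two can be combined. This is a minimal sketch, not a tested fix for the Windows problem: read_lines_with_fallback and its reader and downloader arguments are names I made up, and extra arguments such as method are simply passed through to download.file().

```r
# Hypothetical helper: try a direct read first; on error, download the
# resource to a temporary file and read that instead.
read_lines_with_fallback <- function(address,
                                     reader = readLines,
                                     downloader = download.file,
                                     ...) {
  tryCatch(
    # Warnings from the failed attempt are silenced; the error still
    # reaches the handler below.
    suppressWarnings(reader(address)),
    error = function(e) {
      tmp <- tempfile()
      on.exit(unlink(tmp), add = TRUE)  # remove the temp file when the handler exits
      downloader(address, tmp, quiet = TRUE, ...)
      readLines(tmp)
    }
  )
}
```

Usage would look like read_lines_with_fallback('https://biostat.jhsph.edu/~jleek/contact.html'), or with method = 'libcurl' added to force a particular download method in the fallback step.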

Or use a package, e.g. RCurl together with XML:

XML::htmlTreeParse(RCurl::getURL('https://biostat.jhsph.edu/~jleek/contact.html'))$children$html
# <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
#   <head>
#   <meta name="Description" content="Welcome to Jeff Leek&apos;s Research Group"/>
# ...
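Since httr is already attached in your session, that is another package route to get the page as a character vector of lines. This is a sketch under the assumption that httr can reach the site from your machine, which I can't verify; split_lines and fetch_lines_httr are illustrative names, not httr functions.

```r
# Split a response body into lines, accepting both LF and CRLF line endings.
split_lines <- function(text) strsplit(text, "\r?\n")[[1]]

# Hypothetical wrapper around httr: fetch the page and return its lines.
fetch_lines_httr <- function(address) {
  resp <- httr::GET(address)
  httr::stop_for_status(resp)  # turn HTTP 4xx/5xx responses into an R error
  split_lines(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Example call (needs network access and the httr package):
# htmlCode <- fetch_lines_httr('https://biostat.jhsph.edu/~jleek/contact.html')
```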

Hope this helps.

jay.sf
  • I greatly appreciate the response. Of the suggestions above, the one that works for me is to download.file() first and then read it. None of the variations of using readLines() directly on the html link work; not sure why. But the workaround was helpful. – Clifton Bell Apr 24 '23 at 21:10