
I am trying to use a Python package (newspaper) inside a Shiny app to extract the main text from a webpage: https://newspaper.readthedocs.io/en/latest/

By main text I mean the body of the article, without any ads, links, etc. (very similar to the "reader view" in Safari on iPhone).

To my knowledge, there is no similar package in R; if you know of one, please let me know.

The goal of this app is to let the user enter a web address, click Submit, and get the clean text as output.

Please find the code and the error message below. I am using RStudio Cloud.

This is the error:

Using virtual environment 'python3_env' ...
Warning in system2(python, c("-c", shQuote(command)), stdout = TRUE, stderr = TRUE) :
  running command ''/cloud/project/python3_env/bin/python' -c 'import sys; import pip; sys.stdout.write(pip.__version__)' 2>&1' had status 1
Warning in if (idx != -1) version <- substring(version, 1, idx - 1) :
  the condition has length > 1 and only the first element will be used
Warning: Error in : invalid version specification ‘’, ‘  ’
  52: stop
  51: .make_numeric_version
  50: numeric_version
  49: pip_version
  48: reticulate::virtualenv_install
  47: server [/cloud/project/python in shiny.R#42]
Error : invalid version specification ‘’, ‘  ’

and this is the code:

# Python webpage scraper followed by r summary: 


library(shiny)
library(reticulate)


ui <- fluidPage(
  
  sidebarLayout(
    sidebarPanel(
      
      textInput("web", "Enter URL:"),
      
      actionButton("act", "Submit") 
      
    ),
    
    
    mainPanel(br(),
              tags$head(tags$style(HTML("pre { white-space: pre-wrap; word-break: keep-all; }"))),
              verbatimTextOutput("nText"), 
              br() 
    )
  )
)


 
server <- function(input, output){
  
  #1) Add python env and packages: 
 
  reticulate::virtualenv_install('python3_env', packages = c('newspaper3k', 'nltk')) 
  
  
  py_run_string("from newspaper import Article")
  py_run_string("import nltk")
  py_run_string("nltk.download('punkt')")
  
  
  
  #2) Pull the webpage url: 
  webad <- eventReactive(input$act, {
    req(input$web)
    input$web
  })
  
  
 
  
  
  observe({
    
    py$webadd <- webad()  # call the reactive to get the URL string
    
    py_run_string("article = Article(webadd)")  # pass the variable, not the literal 'webadd'
    
    py_run_string("article.download()")
    py_run_string("article.parse()")
    py_run_string("article.nlp()")
    py_run_string("ztext = article.text")
    
    
    
    
    output$nText <- renderPrint({
      py$ztext  # read the Python variable back into R
    })
    
    
  })
  
}


 

shinyApp(ui = ui, server = server)
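
For reference, here is a minimal sketch of just the Python side, written as a single function rather than a series of py_run_string calls. It assumes newspaper3k is already installed in the active environment and uses only Article(url), download(), parse(), and the text attribute. The function name extract_main_text and the article_cls parameter are hypothetical, introduced here for illustration (the parameter only makes the function checkable without network access). This does not address the pip version error above.

```python
def extract_main_text(url, article_cls=None):
    """Return the main article text for `url`.

    `article_cls` is a hypothetical hook for testing; when omitted,
    newspaper's Article class is used.
    """
    url = (url or "").strip()
    if not url:
        raise ValueError("url must be a non-empty string")

    if article_cls is None:
        # Imported lazily so this file can be loaded before
        # newspaper3k is installed in the environment.
        from newspaper import Article
        article_cls = Article

    article = article_cls(url)
    article.download()  # fetch the page HTML
    article.parse()     # extract the article body
    return article.text
```

If this were saved to a file (say extract.py, a made-up name), reticulate::source_python("extract.py") should expose extract_main_text to R, so the server could call extract_main_text(webad()) inside renderPrint instead of passing values through py$ variables.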
Bahi8482
  • Out of the box, *newspaper3k* has some limitations. Each website has a different structure, so some will work flawlessly and others not so much. Please provide a couple of URLs and I will see how to extract the content with *newspaper3k*. – Life is complex Apr 12 '21 at 14:19
  • @Lifeiscomplex Thanks for looking at my question. I agree that this package is unlikely to fit every website's structure. However, for this particular task I need something generic, even if it does not work on some sites. What I was hoping to get help with is the appropriate code for using this package within a Shiny app. Thanks again. – Bahi8482 Apr 23 '21 at 22:27
  • I'm not familiar with *shiny* or *r*, but I'm very familiar with newspaper. You would have to provide a couple of URLs for me to see how newspaper handles them. If newspaper doesn't work, there are other modules that would, but you might have to build something from scratch. – Life is complex Apr 23 '21 at 22:40
