
Using

library(htm2txt)
url <- 'https://en.wikipedia.org/wiki/Alan_Turing'
clear.text <- gettxt(url)

The output I'm getting is:

clear.text
[1] "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJump to search\n\n\"Turing\" redirects here. For other uses, see Turing (disambiguation).\n\nmathematician and computer scientist\n\nAlan Turing\n\nOBE FRS\n\nTuring aged 16\n\nBorn (1912-06-23)23 June 1912\n\nM...

I would like to store this data in a tidy object, like so:

tidy.text <- tidy(clear.text)

but I get

'tidy.character' is deprecated.

and the result is

# A tibble: 1 x 1
                                                                                 x
                                                                             <chr>
1 "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nJump to navigation\tJum
> 

How can I convert plain text like this to a tidy format?

Thank you in advance.

kwadratens
  • The output of `sessionInfo()` in a code block would be handy as well as all the necessary `library()` calls to reproduce your example. Also, _please_ consider using `textreadr::read_html` instead of that `htm2txt` package since that `htm2txt` package is super dangerous (it uses regular expressions to destroy HTML content and will likely end up hurting you in the long run) – hrbrmstr Nov 29 '18 at 15:10
  • What do you mean by a "tidy object"? I don't have `htm2txt` installed, but the deprecation warning says you're calling `tidy` on a character vector. What's the output you're trying to get? – camille Nov 29 '18 at 15:24

1 Answer


If you have a Wikipedia link or other HTML, the unnest_tokens() function in tidytext can parse and tidy it directly.

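library(tidytext)
library(tidyverse)

# Read the raw HTML of the page, one element per line
read_lines("https://en.wikipedia.org/wiki/Alan_Turing") %>%
  # tibble() replaces data_frame(), which is deprecated in newer tibble releases
  tibble(text = .) %>%
  # Tokenize the HTML into one lowercase word per row
  unnest_tokens(word, text, format = "html")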

#> # A tibble: 15,460 x 1
#>    word     
#>    <chr>    
#>  1 alan     
#>  2 turing   
#>  3 wikipedia
#>  4 this     
#>  5 is       
#>  6 a        
#>  7 good     
#>  8 article  
#>  9 follow   
#> 10 the      
#> # ... with 15,450 more rows

Created on 2018-12-18 by the reprex package (v0.2.1)
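
If you already have the plain text in a character vector, as with `clear.text` in the question, the same idea should work without the HTML step. The following is a minimal sketch, not a definitive recipe: the short string is only an illustrative stand-in for the output of `gettxt()`, and the column names `text` and `word` are arbitrary choices.

```r
library(tibble)
library(tidytext)

# Illustrative stand-in for the string returned by gettxt(url);
# shortened here so the example runs without a network connection
clear.text <- "Alan Turing\n\nFrom Wikipedia, the free encyclopedia\n\nmathematician and computer scientist"

# Put the character vector in a one-column tibble first:
# unnest_tokens() expects a data frame, not a bare character vector
text_df <- tibble(text = clear.text)

# Split the text into one lowercase word per row (the default tokenizer)
tidy.text <- unnest_tokens(text_df, word, text)

tidy.text
```

Passing `token = "lines"` or `token = "sentences"` to `unnest_tokens()` instead gives one row per line or per sentence, which may be closer to what you want if individual words are too fine-grained.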

Julia Silge
  • I know this is not the right place to ask, but I don't know how else to contact you. Could you kindly answer [this](https://datascience.stackexchange.com/questions/44894/how-to-switch-career-from-data-analyst-to-data-engineer-with-no-programming-expe) question regarding `Data Engineer` career guidance? – Shaiju T Feb 01 '19 at 06:43