Lately, I've been trying to store the source code of some pages so I can later scrape what I need from them without having to worry about an internet connection or possible anti-scraping measures. My first approach was to save the output of bs.prettify() for each link into a column of the same DataFrame. After a while, I realized I can't navigate the parse tree on those values (for example, accessing bs.h1), because they are plain strings. So, I wanted to know: is there a way to turn the string returned by bs.prettify() back into a navigable BeautifulSoup object, or is there a better way to keep the source code for later use than storing it in a DataFrame?

Juan C
-
I usually store the complete HTML in a .txt or .html file if I have to use it again later. – Ankur Sinha Sep 11 '18 at 12:43
-
And then I can run BeautifulSoup() over that string? – Juan C Sep 11 '18 at 12:44
-
Yes, using the relevant parser :) – Ankur Sinha Sep 11 '18 at 12:44
-
Nice! That was exactly what I was looking for. Thanks! – Juan C Sep 11 '18 at 12:46
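
For anyone landing here later, the approach from the comments could look roughly like the sketch below. It is only an illustration: the helper names save_html and load_soup, the file name page.html, and the sample markup are made up for the example, and the built-in "html.parser" is assumed as the parser (lxml or html5lib should work the same way if installed). In real use, raw_html would be the text of the downloaded page.

```python
from pathlib import Path
import tempfile

from bs4 import BeautifulSoup


def save_html(html: str, path: Path) -> None:
    """Write the raw page source to disk (hypothetical helper)."""
    path.write_text(html, encoding="utf-8")


def load_soup(path: Path) -> BeautifulSoup:
    """Read a saved page and parse it again into a navigable tree."""
    return BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")


# Stand-in for the downloaded page source (sample markup, not real data).
raw_html = "<html><body><h1>Example title</h1><p>Some text</p></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    page_path = Path(tmp) / "page.html"  # illustrative file name
    save_html(raw_html, page_path)
    soup = load_soup(page_path)

# The re-parsed object supports tree navigation again.
heading = soup.h1.get_text()

# A string produced by prettify() can be re-parsed the same way;
# strip=True drops the indentation that prettify() adds around the text.
reparsed = BeautifulSoup(soup.prettify(), "html.parser")
reparsed_heading = reparsed.h1.get_text(strip=True)
```

The same idea applies to strings already sitting in a DataFrame column: passing each one to BeautifulSoup() with a parser should give back an object where bs.h1 works. Saving the untouched source rather than the prettify() output keeps the original whitespace intact.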