
I have a list of web pages that I want to save as PDFs, for example http://nptel.ac.in/courses/115103028/module1/lec1/3.html. The list is very long, so I am using Python to automate the process. This is my code:

import pdfkit
import urllib2

page = urllib2.urlopen('http://nptel.ac.in/courses/115103028/module1/lec1/3.html')

page_content = page.read()

with open('page_content.html', 'w') as fid:
    fid.write(page_content)

txt=open("page_content.html").read().split("\n")

txt1=""
for i in txt:
    if not ".html" in i:
        txt1+=i+"\n"

with open('page_content.html',"w") as f:
    f.write(txt1)


config = pdfkit.configuration(wkhtmltopdf="C:\Program Files (x86)\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_file('page_content.html', 'out.pdf',configuration=config)

But the output PDF doesn't contain any of the equation images, just text. How do I solve this? Also, I'm opening the file a second time to remove the numbers from the top and bottom of the web page; suggestions for improving that part are welcome too.

EDIT:

This is the code I'm using now:

import os.path,pdfkit,bs4,urllib2,sys  
reload(sys)  
sys.setdefaultencoding('utf8')
url = 'http://nptel.ac.in/courses/115103028/module1/lec1/3.html'

directory, filename = os.path.split(url)

html_text = urllib2.urlopen(url).read()

html_text = html_text.replace('src="', 'src="'+directory+"/").replace('href="', 'href="'+directory+"/")

page = bs4.BeautifulSoup(html_text, "html5lib")
for ul in page.findAll("ul", {"id":"pagin"}):
    ul.extract() # Deletes the tag and everything inside it

html_text = str(page)
config = pdfkit.configuration(wkhtmltopdf="C:\Program Files (x86)\wkhtmltopdf\\bin\\wkhtmltopdf.exe")
pdfkit.from_string(html_text, "out.pdf", configuration=config)

It still shows those errors (part of the error message is below), and the output PDF does not contain any images.

Loading pages (1/6)
Warning: Failed to load http://nptel.ac.in/courses/115103028/css/style.css (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image041.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image042.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image043.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image045.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image046.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image048.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image049.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image050.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image051.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image052.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image053.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image054.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image055.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image056.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image057.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image064.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image065.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image067.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image068.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image069.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image070.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image071.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image072.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image073.png (ignore)
Warning: Failed to load http://nptel.ac.in/courses/115103028/module1/lec1/images/image074.png (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/1h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/2h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/3h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/4h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/5h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/6h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/7h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/8h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/9h.jpg (ignore)
Warning: Failed to load file:///C:/Users/KOUSHI~1/AppData/images/10h.jpg (ignore)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
Eular
  • I also get these `1h.jpg`, `2h.jpg`, etc. warnings, but a look into the HTML source shows that these images are preloaded but never used. Pretty weird, perhaps someone else knows more about this? – BurningKarl Apr 01 '17 at 16:34
  • I don't get the other warnings. The other images load without any problems. What about the final `out.pdf`? Does it contain some of the images, all of them, or none? – BurningKarl Apr 01 '17 at 16:38
  • I have edited to give the complete error msg and the pdf does not have any images. – Eular Apr 01 '17 at 18:23

1 Answer

When I run your code pdfkit outputs a lot of warnings, which look like this:

Warning: Failed to load file:///C:/Users/.../images/image041.png (ignore)

pdfkit tries to find the website's images on my computer, and because I didn't download them, they cannot be found. A small hack around that problem is to convert the relative paths in the HTML source code to absolute URLs:

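import os.path
import urllib2  # Python 2, as in the question

url = 'http://nptel.ac.in/courses/115103028/module1/lec1/3.html'

# Split the URL into the directory part and the file name ("3.html")
directory, filename = os.path.split(url)

html_text = urllib2.urlopen(url).read()

# Prefix every src/href value with the directory URL.
# Note: this also prefixes links that are already absolute.
html_text = html_text.replace('src="', 'src="'+directory+"/") \
                     .replace('href="', 'href="'+directory+"/")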

Here directory is the URL of the folder that contains the page; in this example it is http://nptel.ac.in/courses/115103028/module1/lec1. This way

<img src="images/image041.png" width="63" height="21">

becomes

<img src="http://nptel.ac.in/courses/115103028/module1/lec1/images/image041.png" width="63" height="21">
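The plain string replacement above also prefixes links that are already absolute. A slightly more careful variant can be sketched with `urljoin`, which resolves each value against the page URL and leaves absolute URLs alone. This is only a sketch under assumptions: it is written for Python 3 (where `urljoin` lives in `urllib.parse`; in Python 2 it is in the `urlparse` module), the helper name `absolutize` is made up for illustration, and the regular expression only handles double-quoted `src` and `href` attributes.

```python
import re
from urllib.parse import urljoin

# Matches src="..." or href="..." (double quotes only); the lookbehind
# avoids touching attributes such as data-src. Illustrative, not exhaustive.
ATTR_PATTERN = re.compile(r'(?<![\w-])(src|href)="([^"]*)"')


def absolutize(html_text, page_url):
    """Return html_text with relative src/href values resolved against page_url.

    `absolutize` is a hypothetical helper name, not part of pdfkit or bs4.
    """
    def fix(match):
        attr, value = match.group(1), match.group(2)
        # urljoin keeps absolute URLs unchanged and resolves "../" segments
        return '{}="{}"'.format(attr, urljoin(page_url, value))

    return ATTR_PATTERN.sub(fix, html_text)


page_url = 'http://nptel.ac.in/courses/115103028/module1/lec1/3.html'
sample = '<img src="images/image041.png" width="63" height="21">'
fixed = absolutize(sample, page_url)
```

With the sample tag from above, `fixed` holds the same `img` tag with the full `http://nptel.ac.in/courses/115103028/module1/lec1/images/image041.png` address in `src`.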

Now you can use pdfkit.from_string instead of pdfkit.from_file to create the PDF without writing a temporary HTML file:

pdfkit.from_string(html_text, "out.pdf", configuration=config)

To remove the links to the other pages (which are displayed as numbers) from the top and bottom of the site you have a ton of possibilities. My favorite one is using BeautifulSoup to find the ul tags with id="pagin". These tags contain the links to the other pages and you can just delete them:

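import bs4

# Name the parser explicitly; "html.parser" ships with Python
page = bs4.BeautifulSoup(html_text, "html.parser")
for ul in page.find_all("ul", {"id":"pagin"}):
    ul.extract() # Removes the tag and everything inside it from the tree

# Back to a string for pdfkit (unicode() is Python 2; use str() in Python 3)
html_text = unicode(page)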

And now html_text does not contain those unwanted links anymore. To install BeautifulSoup just use pip: python -m pip install beautifulsoup4

This solution obviously only works if all your pages are structured this way. If they are not, you could also delete all a tags to get rid of those links, but be careful not to delete information you want to keep.
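Since the question mentions a long list of pages, the steps can be wrapped in a loop. The following is a rough sketch, assuming Python 3; `pdf_name_for`, `convert_all`, and the `fetch_html` callable are hypothetical names introduced here, and `fetch_html` stands for whatever download-and-clean-up routine you end up with (fetching the page, fixing the paths, removing the pagination).

```python
import posixpath
from urllib.parse import urlparse


def pdf_name_for(url):
    """Derive an output file name from a lecture URL (hypothetical helper)."""
    # e.g. path == '/courses/115103028/module1/lec1/3.html'
    parts = urlparse(url).path.strip("/").split("/")
    stem = posixpath.splitext(parts[-1])[0]        # '3.html' -> '3'
    # Use the two enclosing folders so names stay unique across lectures
    return "_".join(parts[-3:-1] + [stem]) + ".pdf"


def convert_all(urls, fetch_html, config=None):
    """Convert every URL to its own PDF.

    fetch_html is an assumed callable: it takes a URL and returns the
    cleaned-up HTML string for that page.
    """
    import pdfkit  # imported here so the naming helper works without pdfkit

    for url in urls:
        html_text = fetch_html(url)
        pdfkit.from_string(html_text, pdf_name_for(url), configuration=config)


name = pdf_name_for('http://nptel.ac.in/courses/115103028/module1/lec1/3.html')
```

For the example URL, `name` is `module1_lec1_3.pdf`. Building the name from the folder names is just one possible convention; any scheme that keeps the files distinct works.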

BurningKarl