There seems to be a problem with your understanding of the concept of embedded images. The url you have posted is actually what your browser returns when you select 'View Image' or 'Copy Image Location' (or something similar, depending on the browser) from the context menu, and it is formally called a data URI.
It is not an http url pointing to an image, and you cannot use it to retrieve actual images from any server: this is exactly what requests points out in the error message.
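To make the distinction concrete, here is a minimal sketch of how the two kinds of src value can be told apart. The helper name is_data_uri and both sample strings are made up for illustration; the payload is just the text "hello", not a real image.

```python
def is_data_uri(src):
    """Return True when src carries its content inline instead of pointing to a server."""
    return src.strip().lower().startswith("data:")


# Made-up examples: an inline payload and an ordinary http address
embedded = "data:text/plain;base64,aGVsbG8="
linked = "http://server/dir/picture.png"

print(is_data_uri(embedded))  # True
print(is_data_uri(linked))    # False
```

Everything the browser needs is already inside the first string, which is why there is nothing to fetch from a server.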
So, how do we get these images?
The following script will handle this task:
import requests
from lxml import html
import binascii as ba

url = "<Page URL goes here>"  # Ex: http://server/dir/images.html
page = requests.get(url)
struct = html.fromstring(page.text)
# Collect the src attribute of every <img> element in the page
images = struct.xpath('//img/@src')
for i, img in enumerate(images, start=1):
    # The image type inside the data URI becomes the file extension
    ext = img.partition('data:image/')[2].split(';')[0]
    with open('newim' + str(i) + '.' + ext, 'wb') as f:
        # Decode the base64 payload and write the raw bytes
        f.write(ba.a2b_base64(img.partition('base64,')[2]))
print("Done")
To run it you will need to install the lxml library in addition to requests.
Here follows a short description of how the script functions:
First it requests the url from the server and, after it gets the server's response, stores it in a Response object (page).
Then it utilizes html.fromstring() from lxml to transform the "textified" content of page into a tree structure which can be processed by commands utilizing XPath syntax, like this one: images = struct.xpath('//img/@src').
The result is a list containing the contents of the src attribute of every image in the page. In this case (embedded images) these are the data URIs.
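If you want to see what such a list looks like without a server or lxml, roughly the same result can be collected with the standard library's html.parser. This is a swapped-in alternative for illustration only, not what the script does; the class name ImgSrcCollector and the sample markup are made up.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> also end up here, because the
        # default handle_startendtag() calls handle_starttag()
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Made-up markup: one embedded image and one ordinary linked image
sample = '<h1>Images</h1><img src="data:image/png;base64,AAAA" /><br /><img src="pic.jpg">'
collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)
```

For embedded images every entry starts with data:image/, which is what the next step relies on.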
Then, for every image in the list, it first gets the image type (which will be used as the extension of the newim file), using partition() and split(), and stores it in ext. Then it converts the base64 encoded data to binary (using a2b_base64() from the binascii module) and writes the output to the file.
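The splitting and decoding step can be tried on its own, without any page or file. The sketch below wraps it in a hypothetical helper, split_data_uri; the sample URI is made up and its payload decodes to the 8-byte PNG signature only, not to a complete image.

```python
import binascii


def split_data_uri(uri):
    """Return (extension, raw_bytes) for a base64-encoded image data URI."""
    # The text between 'data:image/' and the first ';' is the image type
    ext = uri.partition('data:image/')[2].split(';')[0]
    # Everything after 'base64,' is the encoded payload
    payload = uri.partition('base64,')[2]
    return ext, binascii.a2b_base64(payload)


# Made-up URI whose payload is just the PNG file signature
sample = "data:image/png;base64,iVBORw0KGgo="
ext, data = split_data_uri(sample)
print(ext)   # png
print(data)  # b'\x89PNG\r\n\x1a\n'
```

Note that this assumes the URI really is a base64 image; for any other kind of src value, partition() finds nothing and both parts come back empty.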
As a small demo, save this HTML code (as, e.g., images.html) somewhere on your server:
<h1>Images</h1>
<img src="" />
<br />
<img src=""></img>
<br />
<img src=""/>
and point to it in the script: requests.get("http://yourserver/somedir/images.html").
When you run the script you will get 3 images, named newim1.png, newim2.png and newim3.jpg respectively.
As a reminder, do note that this script (in its current form) will only handle embedded images. If you also want to process ordinary linked images, you have to modify it accordingly (but this is not difficult).
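One possible shape for that modification is sketched below, assuming that any src not starting with data:image/ is an ordinary link which may be relative to the page. The function name resolve_src and the addresses are hypothetical.

```python
import binascii
from urllib.parse import urljoin


def resolve_src(src, page_url):
    """Classify one src value.

    Returns ('embedded', extension, raw_bytes) for a base64 image data URI,
    or ('linked', absolute_url, None) for an ordinary image reference.
    """
    if src.startswith('data:image/'):
        ext = src.partition('data:image/')[2].split(';')[0]
        data = binascii.a2b_base64(src.partition('base64,')[2])
        return 'embedded', ext, data
    # Relative paths are resolved against the address of the page itself
    return 'linked', urljoin(page_url, src), None


page_url = "http://server/dir/images.html"  # placeholder address
print(resolve_src("data:image/png;base64,iVBORw0KGgo=", page_url))
print(resolve_src("pics/photo.jpg", page_url))
```

For a 'linked' result, the bytes to write would then come from a second request, e.g. requests.get(absolute_url).content.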