
I need to do some data scraping, and it would be very useful if I could implement an algorithm that downloads a subset of images matching a certain query, but only from a specific 'root' website (for example, all images in www.example.com, including all subdirectories such as www.example.com/sub1).

I already know that it might be impossible to find all subdirectories of a root website unless they are listed somewhere. Since I do not know all the subdirectories, I think I should avoid looping over them and extracting all images (with an online image extractor, for instance).
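For reference, the crawling approach I'd rather avoid would look roughly like this (a minimal sketch using requests and BeautifulSoup; the root URL and the page cap are placeholders):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

ROOT = "https://www.example.com"  # placeholder root site
DOMAIN = urlparse(ROOT).netloc

def crawl_images(root, max_pages=100):
    seen, queue, images = {root}, deque([root]), set()
    while queue and len(seen) <= max_pages:  # rough cap on discovered pages
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # Collect every image on the page, resolved to an absolute URL.
        for img in soup.find_all("img", src=True):
            images.add(urljoin(url, img["src"]))
        # Follow only links that stay on the root domain.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == DOMAIN and link not in seen:
                seen.add(link)
                queue.append(link)
    return images

print(crawl_images(ROOT))
```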

So in my opinion the easiest thing to do is to let Google do most of the work, so that it outputs all (or at least most) of the images contained in any subdirectory of the 'root', and then run the query on those results.
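To illustrate what I mean: Google's Custom Search JSON API supports site-restricted image search, so this step could be sketched as follows (YOUR_API_KEY and YOUR_CX are placeholders for an API key and a Programmable Search Engine ID):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
CX = "YOUR_CX"            # placeholder

def google_site_images(query, site, start=1):
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": CX,
            "q": query,
            "searchType": "image",    # return image results
            "siteSearch": site,       # restrict to the root site
            "siteSearchFilter": "i",  # "i" = include only this site
            "num": 10,                # max results per request
            "start": start,           # pagination offset
        },
        timeout=10,
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]

print(google_site_images("animals", "www.example.com"))
```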

The problem is thus divided into two parts:

  1. Get all the images from Google image search that come from a specific website.
  2. Keep only the subset of images matching the query. I guess this would be possible with some AI recognition (all images that are labeled as animals, or buildings, and so on); see the sketch after this list.
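
For the second part, here is a rough sketch of the kind of filtering I have in mind, using the pretrained CLIP model from Hugging Face transformers for zero-shot labeling (the label set and the model checkpoint are just examples):

```python
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

LABELS = ["an animal", "a building", "something else"]  # example labels

def classify(image_url):
    # Download the image and ask CLIP which label fits it best.
    image = Image.open(requests.get(image_url, stream=True, timeout=10).raw)
    inputs = processor(text=LABELS, images=image,
                       return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
    return LABELS[probs.argmax().item()]

# Usage: keep only the images labeled as animals, e.g.
# animal_urls = [u for u in image_urls if classify(u) == "an animal"]
```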

I know that this is a very broad question, so I do not expect any answers with code.

What I would like to know is:

  1. Do you think it is even possible to do this?
  2. What programs would you suggest for this purpose (both for the search and for the image recognition)?

If you think this question belongs on another Stack Exchange site, let me know; I'm trying my best to comply with the rules. Thanks.

StackyGuy