
I want to use a large image dataset from Kaggle to train a network I've created, but I have limited storage space on the machine I'm working on. Is there any way to fetch a Kaggle dataset from a URL and load/read its images directly in a Python script and start training on them, without downloading the 5+ GB of data to my machine? I don't have access to that much space.

One of the datasets I want to use is, for example, the CASIA dataset: https://www.kaggle.com/datasets/sophatvathana/casia-dataset

I want something like:

import requests
import numpy as np
import cv2

url_casia = "https://www.kaggle.com/datasets/sophatvathana/casia-dataset/download?datasetVersionNumber=1"

response = requests.get(url_casia, stream=True)
# or something like: response = urllib.request.urlopen(url_casia)

img_list = np.array([cv2.imread(image) for image in response])

I know this doesn't work because the response's content type is text/html; charset=utf-8, but I was wondering if there is any way to get the images, either as a zipfile or in any other form readable from Python, without actually downloading the archive to disk.

seeckhout
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Сергей Кох Mar 16 '23 at 08:05
  • @СергейКох I have edited my question; I hope it is clearer now. – seeckhout Mar 16 '23 at 11:13

1 Answer


When you request data from a web page, the response is loaded into your local or virtual machine's memory, not onto disk. To read the images this way, though, you would need the URL of each individual image, and then run something like this for each one:

import requests

resp = requests.get(
    url,  # the direct URL of a single image
    stream=True,
)
for chunk in resp.raw:
    print("Do something with each chunk...")

This is basically web scraping, and I imagine it doesn't suit your use case.
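If you do manage to get a direct, authenticated link to the dataset archive itself (note that the public page URL from your question returns HTML, not the zip), you can hold the zip in memory and read image bytes straight out of it without ever writing to disk. A minimal sketch of that pattern, using a tiny zip built in memory as a stand-in for the real download:

```python
import io
import zipfile

def images_from_zip_bytes(zip_bytes, suffixes=(".jpg", ".jpeg", ".png", ".tif")):
    # Yield (name, raw_bytes) for every image file inside an in-memory zip.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.lower().endswith(suffixes):
                yield name, zf.read(name)

# Self-contained demo: build a small zip in memory instead of fetching one.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("casia/a.jpg", b"\xff\xd8fake-jpeg-bytes")
    zf.writestr("casia/readme.txt", b"not an image")

found = dict(images_from_zip_bytes(buf.getvalue()))
print(sorted(found))  # only the .jpg entry passes the suffix filter
```

With a real archive URL you would obtain zip_bytes via requests.get(direct_url).content and decode each raw_bytes entry with something like cv2.imdecode(np.frombuffer(raw, np.uint8), cv2.IMREAD_COLOR). Keep in mind the whole zip sits in RAM, so this only avoids disk space, not memory.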

The Kaggle Python API lets you download the entire dataset locally, which is probably the better option.
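A sketch of that route, using the official kaggle package (pip install kaggle). It assumes you have API credentials in ~/.kaggle/kaggle.json; the dataset slug comes from the URL in your question, and the target directory here is an arbitrary choice:

```python
import os

def download_casia(target_dir):
    # Download and unpack the dataset via the official Kaggle API client.
    from kaggle.api.kaggle_api_extended import KaggleApi
    api = KaggleApi()
    api.authenticate()  # reads credentials from ~/.kaggle/kaggle.json
    api.dataset_download_files(
        "sophatvathana/casia-dataset",  # slug from the question's URL
        path=target_dir,
        unzip=True,  # unpack the archive after downloading
    )

target = os.path.join("/tmp", "casia-dataset")
if os.path.exists(os.path.expanduser("~/.kaggle/kaggle.json")):
    os.makedirs(target, exist_ok=True)
    download_casia(target)  # needs network access and ~5 GB of free space
else:
    print("No Kaggle credentials found; skipping download.")
```

Note this still downloads the full 5+ GB, so it works around the HTML-response problem but not the storage constraint.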