I have been looking for the fastest way to work with large data files in Colab. I wondered whether it would be better to download them directly from the source site (e.g. Kaggle) or to upload them to Colab's own directory and work with them from there. I managed the latter, but while the files were unzipping the system suddenly stopped responding and crashed. I tried again and waited longer, until everything was unzipped. However, at the second step the system crashed again.

Would you suggest the best way to work with (large) datasets without crashing the system?

The code I was using:

First, I created a json file (kaggle.json) on Kaggle and copied it into the main directory of Colab.

from google.colab import drive
drive.mount('/content/drive')

! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions download forest-cover-type-prediction

After that, I tried to unzip the data files downloaded from Kaggle into a directory in Colab:

! mkdir unzipped
! unzip train.csv.zip -d unzipped
! unzip test.csv.zip -d unzipped

and then read the data from the CSV files:

import numpy as np
import pandas as pd

train = pd.read_csv("/content/unzipped/train.csv")
test = pd.read_csv("/content/unzipped/test.csv")

# Convert to NumPy once instead of once per line, so only one extra copy is made
data = train.to_numpy()
X = data[100000:5000000, 0:4].astype(float)
Y = data[100000:5000000, 4].astype(int)

Question: How can I load the data directly from my hard drive, and which method is faster?

Marko Kolaksazov
  • I tried to import it directly from the hard drive, but it was not recognized. – Marko Kolaksazov Oct 31 '21 at 15:18
  • Are you trying to mount your *local* drive to colab??? – desertnaut Oct 31 '21 at 21:23
  • Of course not. What I want to show here are the different ways of loading the data into the program. The first is to upload the data into the Colab directory and unpack it there; the second is to load it directly from the (local) hard drive, which does not work for unknown reasons. The third, which I have not shown here because I don't know how to do it, is to load directly from the site hosting the data. – Marko Kolaksazov Nov 01 '21 at 12:16
  • Actually, let me rephrase the question: Which is the best way to import data FROM Kaggle TO Colab, and why (positives and negatives of each)? What I was looking for was the fastest and most reliable way to load the data for use in the program. – Marko Kolaksazov Nov 01 '21 at 12:24
  • I am afraid questions about the "best" way to do anything are actually off-topic as opinion-based; and it would seem that the answer below addresses your third way of doing it (provided of course that you run it in colab, not on your local machine). – desertnaut Nov 01 '21 at 12:30
  • OK, then let me put it another way: which would be the 'fastest' way? (If you don't like the word 'best', then best = fastest.) I would like an explanation of at least these 3 ways (if you think, for example, that loading from Google Drive will be the best, sorry, the fastest, then please explain). – Marko Kolaksazov Nov 01 '21 at 12:48
  • About the answer below: the posted answer is actually ONE of the ways I tried to do this, which was to download the files from Kaggle, then upload them to the main directory of Colab, unzip them there and use them directly in the program. However, this did not work because of lagging (I already explained this in the question); whether the data is too big or something else, I don't know. – Marko Kolaksazov Nov 01 '21 at 12:48
  • I am afraid you sound confused; the answer below does **not** involve downloading anything locally: the commands shown are to be run *in colab*, and the files will be downloaded only once from Kaggle directly to colab - no uploading whatsoever – desertnaut Nov 01 '21 at 15:01

1 Answer

Try getting an API token from the account tab of your Kaggle profile. Then upload it to Google Colab and run the following commands to set up the Kaggle library:

! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
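If you prefer to do the same setup from a Python cell, here is a minimal sketch using only the standard library. The function name `install_kaggle_token` and its `home_dir` parameter are my own, not part of the Kaggle package; it is only meant to mirror the `mkdir` / `cp` / `chmod 600` commands above, with the difference that it does not fail when `~/.kaggle` already exists.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle with owner-only permissions.

    `install_kaggle_token` and `home_dir` are hypothetical names used for
    illustration; they are not provided by the kaggle package.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    target_dir = home / ".kaggle"
    # exist_ok=True makes this safe to re-run, unlike a bare `mkdir`
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Same effect as `chmod 600`: read/write for the owner only
    os.chmod(target, 0o600)
    return target
```

In Colab this would be called as `install_kaggle_token("kaggle.json")`, assuming the token was uploaded to the current working directory.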

After the setup, use the command below to download the dataset:

! kaggle datasets download <name-of-dataset>

For a more detailed walkthrough, click here.
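Since the crash in the question happens around unzipping and loading, one thing that may help is to read the CSV straight out of the downloaded zip archive, row by row, instead of extracting it and loading everything at once. The sketch below uses only the standard library. The function names are hypothetical, and the column layout (first four columns as features, fifth as label) is simply copied from the slicing in the question, so adjust it to your data.

```python
import csv
import io
import zipfile


def iter_csv_rows_from_zip(zip_path, member, start=0, stop=None):
    """Yield rows (lists of strings) from a CSV stored inside a zip archive.

    Nothing is extracted to disk and only one row is held in memory at a
    time. `start` and `stop` count data rows, with the header excluded.
    """
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            next(reader, None)  # skip the header row
            for index, row in enumerate(reader):
                if stop is not None and index >= stop:
                    break
                if index >= start:
                    yield row


def load_features_and_labels(zip_path, member, start=0, stop=None):
    """Collect columns 0-3 as float features and column 4 as int labels."""
    X, Y = [], []
    for row in iter_csv_rows_from_zip(zip_path, member, start, stop):
        X.append([float(value) for value in row[0:4]])
        Y.append(int(row[4]))
    return X, Y
```

With the file names from the question this would look like `X, Y = load_features_and_labels("train.csv.zip", "train.csv", start=100000, stop=5000000)`, assuming the archive contains a member called `train.csv`. If you would rather stay in pandas, `pd.read_csv` can also read a single-file zip directly and accepts a `chunksize` argument to process the file in pieces.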

Adhavan M