I am using Jupyter Notebook in Microsoft Azure. Since I cannot upload big files to Azure, I need to read the file from a link. The CSV file I want to read is on Kaggle.

I did this:

!pip install kaggle

import os

os.environ['KAGGLE_USERNAME'] = "*********"

os.environ['KAGGLE_KEY'] = "*********"

import kaggle

But I don't know how to read the file now. In other cases I use pandas to read files: file = pd.read_csv("file/link") and then I am able to clean and organize my data. But it is not working in this situation. Could you please help me?

I want to be able to read and manipulate the data as I would with pd.read_csv, because I need it for my data science project. This is the dataset I want to work with: https://www.kaggle.com/START-UMD/gtd#globalterrorismdb_0718dist.csv

Yana

1 Answer

Kaggle provides extensive documentation for its command-line API. The API is written in Python and its source is public, so it is straightforward to work out how to call the Kaggle API directly from Python.

Assuming you've already exported the username and key as environment variables

import os
os.environ['KAGGLE_USERNAME'] = '<kaggle-user-name>'
os.environ['KAGGLE_KEY'] = '<kaggle-key>'
os.environ['KAGGLE_PROXY'] = '<proxy-address>' ## skip this step if you are not working behind a firewall

or you've successfully downloaded kaggle.json from the API section in your Kaggle Account page and copied this JSON to ~/.kaggle/ i.e. the Kaggle configuration directory in your system.

Then, you can use the following code in your Jupyter notebook to load this dataset to a pandas dataframe:

  1. Import libraries
import kaggle as kg
import pandas as pd

  2. Download the dataset locally
kg.api.authenticate()
kg.api.dataset_download_files(dataset="START-UMD/gtd", path='gt.zip', unzip=True)

  3. Read the downloaded dataset
df = pd.read_csv('gt.zip/globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')
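
The steps above can be wrapped in a small helper, sketched here under a few assumptions. The function name download_and_read and the folder name gtd_data are made up for illustration; they are not part of the Kaggle API. As far as I can tell, the path argument of dataset_download_files is the destination folder (which is why the code above reads from 'gt.zip/...'), and unzip=True extracts the downloaded archive into that folder, so a plain folder name is less confusing than one ending in .zip.

```python
import os


def download_and_read(dataset, csv_name, target_dir="gtd_data", **read_csv_kwargs):
    """Download a Kaggle dataset into target_dir and load one CSV from it.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    # Imported inside the function because importing kaggle looks for
    # credentials (environment variables or ~/.kaggle/kaggle.json) right away.
    import kaggle
    import pandas as pd

    kaggle.api.authenticate()
    # path is treated as the destination folder; unzip=True extracts the
    # downloaded archive into that folder.
    kaggle.api.dataset_download_files(dataset=dataset, path=target_dir, unzip=True)
    # Extra keyword arguments (e.g. encoding) are passed through to pandas.
    return pd.read_csv(os.path.join(target_dir, csv_name), **read_csv_kwargs)


# Usage (needs valid Kaggle credentials, so it is left commented out):
# df = download_and_read("START-UMD/gtd", "globalterrorismdb_0718dist.csv",
#                        encoding="ISO-8859-1")
```

With this, the dataset identifier is the "owner/dataset-name" part of the Kaggle URL, and the CSV name is the file listed on the dataset page.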
  • Thank you. It worked this way. But I get this warning: /home/nbuser/anaconda3_420/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2728: DtypeWarning: Columns (4,6,31,33,61,62,63,76,79,90,92,94,96,114,115,121) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result). – Yana Jul 17 '19 at 09:13
  • Also, could you please explain what exactly this line is doing, so I can apply it next time too: kg.api.dataset_download_files(dataset="START-UMD/gtd", path='gt.zip', unzip=True) ? – Yana Jul 17 '19 at 09:15
  • @Yana On your first question, read this: https://stackoverflow.com/a/27232309/1561981 On your second question, see https://github.com/Kaggle/kaggle-api/blob/775b1f74c64d139514f8beab2548ce4d62a0cf93/kaggle/api/kaggle_api_extended.py#L1112 – Ankush Chauhan Jul 19 '19 at 18:26
  • Ankush, Thank you! – Yana Jul 20 '19 at 05:38
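
For the DtypeWarning raised in the comments, the warning text itself names two ways out: pass dtype, or set low_memory=False. Below is a minimal sketch of both, using a tiny made-up CSV in place of the real file. The column names id and code are invented for the example; for the real dataset you would name the columns listed in the warning.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file: "code" mixes numbers and text.
csv_text = "id,code\n1,10\n2,abc\n3,30\n"

# Option 1: declare the column type up front so pandas does not have to guess.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# Option 2: let pandas process the file in one pass before inferring types,
# at the cost of more memory on large files.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed["code"].tolist())
print(df_whole.dtypes)
```

The same keyword arguments can be added to the pd.read_csv call in the answer, next to encoding='ISO-8859-1'.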