
I have stored a CSV file in Google Drive and am trying to load it into torchtext's data.TabularDataset. The error message is "FileNotFoundError: [Errno 2] No such file or directory: 'https://.....'"

Is it impossible to load a CSV file from Google Drive directly into a torchtext TabularDataset?

Here is the code. I have also made a public Colab notebook with the data publicly available.

import torch
from torchtext import data, datasets

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False)  
LABEL = data.LabelField(sequential=False, dtype = torch.float) 

train = data.TabularDataset(path = 'https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta', 
                            format = 'csv', 
                            fields = [('Insult', LABEL), (None, None), ('Comment', TEXT)], 
                            skip_header=False)
Cass Zhao

1 Answer


Let's assume you can afford to download this CSV file. I would suggest using a function built into torchtext: download_from_url.

import os
import torch
from torchtext import data, datasets
from torchtext.utils import download_from_url

# download the file
CSV_FILENAME = 'data.csv'
CSV_GDRIVE_URL = 'https://drive.google.com/uc?export=download&id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta'
download_from_url(CSV_GDRIVE_URL, CSV_FILENAME)

TEXT = data.Field(tokenize = 'spacy', batch_first = True, lower=False)  #from torchtext import data
LABEL = data.LabelField(sequential=False, dtype = torch.float) 

# on Colab the working directory is /content, so the downloaded file ends up there
train = data.TabularDataset(path=os.path.join('/content', CSV_FILENAME),
                            format='csv',
                            fields = [('Insult', LABEL), (None, None), ('Comment', TEXT)],
                            skip_header=False )

Note that the Google Drive link should not be the one with open?id; change it to uc?export=download&id.
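
If you have several sharing links, the rewrite can be scripted. The following is only a minimal sketch using the standard library; the helper name to_direct_download_url is hypothetical (it is not part of torchtext), and it assumes the sharing link carries the file id in an id query parameter, as the link in the question does.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def to_direct_download_url(share_url):
    """Turn an 'open?id=...' sharing link into a 'uc?export=download&id=...' link.

    Hypothetical helper: assumes the file id is in the 'id' query parameter.
    """
    query = parse_qs(urlparse(share_url).query)
    if "id" not in query:
        raise ValueError("no 'id' parameter found in URL: " + share_url)
    file_id = query["id"][0]
    # Rebuild the link in the form used for CSV_GDRIVE_URL above
    return "https://drive.google.com/uc?" + urlencode({"export": "download", "id": file_id})


share_link = "https://drive.google.com/open?id=1eWMjusU3H34m0uml5SdJvYX6gQuB8zta"
print(to_direct_download_url(share_link))
```

The result can then be passed to download_from_url in place of the hand-written CSV_GDRIVE_URL.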

Berriel
  • Thank you, Berriel. Your code runs without an error message. But when I tried to print the first three examples using `print(vars(train[0]),vars(train[1]),vars(train[2]))`, it printed: **{'Insult': 'null', 'Comment': ['null']} {'Insult': 'null', 'Comment': ['1']} {'Insult': '6', 'Comment': ['1']}**. Also, when I try to build the vocab on it, there is an error message: **AttributeError: 'Example' object has no attribute 'Comment'**. I tried it in the [colab notebook](https://colab.research.google.com/drive/1HybMxIFILz5uHCEjLw3Ud2a7LSWcgPXU#scrollTo=z3mAT8JeJWnb) I shared before. – Cass Zhao Apr 07 '20 at 11:02
  • @CassZhao but this is a different problem and should be asked in a different question. Consider upvoting if this answer was helpful. Note that **you** defined `skip_header=False`. I have no idea about the format of your CSV. The original question is about reading csv from google drive. – Berriel Apr 07 '20 at 13:08
  • @CassZhao not sure why you removed the accepted answer... you provided the wrong CSV file link. I fixed it for you now. Instead of `open` it should be `uc?export=download` – Berriel Apr 08 '20 at 12:25
  • Thanks for answering, but it works only for this file. If I change the id (changing only the id, that is, as you said, using `uc?export=download&id` instead of `open?id`), it fetches the HTML again. Is there anything I need to take care of with the CSV file? – Cass Zhao Apr 08 '20 at 17:15
  • @CassZhao if I recall correctly, link sharing must be publicly enabled for the file. Other than that, I don't think this option works for directories. – Berriel Apr 09 '20 at 00:36
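
One way to notice the "fetches the HTML again" case early is to inspect the first bytes of the downloaded file before handing it to TabularDataset. This is only a rough sketch: the helper names starts_like_html and downloaded_file_is_html are hypothetical, and the check assumes the unwanted response is an ordinary HTML page that begins with a doctype or html tag.

```python
def starts_like_html(head):
    """Return True if the given leading bytes look like an HTML page, not CSV data."""
    head = head.lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def downloaded_file_is_html(path, probe_bytes=512):
    """Read the first bytes of a downloaded file and apply the check above."""
    with open(path, "rb") as f:
        return starts_like_html(f.read(probe_bytes))


# Illustrative inputs only (not real Drive responses)
print(starts_like_html(b"<!DOCTYPE html><html><head>"))
print(starts_like_html(b"Insult,Date,Comment\n"))
```

If the check is positive, the file is probably not shared publicly or the link is still in the open?id form.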