
The cnn_dailymail dataset contains 3 fields: ID, Text, and Highlights.

I want to get all the records in the cnn_dailymail dataset into a single CSV, but have been unsuccessful in finding a way.

Currently I have downloaded the dataset locally from here (the file called cnn_stories.tgz). I have unzipped the .tgz and got a folder of .story files, each holding the text and summary for one record in the dataset. Because there are 100k records, I have 100k .story files.

The problem with this extraction is that I end up with 100k .story files, each containing a text and its summary. Ideally I want a CSV with 2 columns, one for the article and one for the highlights, and 100k rows.

I want to do this using only a locally downloaded dataset (due to proxy issues on my work system).

Alternative way to ask the question: how do I use the load_dataset() function from the datasets library to load a dataset from a locally downloaded .tgz file?

newbie101

1 Answer


You could run the following code in Google Colab (it takes time!), then either download the files or save them to your Google Drive.

import tensorflow_datasets as tfds

# Look the dataset up by name; this avoids depending on the builder class's module path.
cnn_builder = tfds.builder("cnn_dailymail")
cnn_builder.download_and_prepare()  # slow: downloads and preprocesses the stories
datasets = cnn_builder.as_dataset()
train_dataset, test_dataset = datasets["train"], datasets["test"]
print(len(train_dataset))
print(len(test_dataset))
# Convert each split to a pandas DataFrame, then write it out as CSV.
train_df = tfds.as_dataframe(train_dataset, cnn_builder.info)
test_df = tfds.as_dataframe(test_dataset, cnn_builder.info)
train_df.to_csv("daily_mail_train.csv")
test_df.to_csv("daily_mail_test.csv")

Hope this helps!

Jovi DSilva
  • Apologies mate, tensorflow_datasets is blocked for me internally (on my work system). Would you know a way to directly load a dataset from the *cnn_dailymail.tgz* file that I have downloaded locally? – newbie101 Oct 18 '22 at 09:51
  • Run it in Google Colab, don't run it locally. Then you can download the generated CSV files. – Jovi DSilva Oct 18 '22 at 11:40
  • Your other choice is to manually process the files and create the CSV files yourself, as detailed here: `https://github.com/abisee/cnn-dailymail` – Jovi DSilva Oct 18 '22 at 11:42
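
For the manual route, here is a minimal sketch using only the Python standard library, so it should run behind a proxy. It assumes each .story file holds the article text followed by one or more `@highlight` markers, each followed by a summary sentence (the layout handled by the repository linked above). The function names `parse_story` and `stories_to_csv`, the folder name `cnn/stories`, and the output name `cnn_stories.csv` are placeholders of mine, not part of any library or of the dataset.

```python
import csv
from pathlib import Path


def parse_story(text):
    """Split the text of one .story file into (article, highlights).

    Assumption: everything before the first "@highlight" marker is the
    article, and every non-empty line after it is a highlight sentence.
    """
    article_lines, highlights = [], []
    in_highlights = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("@highlight"):
            in_highlights = True  # marker line itself is not kept
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), "\n".join(highlights)


def stories_to_csv(story_dir, csv_path):
    """Write one CSV row per .story file in story_dir; return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["article", "highlights"])
        for path in sorted(Path(story_dir).glob("*.story")):
            article, highlights = parse_story(path.read_text(encoding="utf-8"))
            writer.writerow([article, highlights])
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical paths: point story_dir at the folder extracted from the .tgz.
    n = stories_to_csv("cnn/stories", "cnn_stories.csv")
    print(f"wrote {n} rows")
```

If the datasets library is usable on your machine, the resulting file can then be loaded offline with `load_dataset("csv", data_files="cnn_stories.csv")`. Note that this sketch writes everything into one CSV and does not reproduce the train/validation/test split.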