The cnn_dailymail dataset contains three fields: ID, Text, and Highlights.
I want to get all the records in the cnn_dailymail dataset into a single CSV file, but I have not found a way to do it.
Currently I have downloaded the dataset locally from here (the file called cnn_stories.tgz).
I have extracted the .tgz and got a folder of .story files, each containing the text and summary for one record in the dataset. Because there are 100k records, I have 100k .story files.
The problem with this extraction is that I end up with 100k .story files, each holding an article and its summary. Ideally I want a CSV with two columns, one for the article and one for the highlights, and 100k rows.
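To make the target concrete, here is a rough sketch of the conversion I have in mind. It assumes each .story file contains the article text first, followed by one or more `@highlight` marker lines, each followed by a summary sentence; I have not verified that every file follows this layout. The names `parse_story` and `stories_to_csv`, and the column names, are my own illustrative choices and not part of any library.

```python
import csv
from pathlib import Path


def parse_story(raw: str) -> tuple[str, str]:
    """Split the text of one .story file into (article, highlights).

    Assumption: everything before the first "@highlight" line is article
    text, and every non-empty line after it is a highlight sentence.
    """
    article_lines, highlights = [], []
    in_highlights = False
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line == "@highlight":
            in_highlights = True
        elif in_highlights:
            highlights.append(line)
        else:
            article_lines.append(line)
    return " ".join(article_lines), "\n".join(highlights)


def stories_to_csv(story_dir, csv_path) -> int:
    """Write one CSV row per .story file found under story_dir.

    Returns the number of rows written (excluding the header).
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "article", "highlights"])
        for path in sorted(Path(story_dir).rglob("*.story")):
            article, highlights = parse_story(path.read_text(encoding="utf-8"))
            # The file name (without extension) serves as the record ID.
            writer.writerow([path.stem, article, highlights])
            count += 1
    return count
```

Calling `stories_to_csv("path/to/extracted/stories", "cnn_dailymail.csv")` (both paths are placeholders) would then produce the single file I am after, with the highlights of each article joined by newlines inside one quoted cell.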
I want to do this using only a locally downloaded copy of the dataset (due to proxy issues on my work system).
Alternative way to ask the question: how can I use the load_dataset() function from the datasets library to load a dataset from a locally downloaded .tgz file?