Here is a JSON file:

{
    "id": "68af48116a252820a1e103727003d1087cb21a32",
    "article": [
        "by mark duell .",
        "published : .",
        "05:58 est , 10 september 2012 .",
        "| .",
        "updated : .",
        "07:38 est , 10 september 2012 .",
        "a pet owner starved her two dogs so badly that one was forced to eat part of his mother 's dead body in a desperate attempt to survive .",
        "the mother died a ` horrendous ' death and both were in a terrible state when found after two weeks of starvation earlier this year at the home of katrina plumridge , 31 , in grimsby , lincolnshire .",
        "the barely-alive dog was ` shockingly thin ' and the house had a ` nauseating and overpowering ' stench , grimsby magistrates court heard .",
        "warning : graphic content .",
        "horrendous : the male dog , scrappy -lrb- right -rrb- , was so badly emaciated that he ate the body of his mother ronnie -lrb- centre -rrb- to try to survive at the home of katrina plumridge in grimsby , lincolnshire .",
        "the suffering was so serious that the female staffordshire bull terrier , named ronnie , died of starvation , nigel burn , prosecuting , told the court last friday .",
        "suspended jail term : the dogs were in a terrible state when found after two weeks of starvation at the home of katrina plumridge , 31 -lrb- pictured -rrb- .",
        "the male dog , her son scrappy , was so badly emaciated that he ate her body to try to survive ."
    ],
    "abstract": [
        "neglect by katrina plumridge saw staffordshire bull terrier ronnie die .",
        "dog 's son scrappy was forced to eat her to survive at grimsby house .",
        "alarm raised by letting agent shocked by ` thinnest dog he 'd ever seen '",
    ]
}

I ran df = pd.read_json('100252.json'), but I got the error: ValueError: arrays must all be same length

I then tried

with open('100252.json') as json_data: 
    data = json.load(json_data) 

pd.DataFrame.from_dict(data, orient='index').T.set_index('index')

but I got the error KeyError: "None of ['index'] are in the columns"

How can I solve this? I don't understand what is causing these errors.

EDIT

Source: https://huggingface.co/docs/datasets/loading_datasets.html

From this website, I want to do something similar to

>>> from datasets import Dataset
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3]})
>>> dataset = Dataset.from_pandas(df)

I need to load the JSON file into a DataFrame and then build the dataset from pandas using the datasets library.

Michael
  • What are you trying to achieve? – Alexander Volkovsky May 26 '21 at 14:01
  • @AlexanderVolkovsky Let me edit my question to explain what I want. – Michael May 26 '21 at 14:03
  • @AlexanderVolkovsky Do you have a better understanding now? – Michael May 26 '21 at 14:10
  • I don't understand the desired output. Are you trying to create a dataframe with columns `["id", "article", "abstract"]`? If so, you just need to replace each array with a joined string – Alexander Volkovsky May 26 '21 at 14:20
  • Please post a snippet of the desired output dataframe. The article and abstract seem to be single documents split by sentence. Do you want to load each sentence on a single row, should all sentences be joined to end up in a single cell? It is unclear what the output should look like. – RJ Adriaansen May 26 '21 at 14:22
  • Why not use `from datasets import load_dataset`? – Alexander Volkovsky May 26 '21 at 14:22
  • @AlexanderVolkovsky Yes, `["id", "article", "abstract"]` are my columns. I tried to use `from datasets import load_dataset`, but it doesn't work well for me locally. – Michael May 26 '21 at 14:59
  • @RJAdriaansen I think I want to load each sentence on a single row, but it is not really important for now – Michael May 26 '21 at 15:13

1 Answer

Dataset input must be a dict with equal-sized lists as values. So, there are two options:

  1. Join the sentences into one string and wrap it in a single-element list.
import json
from datasets import Dataset

with open('100252.json') as json_data:
    data = json.load(json_data)

# Every column becomes a list of length 1
data['id'] = [data['id']]
data['article'] = ["\n".join(data['article'])]
data['abstract'] = ["\n".join(data['abstract'])]

Dataset.from_dict(data)

Your dataset will contain a single row.

  2. Align the lists, for example by padding the shorter ones with empty strings.
# Start from a freshly loaded `data` (not the one modified in option 1)
max_len = max(len(data[col]) for col in ['article', 'abstract'])

# Repeat the id on every row and pad the shorter column with ""
data['id'] = [data['id']] * max_len
data['article'] = data['article'] + [""] * (max_len - len(data['article']))
data['abstract'] = data['abstract'] + [""] * (max_len - len(data['abstract']))
Dataset.from_dict(data)
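
If the sentences should stay as lists instead of being joined or padded, one possible variant is to wrap every value in a one-element list, so each column has length 1 and the "article" and "abstract" cells each hold the whole list of sentences. The sketch below is only an illustration: the helper name `to_single_row` and the sample record are made up for the example, and it uses only the standard library so it runs without pandas or datasets installed.

```python
import json


def to_single_row(record):
    """Wrap every value in a one-element list so all columns have length 1.

    List-valued fields (here "article" and "abstract") stay lists of
    sentences; each one simply becomes the single cell of its column.
    """
    return {key: [value] for key, value in record.items()}


# Hypothetical stand-in for the parsed contents of 100252.json.
record = json.loads("""
{
    "id": "example-id",
    "article": ["first sentence .", "second sentence .", "third sentence ."],
    "abstract": ["summary line one .", "summary line two ."]
}
""")

columns = to_single_row(record)

# Every column now has exactly one entry, so the lengths match.
print({key: len(value) for key, value in columns.items()})

# With the libraries installed, `columns` can then be passed on, e.g.:
#   Dataset.from_dict(columns)
#   pd.DataFrame(columns)
```

The result is a single row whose "article" and "abstract" cells are lists of strings, which avoids the unequal-length problem without altering the sentences.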
Alexander Volkovsky
  • Thanks for the answer! However, I don't want to join the sentences. They have to stay a list of sentences. Could you modify the answer? – Michael May 26 '21 at 16:00