2

I am interested in data mining and I am writing my thesis on it. For my thesis I want to use the Yelp Dataset Challenge data set, but I cannot open it since it is in JSON format and almost 2 GB. The website says the data set can be opened in Python using mrjob, but I am not very good at programming. I searched online and looked at some of the code Yelp provides on GitHub, but I could not find an article or anything else that clearly explains how to open the data set. Can you please tell me, step by step, how to open this file and maybe how to convert it to CSV?

https://www.yelp.com.tr/dataset_challenge

https://github.com/Yelp/dataset-examples

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
Bengi Koseoglu
  • 159
  • 4
  • 10
  • 1
    Welcome to Stack Overflow. For future reference, these kinds of questions are not what this site is all about. SO is a place for asking specific questions about specific programming problems. I invite you to browse the site in order to get a better sense of what we are all about here. For now, I am going to recommend that this question be closed. – McMath Feb 23 '16 at 21:40
  • I would research the R language. – Matt Feb 23 '16 at 21:44
  • I would avoid the R language because it is horrible for dealing with 2GB of data. Python *is* the better choice here. – Has QUIT--Anony-Mousse Feb 24 '16 at 06:42
  • @Bengi Koseoglu, were you able to extract the data set or export it to a CSV file? – goofyui Nov 10 '16 at 15:51

3 Answers

5

The data is in .tar format. When you extract it, you get another file with no extension; rename that file to .tar and extract it again. You will then get all the JSON files.
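A minimal sketch of that unwrap. The real download's filenames will differ; here a dummy double-wrapped archive is built on the spot so the steps can be run end-to-end:

```shell
# Build a dummy double-wrapped archive to stand in for the real download.
echo '{"stars": 5}' > yelp_academic_dataset_review.json
tar -cf payload yelp_academic_dataset_review.json   # inner tar, no .tar extension
tar -cf yelp_dataset_challenge.tar payload
rm payload yelp_academic_dataset_review.json

# The actual unwrap, as described above:
tar -xf yelp_dataset_challenge.tar   # yields a file "payload" with no extension
mv payload payload.tar               # rename it so it is recognizably a tar
tar -xf payload.tar                  # yields the JSON files
ls yelp_academic_dataset_review.json
```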

KS HARSHA
  • 67
  • 2
  • 7
3

Yes, you can use pandas. Take a look:

import pandas as pd

# read the entire file into a list of lines
# (note: this loads the whole ~2 GB file into memory)
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = [line.rstrip() for line in data]

# each line is a separate JSON object, so join them into one JSON array
data_json_str = "[" + ','.join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)

Now 'data_df' contains the Yelp data ;) In case you want to convert it directly to CSV, you can use this script:

https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py

I hope it helps.
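If the whole file does not fit in memory, pandas can also read newline-delimited JSON directly and convert it to CSV a chunk at a time. A sketch, with a tiny generated sample standing in for the real file (point the reader at the actual yelp_academic_dataset_review.json instead):

```python
import json
import pandas as pd

# Create a tiny sample in the same one-JSON-object-per-line layout,
# so this sketch runs end-to-end without the 2 GB download.
with open('sample_reviews.json', 'w') as f:
    f.write(json.dumps({'stars': 5, 'text': 'great'}) + '\n')
    f.write(json.dumps({'stars': 2, 'text': 'meh'}) + '\n')

# lines=True: each line is a separate JSON object;
# chunksize: read a bounded number of lines at a time.
reader = pd.read_json('sample_reviews.json', lines=True, chunksize=10000)

for i, chunk in enumerate(reader):
    # Write the first chunk with a header, append the rest without one.
    chunk.to_csv('sample_reviews.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)
```

This keeps memory bounded by the chunk size instead of the file size.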

1

To process huge JSON files, use a streaming parser.

Many of these files are not a single JSON document but a stream of JSON objects, one per line (known as "JSON Lines" or newline-delimited JSON). A regular JSON parser will then consider everything after the first object to be junk.

With a streaming parser, you can start reading the file, process each part, write it to the desired output, and then continue reading.

There is no single JSON-to-CSV conversion.

Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.

The reason is that JSON is a tree but CSV is not. There is no universal, efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree.
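A sketch of that fixed-attribute approach using only the standard library: read one JSON object per line (streaming, so memory stays flat) and write a chosen set of fields to CSV. The field names and filenames here are assumptions for illustration, with a tiny generated stand-in for the real file:

```python
import csv
import json

# Tiny stand-in for the real file: one JSON object per line.
with open('reviews.json', 'w') as f:
    f.write(json.dumps({'business_id': 'b1', 'stars': 4, 'text': 'nice'}) + '\n')
    f.write(json.dumps({'business_id': 'b2', 'stars': 3, 'text': 'ok'}) + '\n')

# The fixed set of attributes to extract from every object.
fields = ['business_id', 'stars']

with open('reviews.json') as src, open('reviews.csv', 'w', newline='') as dst:
    # extrasaction='ignore' drops any keys not in the chosen field list.
    writer = csv.DictWriter(dst, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for line in src:                  # one object at a time: constant memory
        writer.writerow(json.loads(line))
```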

Start coding to become a better programmer. To succeed with this amount of data, you need to become a better programmer.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194