I have a large dataset in JSON format from which I want to extract the important attributes, that is, the ones that capture the most variance. I want to use these attributes as the hash key of a search engine built on the dataset.

In short, the question is how to do feature selection on JSON data.

1 Answer

You could read the data into a pandas DataFrame object with the pandas.read_json() function. You can then use this DataFrame to gain insight into your data. For example:

import pandas

data = pandas.read_json(json_file)  # json_file: path to your JSON file
data.head()  # Returns the first five rows
data.info()  # Prints column names, dtypes, and non-null counts

Or you can use matplotlib on this DataFrame to plot a histogram for each numerical attribute:

import matplotlib.pyplot as plt

data.hist(bins=50, figsize=(20, 15))  # One histogram per numerical column
plt.show()

If you are interested in correlations between attributes, you can use the pandas.plotting.scatter_matrix() function (called pandas.scatter_matrix() in older pandas versions).
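As a numeric complement to the scatter plots, the pairwise correlations can also be computed directly. This is only a minimal sketch: the sample records and the column names (price, weight, rating) are made up for illustration and stand in for your own JSON file.

```python
import io

import pandas as pd

# Hypothetical sample records standing in for the real JSON file.
json_text = io.StringIO(
    '[{"price": 10, "weight": 1.0, "rating": 5},'
    ' {"price": 20, "weight": 2.1, "rating": 4},'
    ' {"price": 30, "weight": 2.9, "rating": 2},'
    ' {"price": 40, "weight": 4.2, "rating": 1}]'
)
data = pd.read_json(json_text)

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = data.select_dtypes("number").corr()
print(corr.round(2))
```

Pairs with a correlation close to 1 or -1 carry largely redundant information, so you may only want to keep one attribute of each such pair.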

You have to pick the attributes that best fit your task manually; these tools help you understand the data and gain insight into it.
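Since the question asks about variance, one rough aid for the manual choice is to rank the numerical attributes by their variance after rescaling. The following is a sketch under assumptions, not a full feature-selection method: the sample records and column names are hypothetical, and min-max scaling is just one possible way to make variances comparable across units.

```python
import io

import pandas as pd

# Hypothetical sample records; "flag" is constant on purpose.
json_text = io.StringIO(
    '[{"price": 10, "weight": 1.0, "flag": 1},'
    ' {"price": 20, "weight": 2.1, "flag": 1},'
    ' {"price": 30, "weight": 2.9, "flag": 1},'
    ' {"price": 40, "weight": 4.2, "flag": 1}]'
)
data = pd.read_json(json_text)
numeric = data.select_dtypes("number")

# Min-max scale each column to [0, 1] so variances do not depend on units.
# Constant columns have a span of 0; replace it with 1 to avoid dividing by zero.
span = (numeric.max() - numeric.min()).replace(0, 1)
scaled = (numeric - numeric.min()) / span

# Rank columns from most to least variance.
ranking = scaled.var().sort_values(ascending=False)
print(ranking)
```

Attributes with near-zero variance, like the constant flag column here, cannot distinguish records and are poor candidates for a search key.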

ITiger