I'm using pandas profiling to make HTML reports of some dataframes with 150+ attributes. I'd like to extract some of the information and arrange it in a simple table. Specifically, I need the number of missing data in each variable 'n_missing', disctinct values and their corresponding percentages 'p_missing'.
Something I've done, but I'm struggling to get there is this:
df1 = pd.read_excel('df.xlsx')
profile = df1.profile_report(title="Dataset Profiling Report")
profile.to_file('dataset_report.html') #HTML report
profset = profile.description_set #Extracting the info from the profile object
print(profset.keys())
OUT[]: dict_keys(['analysis', 'table', 'variables', 'scatter', 'correlations', 'missing', 'messages', 'package', 'sample', 'duplicates'])
attributes = profset["variables"]
print(attributes.keys())
OUT: dict_keys(['Attribute 1', 'Attribute 2', 'Attribute 3', 'Attribute 4'...]) #All my columns or attributes.
I thought the "missing" key of the profile.description_set keys would be the one, but i get this:
missing = profset["missing"]
print(missing.keys())
dict_keys(['bar', 'matrix', 'heatmap', 'dendrogram'])
And none of these seem to be the right one.
I've been going through pandas profiling documentation and searching online, and only found a single possible example which led me to the previous code I've wrote.
The end goal is a table to export to excel that looks like this (ignore the actual values):
Attributes n_Missing p_missing n_disctinct p_disctinct
Attribute X 23 0.23 2 0.5
Attribute Y 50 0.50 50 1.0
Attribute Z 0 0.00 100 1.0
...