I am trying to plot the countries (life_expect['CountryName']) whose life expectancy (life_expect['Value']) is over 80, from a DataFrame of 17001 rows × 6 columns. Processing and plotting take over 30 minutes. What am I doing wrong, and how can I make the code faster?

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

indicators = pd.read_csv('Indicators.csv')

# Keep only the life-expectancy indicators. 'Year' is an integer column,
# not a boolean condition, so it is not used as a mask here.
hist_indicator = indicators['IndicatorName'].str.contains('Life expectancy at birth')
le = indicators[hist_indicator]

# Drop aggregate groupings (OECD, income groups, Euro area) in one pass.
life_expect = le[~le['CountryName'].str.contains('OECD|income|Euro')]

highest_life_expect = life_expect[life_expect['Value'] > 80]
sns.set_style('darkgrid')
sns.catplot(
    x='CountryName', y='Value', hue='Year', row='IndicatorName',
    data=highest_life_expect, palette='YlOrBr', kind='strip',
    dodge=True, height=1.5, aspect=3,
)
plt.xticks(rotation=70)

plt.show()

DataFrame sample:

life_expect = pd.DataFrame({'CountryName': {23215: 'Arab World', 23216: 'Arab World', 23217: 'Arab World', 23311: 'Caribbean small states', 23312: 'Caribbean small states', 23313: 'Caribbean small states', 23657: 'East Asia & Pacific (developing only)', 23658: 'East Asia & Pacific (developing only)', 23659: 'East Asia & Pacific (developing only)', 24249: 'Fragile and conflict affected situations'}, 'CountryCode': {23215: 'ARB', 23216: 'ARB', 23217: 'ARB', 23311: 'CSS', 23312: 'CSS', 23313: 
'CSS', 23657: 'EAP', 23658: 'EAP', 23659: 'EAP', 24249: 'FCS'}, 'IndicatorName': {23215: 'Life expectancy at birth, female (years)', 23216: 'Life expectancy at birth, male (years)', 23217: 'Life expectancy at birth, total (years)', 23311: 'Life expectancy at birth, female (years)', 23312: 'Life expectancy at birth, male (years)', 23313: 'Life expectancy at birth, total (years)', 23657: 'Life expectancy at birth, female (years)', 23658: 'Life expectancy at birth, male (years)', 23659: 'Life expectancy at birth, total (years)', 24249: 'Life expectancy at birth, female (years)'}, 'IndicatorCode': {23215: 'SP.DYN.LE00.FE.IN', 23216: 'SP.DYN.LE00.MA.IN', 23217: 'SP.DYN.LE00.IN', 23311: 'SP.DYN.LE00.FE.IN', 23312: 'SP.DYN.LE00.MA.IN', 23313: 'SP.DYN.LE00.IN', 23657: 'SP.DYN.LE00.FE.IN', 23658: 'SP.DYN.LE00.MA.IN', 23659: 'SP.DYN.LE00.IN', 24249: 'SP.DYN.LE00.FE.IN'}, 'Year': {23215: 1961, 23216: 1961, 23217: 1961, 23311: 1961, 23312: 1961, 23313: 1961, 23657: 1961, 23658: 1961, 23659: 1961, 24249: 1961}, 'Value': {23215: 48.461242672561596, 23216: 46.445430867169, 23217: 47.4276465406613, 23311: 64.82771091890339, 23312: 60.810503390166595, 23313: 62.7689979262271, 23657: 47.95864452502961, 23658: 44.11634167689369, 23659: 45.98732426589021, 24249: 43.4402244401312}})
Marius.T
  • Is it possible to give us an example DataFrame? We can scale it up later to represent your situation. Kr – antoine Jan 13 '21 at 16:50
  • @antoine Is it OK to add the Kaggle link to the DataFrame? – Marius.T Jan 13 '21 at 16:56
  • If my memory serves, Kaggle requires a login, and many people might not have one, so you would miss out on useful help from them. Maybe just add a small piece of code that creates a sample of your df. – antoine Jan 13 '21 at 17:01
  • @antoine Thanks, I will add a sample of it now – Marius.T Jan 13 '21 at 17:02
  • Something like `df = pd.DataFrame({'col1': [ten values], ...})`, just to save time for everyone :) – antoine Jan 13 '21 at 17:02
  • Have you timed what takes so long in your code? You imply that it is the plotting but maybe it is not. – Mr. T Jan 13 '21 at 17:05
  • @antoine I have added `life_expect.head(10)` – Marius.T Jan 13 '21 at 17:17
  • @Marius.T, I tried your data, replicated to 17000 rows, and it takes only a few seconds for me. You may want to time every step in your code to be sure of the real cause. Kr. – antoine Jan 14 '21 at 05:05
  • @antoine For me the execution time of `life_expect.head(10)` is 7.8 seconds. That's incredibly long. Any ideas why? – Marius.T Jan 14 '21 at 09:13
  • We mentioned before that you have to time each step separately. Reading - conversion - plotting. You have them already nicely separated into three blocks in your code. If you know which step slows it down in your environment, you can start looking for reasons. Any specific environments like Jupyter notebook, PyCharm, Spyder, or Anaconda in which you run this? – Mr. T Jan 14 '21 at 09:34
  • I have timed everything: it is reading the initial CSV file, which is 5.5 million rows long, that takes just over 7 seconds. That time doesn't increase much to produce my desired DataFrame `life_expect`, but if I run the plotting code it takes over 30 min. I run it in Visual Studio Code. – Marius.T Jan 14 '21 at 09:36
  • If you run the code outside VisualStudio (double click on the file) does the problem remain? And it could still be the intermediate pandas conversion step that slows it down, although unlikely. – Mr. T Jan 14 '21 at 09:41
  • I tried running it in Windows PowerShell; I have been waiting for 4 min and it is still processing – Marius.T Jan 14 '21 at 09:52
  • @Mr.T, @antoine I have found that removing the `hue` parameter brings the time down to 10 seconds. Any ideas why adding `hue` is so time-consuming? – Marius.T Jan 14 '21 at 11:11
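
The comments above suggest timing the reading, conversion, and plotting steps separately. A minimal sketch of one way to do that with only the standard library follows. The `timed` helper and the `timings` dict are hypothetical names introduced here for illustration, and the three workloads are stand-ins, not the question's actual code.

```python
import time
from contextlib import contextmanager

timings = {}  # hypothetical container: label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


# Stand-in workloads (assumptions). In the question's script these blocks
# would wrap pd.read_csv(...), the filtering step, and the
# sns.catplot(...) / plt.show() calls respectively.
with timed("read"):
    rows = [(i, i % 7) for i in range(100_000)]

with timed("filter"):
    kept = [r for r in rows if r[1] > 3]

with timed("plot"):
    total = sum(r[0] for r in kept)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f} s")
```

Wrapping each stage this way shows which one dominates the run time. Rerunning the plotting stage with and without `hue` would then put a number on the difference reported in the last comment.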
