I am trying to learn matplotlib and am stuck on a small annoyance. I have this code:

import os
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

current_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(current_dir, "CSV\\")

# DataFrame.append was removed in pandas 2.0, so read the CSV directly
df = pd.read_csv(csv_path + "MainData.csv")

periodB4 = "'2023-05-10' AND '2023-05-13'"

def makeStartEndDates(x):
    start_date, end_date = x.split(' AND ')
    start_date = start_date.strip()
    end_date = end_date.strip()
    return [start_date, end_date]

start_date_b4, end_date_b4 = makeStartEndDates(periodB4)

# copy the slice so the column assignment below does not trigger SettingWithCopyWarning
selected_df = df.iloc[:-5, :].copy()

selected_df['date'] = pd.to_datetime(selected_df['date'], format='%Y-%m-%d')

b4period = selected_df.loc[selected_df['date'].between(start_date_b4, end_date_b4)]
# print(b4period)
plt.bar(b4period['date'], b4period['dau'])
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart Example')

plt.tight_layout()
plt.savefig('chart.png')

[image: bar chart produced by the code above]

So I get an extra date, 2023-05-09, on the x-axis, and every other date is shown twice. This happens only in the chart; I can't see any of it in the CSV or in df.

How can I avoid that, so that the x-axis shows only the dates from '2023-05-10' to '2023-05-13', each exactly once?

The somewhat convoluted date handling is needed because the same values are shared with other scripts that work with BigQuery and SQL.

Here is a sample of the CSV:

[image: sample rows of the CSV]

Output of print(b4period.head(10).to_dict('list')):

{'date': [Timestamp('2023-05-10 00:00:00'), Timestamp('2023-05-11 00:00:00'), Timestamp('2023-05-12 00:00:00'), Timestamp('2023-05-13 00:00:00')], 'new_users': [2885.0, 2954.0, 3160.0, 4086.0], 'dau': [8627.0, 9112.0, 9318.0, 9327.0], 'wau': [28542.0, 28542.0, 28542.0, 28542.0]}
Trenton McKinney
Gwinbleid

2 Answers

  • The issue is that plt.bar draws on a continuous date axis, unlike pandas.DataFrame.plot with kind='bar', which is discrete and categorical. On a continuous axis matplotlib is free to place ticks between the bars, and with a '%Y-%m-%d' formatter several of those ticks can render as the same date.
  • selected_df['date'] = pd.to_datetime(selected_df['date'], format='%Y-%m-%d') should be selected_df['date'] = pd.to_datetime(selected_df['date'], format='%Y-%m-%d').dt.date, because the time component added by to_datetime is not relevant, and the change removes the need to format the x-axis tick labels.
    • Alternatively, use b4period.date = b4period.date.dt.date just before plotting, so other methods, such as .between, aren't affected.
import pandas as pd

# sample data
data = {'date': [pd.Timestamp('2023-05-10 00:00:00'), pd.Timestamp('2023-05-11 00:00:00'), pd.Timestamp('2023-05-12 00:00:00'), pd.Timestamp('2023-05-13 00:00:00')],
        'new_users': [2885.0, 2954.0, 3160.0, 4086.0], 'dau': [8627.0, 9112.0, 9318.0, 9327.0], 'wau': [28542.0, 28542.0, 28542.0, 28542.0]}

b4period = pd.DataFrame(data)

# remove the time component
b4period.date = b4period.date.dt.date

# plot
ax = b4period.plot(kind='bar', x='date', y='dau', xlabel='Category', ylabel='Value', title='Example', rot=0, figsize=(6, 4), legend=False)

[image: resulting bar chart]
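If you would rather keep calling matplotlib directly, a related option is to pass the dates as strings, which matplotlib treats as categories. This is only a minimal sketch of that idea, not part of the answer above; the output file name chart_categorical.png and the Agg backend line are assumptions made so the example runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# sample data taken from the question's output
b4period = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-10", "2023-05-11", "2023-05-12", "2023-05-13"]),
    "dau": [8627.0, 9112.0, 9318.0, 9327.0],
})

# Strings are plotted as categories, so every bar gets exactly one tick.
labels = b4period["date"].dt.strftime("%Y-%m-%d")

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(labels, b4period["dau"])
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.set_title("Bar Chart Example")
ax.tick_params(axis="x", rotation=90)
fig.tight_layout()
fig.savefig("chart_categorical.png")  # hypothetical file name
```

The trade-off is that the axis no longer knows these are dates, so a missing day would not leave a gap between bars.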

Trenton McKinney

Given the above code, if you want the number of tick labels to equal the number of unique dates (and so the number of bars), you can control the number of ticks with MaxNLocator. Note that this will assign N-1 tick labels to the plot, which is why 1 is added to the count. Adding the line below after the set_major_formatter() line...

plt.gca().xaxis.set_major_locator(plt.MaxNLocator(len(b4period['date'].unique())+1))

...will give you, based on your code and the 4 rows of data (note that I removed the -5 from the iloc line), the plot below.

[image: resulting plot]
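As a variation on the same idea, the tick positions can be tied to calendar days rather than to a tick count. The sketch below swaps MaxNLocator for matplotlib.dates.DayLocator, so it is a different technique from the line above; the file name chart_daylocator.png and the Agg backend line are assumptions for a standalone run.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# sample data taken from the question's output
b4period = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-10", "2023-05-11", "2023-05-12", "2023-05-13"]),
    "dau": [8627.0, 9112.0, 9318.0, 9327.0],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(b4period["date"], b4period["dau"])

# One major tick per calendar day, so no date label can appear twice.
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.tick_params(axis="x", rotation=90)
fig.tight_layout()
fig.savefig("chart_daylocator.png")  # hypothetical file name
```

For long periods one tick per day gets crowded; raising interval (for example interval=7) thins the labels out.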

Redox
  • With a longer period and a line plot I ran into another inconvenience: presumably to keep the line from starting exactly at the border of the figure, matplotlib adds one day at each end (for the period 05.10-05.25 the axis starts at 05.09 and ends at 05.26). This is what fixed it for me: `plt.gca().xaxis.set_major_locator(plt.MaxNLocator(len(period['date'].unique())))` `plt.xlim(period['date'].iloc[0], period['date'].iloc[-1])` – Gwinbleid Jun 23 '23 at 07:59
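
For reference, the axis-limit part of that comment can be sketched as a standalone script. The dau values here are placeholders rather than real data, and the file name chart_line.png and the Agg backend line are assumptions; only the set_xlim call reflects the comment.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# period from the comment (05.10 to 05.25); the dau values are placeholders, not real data
period = pd.DataFrame({
    "date": pd.date_range("2023-05-10", "2023-05-25", freq="D"),
    "dau": range(16),
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(period["date"], period["dau"])
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.tick_params(axis="x", rotation=90)

# Pin the axis to the first and last date so no padding is added at either end.
ax.set_xlim(period["date"].iloc[0], period["date"].iloc[-1])

fig.tight_layout()
fig.savefig("chart_line.png")  # hypothetical file name
```

Calling ax.margins(x=0) before plotting should have a similar effect without naming the end dates explicitly.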