0

I'm working with BigQuery, and I'm interested in plotting a Lorenz curve (for inequality, related to the Gini coefficient).

How can I produce the data for a plot like this with SQL?

enter image description here

The curve is a graph showing the proportion of overall income or wealth assumed by the bottom x% of the people, although this is not rigorously true for a finite population (see below). It is often used to represent income distribution, where it shows for the bottom x% of households, what percentage (y%) of the total income they have. The percentage of households is plotted on the x-axis, the percentage of income on the y-axis. It can also be used to show distribution of assets. In such use, many economists consider it to be a measure of social inequality. https://en.wikipedia.org/wiki/Lorenz_curve

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325

1 Answers1

0

First define your data - in my case these are all the pages in Wikipedia and their number of pageviews (removing special pages too):

WITH wiki_prefixes AS (SELECT ['File:', 'Talk:', 'Template_talk:', 'Wikipedia:', 'Category:', 'User_talk:', 'Page:', 'Template:', 'Category_talk:' , 'User:', 'Author:', 'Portal:', 'Wikipedia_talk:', 'Portal_talk:', 'File_talk:', 'Draft:', 'Help:', 'Draft_talk:', 'en:', 'Book_talk:', 'Module:', 'MOS:', 'Special:', 'Book:'] x)

, data AS (
  SELECT *
  FROM `fh-bigquery.wikipedia_extracts.201912_en_totals`
  WHERE title NOT IN ('-', 'Main_Page')
  AND (
    title NOT LIKE '%:%'
    OR REGEXP_EXTRACT(title, '[^:]*:') NOT IN UNNEST((SELECT(x) FROM wiki_prefixes))
  )
)

Once you have the data you want to plot, then you can get each row into one of 100 or 1000 buckets - plus the cumulative percentage of the total that each represents with this query:


SELECT ROUND(100*cum_views/total_views,3) cum_percent, *
FROM (
  SELECT SUM(views) OVER(ORDER BY bucket) cum_views, *, SUM(views) OVER() total_views
  FROM (
    SELECT 1+fhoffa.x.int(rn/(SELECT (1+COUNT(*))/1000 FROM data)) bucket, COUNT(*) pages, SUM(views) views
      , STRING_AGG(title ORDER BY views DESC LIMIT 3) sample_titles
    FROM (
      SELECT title, views, ROW_NUMBER() OVER (ORDER BY views) rn
      FROM data
    )
    GROUP BY 1
  ) 
)

enter image description here

If you want to visualize it in Data Studio, check these steps:

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325