Data Exploration with SQL — Building a Pareto Table
⭐What is it:
The Pareto chart analysis is a statistical graphical technique used to map and rank business process problems starting from the most frequent to the least frequent with the goal of focusing efforts on the factors that produce the greatest impact overall and subsequently also focus on the ‘what and why’ factors that do not produce a big impact. To do this effectively, it utilizes the Pareto Principle, which is most predominantly known as the 80/20 rule. It is named after Italian economist Vilfredo Pareto, who formulated 80–20 principle, which specifies that 80% (roughly) of consequences come from 20% of the causes.
⭐Plotting:
A Pareto Chart contains a mix of a bar graph and a line chart.
Individual data points are represented using bars, and these bars are arranged in descending order — from the largest to the smallest. A curved line is used to represent the cumulative total of the data set.
⭐Steps for building a Pareto table in SQL:
🌟Identify the cause(category/dimension) and measurement that you wish to analyze:
▶Which sub-categories are responsible for most Sales? Or
▶Which products effect Profit the most and what’s their percentage contribution?
🌟Create an ordered (descending) frequency distribution table using GROUP BY
🌟 Generate cumulative sum (Running Total) using Window Functions
🌟Generate percentage contribution of each cause
🌟Generate cumulative sum of percentage (Running Total) using CTE and Window Functions
Dataset used: Superstore
Edit: A minimum of 30 values gives good representation of Pareto principle. This example uses much less i.e. 17 values. The goal of this post was to introduce the data community with building complex charts in SQL. And fit the code / results in one screen to observe cumulative percentages totaling to 100%.
If interested, try plotting with product name ( instead of sub category).