Why you should learn SQL Window Functions

ยท

4 min read

What is an SQL Window Function???

A window function is a function that uses values from one or multiple rows to return a value for each row. Like an aggregate function used with the GROUP BY clause, a window function also performs calculations for a set of rows. The difference between an aggregate function and a window function is simple. An aggregate function collapses all the rows into a single result, which means you lose access to the individual rows.

We all are aware of this well-known syntax GROUP BY. If you have been writing queries for a long time you are no stranger to the fact that this clause has some limitations, and could give us errors while executing some types of queries.

To give you guys an example I have chosen this dataset from Kaggle called "Unicorn Companies", Short preview of the dataset -

Screenshot 2022-04-02 094334.jpg

Now we want to execute a query that can give us the top 5 countries with the highest number of companies, for this, we are going to try the GROUP BY clause

select Country, count(*) Total_Companies_Per_Country_Count from Unicorn_Companies
group by Country
order by count(*) desc

top 5 coutries.jpg It worked!!! Now let's include the company's names and their valuations along with the above output.

select Company, Valuation_B, Country, count(*) Total_Companies_Per_Country_Count from Unicorn_Companies
group by Country
order by count(*) desc

OOPS!!!!!

Msg 8120, Level 16, State 1, Line 10 Column 'Unicorn_Companies.Company' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Group by clause worked fine as long we were using columns that were related to GROUP BY clause or aggregate functions.

To get the desired result we'll use the OVER function along with PARTITION BY, these are the most used functions in window functions

select 
    Company,
    Valuation_B,
    country,
    count(country) over(partition by country order by country) Total_Companies_Per_Country_Count
from unicorn_companies

valu_comp_over.jpg

Now it is working but the order is still not correct. We need this result in descending order, to do that we can subquery the above query and order by the Total_Companies_Per_Country_Count

select * from (select 
    Company,
    Valuation_B,
    country,
    count(country) over(partition by country order by country) Total_Companies_Per_Country_Count
from unicorn_companies) a
order by Total_Companies_Per_Country_Count desc

over().jpg

I tried to find a better solution to get the above result but unfortunately didn't find any. If you guys know please let me know.

Now, with the window function, we can do much more than replace group by clause. we can find the rank of companies based on their valuation. To do that I first have to clean the valuation column, it will not work if we have the '$' symbol prepended to every valuation. To get rid of that symbol we'll use the REPLACE function - replace(Valuation_B,'$', '') Now since we are cleaning our dataset I will clean the total_raised column as well. This is the query I used to clean the data and stored it in a different table called unicorn_companies_clean -

SELECT * INTO unicorn_companies_clean FROM (SELECT [Company]
      ,replace(Valuation_B,'$', '')Valuation_B
      ,[Date_Joined]
      ,[Country]
      ,[City]
      ,[Industry]
      ,[Select_Inverstors]
      ,[Founded_Year]
      ,replace(replace(replace([Total_Raised], '$', ''), 'M', ''), 'B', '') Total_Raised
      ,[Financial_Stage]
      ,[Investors_Count]
      ,[Deal_Terms]
      ,[Portfolio_Exits]
  FROM [New].[dbo].[Unicorn_Companies]) A

cleaned.jpg

Now that we have the data cleaned, we will resume our window function session. To find out the rank of valuation from highest to lowest we can use RANK() with our OVER() function -

select 
    Company,
    Valuation_B,
    country,
    RANK() OVER(order by Valuation_B desc) Rank
from unicorn_companies_clean

rank.jpg

YEAH!!!!!

So these were some examples of how useful Window Function can be. Like Rank and Count, we can find much more with this, like SUM(), MAX(), MIN(), AVG(),DENSE_RANK(), ROW_NUMBER(), NTILE(),LAG(), and so much more.

Here are some of the resources I used to learn and write this blog, feel free to check out these resources -

Written resource:

Videos:

If you want more structured learning here are some courses I think might be helpful -

ย