SQL: PERCENTILE

percentile_disc(p) within group (over order by val), with 0<= p <= 1, returns a value r such that p*100 percent of the values of val are less than r, within a given group (thus the clause within group).

Test data

First, we need to create a table …

create table tq84_p (
   item      varchar(10),
   val_one   integer    ,
   val_two   integer   
);

Github repository about-SQL, path: /select/analytic/percentile/disc/create-table.sql

… and fill it with some simple test data:

insert into tq84_p values ('foo',  0,       1);
insert into tq84_p values ('foo',  1,       7);
insert into tq84_p values ('foo',  2,       3);
insert into tq84_p values ('foo',  3,       3);
insert into tq84_p values ('foo',  4,       4);
insert into tq84_p values ('foo',  5,      10);
insert into tq84_p values ('foo',  6,      13);
insert into tq84_p values ('foo',  7,      11);
insert into tq84_p values ('foo',  8,      14);
insert into tq84_p values ('foo',  9,      13);
insert into tq84_p values ('foo', 10, 9999999);

Github repository about-SQL, path: /select/analytic/percentile/disc/insert-1.sql

Calculate some percentiles

I now let SQL calculate the 20th, 40th and 70th percentile of the data.

select
   percentile_disc(0.2) within group (order by val_one) percentile_20th,
   percentile_disc(0.4) within group (order by val_one) percentile_40th,
   percentile_disc(0.7) within group (order by val_one) percentile_70th
from
   tq84_p;

Github repository about-SQL, path: /select/analytic/percentile/disc/select-some-percentiles.sql

The select statement returns 2, 4 and 7.

2 is the 20th percentile of the data because 20 % of data data are less than 2 (namely 0 and 1). The same goes for the other two numbers as well.

50th percentile / median vs average

The 50th percentile is also called the median. The number of records with a value less than the median is (± 1) the same as the number of records with a value greater then the median.

The following query compares the median with the average or mean (avg()).

select
   percentile_disc(0.5) within group (order by val_one) median_val_one,
   avg            (                            val_one) mean_val_one  ,
   percentile_disc(0.5) within group (order by val_two) median_val_two,
   avg            (                            val_two) mean_val_two
from
   tq84_p;

Github repository about-SQL, path: /select/analytic/percentile/disc/compare-median-avg.sql

The query returns 5 (= the median aka 50th percentile of val_one), 5 (the average of val_one), 10 (the median of val_two) and 909098 (the mean of val_two).

This result demonstrates that the median is much more robust against outliers. The high value of 9999999 drags the average into a region that does not look like the original data at all while the median still is in the region of most data of the original data.

Combining with group by

percentile_disc (as also any other analytic function) can also be combined with group by.

I Insert a few more records for another group (item = 'bar'):

insert into tq84_p values ('bar', 1, 3);
insert into tq84_p values ('bar', 3, 4);
insert into tq84_p values ('bar', 8, 8);

Github repository about-SQL, path: /select/analytic/percentile/disc/insert-item-bar.sql

select
   item,
   percentile_disc(0.5) within group (order by val_one) median_val_one
from
   tq84_p
group by
   item;

Github repository about-SQL, path: /select/analytic/percentile/disc/select-group-by.sql

The query now returns the median for both groups (bar: 3, baz: 5).

Combining with over clause

percentile_disc can also be combined with the over() clause so that the query returns both, the original data and the calculated percentile.

select
   item,
   val_two,
   percentile_disc(0.5) within group (order by val_two) over (partition by item) median_val_two
from
   tq84_p;

Github repository about-SQL, path: /select/analytic/percentile/disc/select-over.sql

Now, that the select statement has the over() clause, it is also possible to run in in SQL Server 2017.

SQL: PERCENTILE_DISC

Test data

Calculate some percentiles

Implementation on different databases

50th percentile / median vs average

Combining with group by

Combining with over clause

See also