This is an attempt at trying to explain the percentile_disc() SQL analytic function.
percentile_disc(p) within group (over order by val), with 0<= p <= 1, returns a value r such that p*100 percent of the values of val are less than r, within a given group (thus the clause within group).
I am trying to demonstrate that with the following SQL snippets.
I now let SQL calculate the 20th, 40th and 70th percentile of the data.
select
percentile_disc(0.2) within group (order by val_one) percentile_20th,
percentile_disc(0.4) within group (order by val_one) percentile_40th,
percentile_disc(0.7) within group (order by val_one) percentile_70th
from
tq84_p;
The 50th percentile is also called the median. The number of records with a value less than the median is (± 1) the same as the number of records with a value greater then the median.
The following query compares the median with the average or mean (avg()).
select
percentile_disc(0.5) within group (order by val_one) median_val_one,
avg ( val_one) mean_val_one ,
percentile_disc(0.5) within group (order by val_two) median_val_two,
avg ( val_two) mean_val_two
from
tq84_p;
The query returns 5 (= the median aka 50th percentile of val_one), 5 (the average of val_one), 10 (the median of val_two) and 909098 (the mean of val_two).
This result demonstrates that the median is much more robust against outliers. The high value of 9999999 drags the average into a region that does not look like the original data at all while the median still is in the region of most data of the original data.
Combining with group by
percentile_disc (as also any other analytic function) can also be combined with group by.
I Insert a few more records for another group (item = 'bar'):
insert into tq84_p values ('bar', 1, 3);
insert into tq84_p values ('bar', 3, 4);
insert into tq84_p values ('bar', 8, 8);