Introduction  ::   Probability distributions  ::   Linear regression  ::   Statistical tests  ::   Installation

Introduction

What PostStat is

PostStat is a set of statistics extensions for PostgreSQL. It includes routines that compute a number of cumulative probability distributions, linear regression, and statistical tests, both parametric and non-parametric. The extensions were implemented as a way to test statistical hypothesis in simple models and so the returned value from all routines can be -- direct or indirectly -- seen as a p-value.

PostStat is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.

Probability distributions

Probability densities and cumulative distributions

The motivation for providing probability densities (mass functions in the case of discrete distributions) and cumulative distribution functions (CDFs) is to allow the user the possibility of running quick statistical tests and thus obtain p-values. Some statistical tests rely on point hypothesis, which justifies the availability of their respective density functions. All probability distribution routines return double precision values.

Discrete distributions

dbinom(x, n, p)
pbinom(x, n, p)

Probability mass (dbinom) and cumulative distribution (pbinom) functions for the binomial distribution with input x, the number of successes, and parameters n, the number of experiments, and p, the probability of success.

dhyper(x, w, b, n)
phyper(x, w, b, n)

Probability mass (dhyper) and cumulative distribution (phyper) functions for the hypergeometric distribution with input x, the number of white balls drawn from an urn that contains black and white balls, and parameters w, the number of white balls in the urn, b, the number of black balls in the urn, and n, the number of balls drawn from the urn.

pnbinom(x, n, p)

CDF for the negative binomial distribution with input x, the number of failures before n successes in a sequence of Bernoulli experiments with probability of success p.

ppois(x, lambda)

CDF for the Poisson distribution with input x and parameter lambda.

Continuous distributions

pexp(x, rate)

CDF for the exponential distribution with input x and parameter rate.

pgamma(x, shape [, scale])

CDF for the gamma distribution with input x and parameters shape and scale. The default value for scale is 1.

pnorm(x [, mean [, sd]])

CDF for the normal distribution with input x and parameters mean and sd, standard deviation. The default values for mean is 0 and for sd is 1.

pchisq(x, df [, pnonc])

CDF for the chi-squared distribution with input x and parameters df, the number of degrees of freedom, and pnonc, the non-centrality parameter. The default value for pnonc is 0.

pt(x, df)

CDF for the Student t distribution with input x and parameter df, the number of degrees of freedom.

pf(x, dfn, dfd [, pnonc])

CDF for the F distribution with input x and parameters dfn, the number of degrees of freedom in the numerator, dfd, the number of degrees of freedom in the denominator, and pnonc, the non-centrality parameter. The default value for pnonc is 0.

Non-parametric distributions

dsignrank(x, n)
psignrank(x, n)

Probability density (dsignrank) and cumulative distribution (psignrank) functions for the distribution of the Wilcoxon signed rank statistic with input x and parameter n, the size of the sample.

dwilcox(x, m, n)
pwilcox(x, m, n)

Probability density (dwilcox) and cumulative distribution (pwilcox) functions for the distribution of the Wilcoxon rank sum statistic with input x and parameters m and n, the sizes of the samples.

Linear regression

Simple linear fit

Simple linear models can be computed using the low level functions lsfit and lssfit. Both functions compute the minimum-norm solution of linear least squares (LLS) problem specified by the arguments and return the residual sum-of-squares (RSS) and the rank of the design matrix. Note that the actual solution to the LLS problem is not returned but only the sufficient statistics to generate a p-value from a F-test using the higher level functions lm and lms.

Aggregate functions vector_accum and matrix_accum can be used to build array arguments for the linear system functions above. A fstat type, defined as

CREATE TYPE fstat AS (f double precision, pvalue double precision);

is used to return values from lm and lms.

vector_accum(x)

Aggregate function that takes a set of double precision values and returns a one-dimensional array containing them.

matrix_accum(row)

Aggregate function that takes a set of one-dimensional double precision array as rows and returns a two-dimensional array.

lsfit(y, X)
lssfit(y, X)

Takes a one-dimensional double precision array y and a two-dimensional double precision array X, finds the solution beta that minimizes the norm of X * beta - y, and returns a double precision array containing the minimum norm and the rank of X in this order. lsfit computes the minimum norm using a complete orthogonal factorization of X, while lssfit computes the singular value decomposition (SVD) of X.

lm(y, X [, X0])
lms(y, X [, X0])

Returns a fstat type containing the F statistic and p-value for testing y ~ X against the null y ~ X0. y and X have the same interpretation and specification as in lsfit and lssfit. X0 is similar to X and taken to be a zero array if not provided. lm uses lsfit to derive RSS and rank while lms uses lssfit.

Example

Suppose we want to linearly fit the following data:

# select * from linreg;
 id | x | y 
----+---+---
  1 | 1 | 2
  2 | 2 | 3
  3 | 3 | 5
  4 | 4 | 9
  5 | 5 | 8
(5 rows)

Fitting y to beta * x + alpha gives:

# select lsfit(vector_accum(y), matrix_accum(ARRAY[1, x])) from linreg;
  lsfit  
---------
 {4.8,2}
(1 row)

To test against the null of beta = 0 we issue:

# select lm(vector_accum(y), matrix_accum(ARRAY[1, x]),
#   matrix_accum(ARRAY[1])) from linreg;
             lm             
----------------------------
 (20.25,0.0204904123444534)
(1 row)

and so we can reject at 5% level since the p-value ~ 0.02.

Statistical tests

Parametric and non-parametric tests

PostStat also provides other common tests that are more elaborate and cannot be easily cast in a CDF formulation.

fisher(t [, hybrid])
fisher2(t11, t12, t21, t22)

Returns the p-value of Fisher's exact test for testing the null of independence of rows and columns in a contingency table t with fixed marginals. Table t should be a two-dimensional double precision array. Optional parameter hybrid indicates if exact probabilities should be computed (the default) or a hybrid approximation should be assessed.

fisher2 is provided for convenience when running the test on 2 x 2 tables, and is equivalent to calling fisher(t) when t = ARRAY[[t11, t12], [t21, t22]].

shapiro(v)

Returns the p-value of the Shapiro-Wilk test of normality on the one-dimensional single precision (real) array v.

Set enrichment analysis

Given a ranked list L and a set S of elements in L, the purpose of Set Enrichment Analysis (SEA) is to assess how significant is the rank distribution in S with respect to a uniform rank distribution. S partitions the elements in L by pertinence, and usually represents a category that classifies elements.

Statistical significance is assessed by the p-value of a max deviation running sum statistic. More details about this statistic can be found at the paper that inspired this routine:

  Keller, A., Backes, C., and Lenhof, H-P, (2007)
  "Computation of significance scores of unweighted Gene Set Enrichment Analyses",
  BMC Bioinformatics (8).
  http://www.biomedcentral.com/1471-2105/8/290

The p-value is computed similarly to the reference above, using dynamic programming, but uses a different formulation, is more memory efficient, and does not depend on external libraries for extended precision.

sea(m, l, s)

Returns the p-value of the max deviation running sum statistic s for a ranked list of size m and a set of size l.

sea(rlist)

Aggregate function that takes an ordered boolean set rlist as a ranked list where true signals pertinence to a set and returns the p-value of its max deviation running sum statistic.

Installation

How to obtain and install PostStat

PostStat is distributed as a source package and can be obtained at PgFoundry. After downloading and unpacking, our first task is to compile libf77stat. A FORTRAN compiler is needed, and you might have to change the Makefile to suit your needs.

$ cd f77stat && make && make clean

The source package is then installed like any regular PostgreSQL module, that is, just run:

$ make && sudo make install
$ psql -f poststat.sql <database>

BLAS and LAPACK libraries are required to build PostStat. For better performance it is recommended to install an optimized version of these libraries, such as the ones provided by the ATLAS project. As in the libf77stat compilation, you might need to edit the Makefile according to your setup.

License

Copyright (c) 2008 Luis Carvalho

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.