## Introduction

What PostStat is**PostStat** is a set of statistics extensions for
PostgreSQL. It includes routines that
compute a number of cumulative probability distributions, linear regression,
and statistical tests, both parametric and non-parametric. The extensions were
implemented as a way to *test statistical hypothesis* in simple models
and so the returned value from all routines can be -- direct or indirectly --
seen as a *p-value*.

PostStat is licensed under the MIT license (very similar to PostgreSQL's BSD license) and so can be freely used for academic and commercial purposes. Please refer to the Installation section for more details.

## Probability distributions

Probability densities and cumulative distributionsThe motivation for providing probability densities (mass functions in the
case of discrete distributions) and cumulative distribution functions (CDFs) is
to allow the user the possibility of running quick statistical tests and
thus obtain p-values. Some statistical tests rely on point hypothesis, which
justifies the availability of their respective density functions. All
probability distribution routines return `double precision`

values.

### Discrete distributions

`dbinom(x, n, p)`

`pbinom(x, n, p)`

Probability mass (`dbinom`

) and cumulative distribution
(`pbinom`

) functions for the *binomial* distribution with
input `x`

, the *number of successes*, and parameters
`n`

, the *number of experiments*, and `p`

, the
*probability of success*.

`dhyper(x, w, b, n)`

`phyper(x, w, b, n)`

Probability mass (`dhyper`

) and cumulative distribution
(`phyper`

) functions for the *hypergeometric* distribution
with input `x`

, the number of white balls drawn from an urn
that contains black and white balls, and parameters `w`

, the
number of white balls in the urn, `b`

, the number of black balls in
the urn, and `n`

, the number of balls drawn from the urn.

`pnbinom(x, n, p)`

CDF for the *negative binomial* distribution with input
`x`

, the number of failures before `n`

successes in a
sequence of Bernoulli experiments with probability of success
`p`

.

`ppois(x, lambda)`

CDF for the *Poisson* distribution with input `x`

and
parameter * lambda*.

### Continuous distributions

`pexp(x, rate)`

CDF for the *exponential* distribution with input `x`

and
parameter * rate*.

`pgamma(x, shape [, scale])`

CDF for the *gamma* distribution with input `x`

and
parameters * shape* and

*. The default value for*

`scale`

`scale`

is 1.`pnorm(x [, mean [, sd]])`

CDF for the *normal* distribution with input `x`

and
parameters * mean* and

`sd`

,
*standard deviation*. The default values for

`mean`

is 0 and
for `sd`

is 1.`pchisq(x, df [, pnonc])`

CDF for the *chi-squared* distribution with input `x`

and
parameters `df`

, the number of *degrees of freedom*, and
`pnonc`

, the *non-centrality* parameter. The default value
for `pnonc`

is 0.

`pt(x, df)`

CDF for the *Student t* distribution with input `x`

and
parameter `df`

, the number of *degrees of freedom*.

`pf(x, dfn, dfd [, pnonc])`

CDF for the *F* distribution with input `x`

and
parameters `dfn`

, the number of degrees of freedom in the
*numerator*, `dfd`

, the number of degrees of freedom in the
*denominator*, and `pnonc`

, the *non-centrality*
parameter. The default value for `pnonc`

is 0.

### Non-parametric distributions

`dsignrank(x, n)`

`psignrank(x, n)`

Probability density (`dsignrank`

) and cumulative distribution
(`psignrank`

) functions for the distribution of the
*Wilcoxon signed rank statistic* with input `x`

and
parameter `n`

, the *size of the sample*.

`dwilcox(x, m, n)`

`pwilcox(x, m, n)`

Probability density (`dwilcox`

) and cumulative distribution
(`pwilcox`

) functions for the distribution of the
*Wilcoxon rank sum statistic* with input `x`

and
parameters `m`

and `n`

, the
*sizes of the samples*.

## Linear regression

Simple linear fitSimple linear models can be computed using the low level functions
* lsfit* and

*. Both functions compute the minimum-norm solution of linear least squares (LLS) problem specified by the arguments and return the residual sum-of-squares (RSS) and the rank of the design matrix. Note that the actual solution to the LLS problem is not returned but only the sufficient statistics to generate a p-value from a F-test using the higher level functions*

`lssfit`

*and*

`lm`

*.*

`lms`

Aggregate functions `vector_accum`

and `matrix_accum`

can be used to build array arguments for the linear system functions
above. A `fstat`

type, defined as

CREATE TYPE fstat AS (f double precision, pvalue double precision);

is used to return values from `lm`

and `lms`

.

`vector_accum(x)`

Aggregate function that takes a set of `double precision`

values
and returns a one-dimensional array containing them.

`matrix_accum(row)`

Aggregate function that takes a set of one-dimensional
`double precision`

array as rows and returns a two-dimensional
array.

`lsfit(y, X)`

`lssfit(y, X)`

Takes a one-dimensional `double precision`

array `y`

and a two-dimensional `double precision`

array `X`

,
finds the solution `beta`

that minimizes the norm of
`X * beta - y`

, and returns a `double precision`

array
containing the minimum norm and the rank of `X`

in this order.
`lsfit`

computes the minimum norm using a complete orthogonal
factorization of `X`

, while `lssfit`

computes the
singular value decomposition (SVD) of `X`

.

`lm(y, X [, X0])`

`lms(y, X [, X0])`

Returns a `fstat`

type containing the F statistic and p-value
for testing `y ~ X`

against the null `y ~ X0`

.
`y`

and `X`

have the same interpretation and
specification as in `lsfit`

and `lssfit`

.
`X0`

is similar to `X`

and taken to be a zero array if
not provided. `lm`

uses `lsfit`

to derive RSS and
rank while `lms`

uses `lssfit`

.

### Example

Suppose we want to linearly fit the following data:

# select * from linreg; id | x | y ----+---+--- 1 | 1 | 2 2 | 2 | 3 3 | 3 | 5 4 | 4 | 9 5 | 5 | 8 (5 rows)

Fitting `y`

to `beta * x + alpha`

gives:

# select lsfit(vector_accum(y), matrix_accum(ARRAY[1, x])) from linreg; lsfit --------- {4.8,2} (1 row)

To test against the null of `beta = 0`

we issue:

# select lm(vector_accum(y), matrix_accum(ARRAY[1, x]), # matrix_accum(ARRAY[1])) from linreg; lm ---------------------------- (20.25,0.0204904123444534) (1 row)

and so we can reject at 5% level since the p-value ~ 0.02.

## Statistical tests

Parametric and non-parametric testsPostStat also provides other common tests that are more elaborate and cannot be easily cast in a CDF formulation.

`fisher(t [, hybrid])`

`fisher2(t11, t12, t21, t22)`

Returns the p-value of Fisher's exact test for testing the null of
independence of rows and columns in a contingency table `t`

with
fixed marginals. Table `t`

should be a two-dimensional double
precision array. Optional parameter `hybrid`

indicates if exact
probabilities should be computed (the default) or a hybrid approximation
should be assessed.

`fisher2`

is provided for convenience when running the test on 2
x 2 tables, and is equivalent to calling `fisher(t)`

when
`t = ARRAY[[t11, t12], [t21, t22]]`

.

`shapiro(v)`

Returns the p-value of the Shapiro-Wilk test of normality on the
one-dimensional single precision (`real`

) array `v`

.

### Set enrichment analysis

Given a *ranked* list `L`

and a set `S`

of
elements in `L`

, the purpose of Set Enrichment Analysis (SEA) is to
assess how significant is the rank distribution in `S`

with respect
to a uniform rank distribution. `S`

partitions the elements in
`L`

by pertinence, and usually represents a category that
classifies elements.

Statistical significance is assessed by the p-value of a
*max deviation running sum* statistic. More details about this
statistic can be found at the paper that inspired this routine:

Keller, A., Backes, C., and Lenhof, H-P, (2007) "Computation of significance scores of unweighted Gene Set Enrichment Analyses", BMC Bioinformatics (8). http://www.biomedcentral.com/1471-2105/8/290

The p-value is computed similarly to the reference above, using dynamic programming, but uses a different formulation, is more memory efficient, and does not depend on external libraries for extended precision.

`sea(m, l, s)`

Returns the p-value of the max deviation running sum statistic
`s`

for a ranked list of size `m`

and a set of size
`l`

.

`sea(rlist)`

Aggregate function that takes an ordered boolean set `rlist`

as
a ranked list where `true`

signals pertinence to a set and returns
the p-value of its max deviation running sum statistic.

## Installation

How to obtain and install PostStatPostStat is distributed as a source package and can be obtained at
PgFoundry.
After downloading and unpacking, our first task is to compile
`libf77stat`

. A FORTRAN compiler is needed, and you might have to
change the Makefile to suit your needs.

$ cd f77stat && make && make clean

The source package is then installed like any regular PostgreSQL module, that is, just run:

$ make && sudo make install $ psql -f poststat.sql<database>

BLAS and
LAPACK libraries are required to
build PostStat. For better performance it is recommended to install an
*optimized* version of these libraries, such as the ones provided by
the ATLAS project. As in the
`libf77stat`

compilation, you might need to edit the Makefile
according to your setup.

### License

Copyright (c) 2008 Luis Carvalho

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.