This module provides a number of statistics functions. Use
require("stats")
to load it.
Compute the median of an array of values
m = median (a [,i])
This function computes the median of an array of values. The median is defined to be the value such that half of the array values will be less than or equal to the median value and the other half greater than or equal to the median value. If the array has an even number of values, then the median value will be the smallest value that is greater than or equal to half the values in the array.
If called with a second argument, then the optional argument specifies the dimension of the array over which the median is to be taken. In this case, an array of one less dimension than the input array will be returned.
This function makes a copy of the input array and then partially sorts the copy. For large arrays, it may be undesirable to allocate a separate copy. If memory use is to be minimized, the median_nc function should be used.
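For example, using a small hypothetical array:
a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0];
m = median (a);    % median of all the elements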
median_nc, mean
Compute the median of an array
m = median_nc (a [,i])
This function computes the median of an array. Unlike the median function, it does not make a temporary copy of the array and, as such, is more memory efficient at the expense of increased run-time. See the median function for more information.
median, mean
Compute the mean of the values in an array
m = mean (a [,i])
This function computes the arithmetic mean of the values in an array. The optional parameter i may be used to specify the dimension over which the mean is to be taken. The default is to compute the mean of all the elements.
Suppose that a is a two-dimensional MxN array. Then
m = mean (a);
will assign the mean of all the elements of a to m.
In contrast,
m0 = mean(a,0);
m1 = mean(a,1);
will assign an N element array to m0, and an array of M elements to m1. Here, the jth element of m0 is given by mean(a[*,j]), and the jth element of m1 is given by mean(a[j,*]).
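As a concrete sketch, consider a hypothetical 2x3 array built with _reshape; the values in the comments follow from the definitions above:
a = _reshape ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [2, 3]);
m  = mean (a);      % 3.5, the mean of all six elements
m0 = mean (a, 0);   % [2.5, 3.5, 4.5], the N=3 column means
m1 = mean (a, 1);   % [2.0, 5.0], the M=2 row means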
stddev, median, kurtosis, skewness
Compute the standard deviation of an array of values
s = stddev (a [,i])
This function computes the standard deviation of the values in the specified array. The optional parameter i may be used to specify the dimension over which the standard deviation is to be taken. The default is to compute the standard deviation of all the elements.
This function returns the unbiased N-1 form of the sample standard deviation.
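That is, for an array a of N values, the returned value is equivalent to:
s = sqrt (sum ((a - mean(a))^2) / (length(a) - 1));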
mean, median, kurtosis, skewness
Compute the skewness of an array of values
s = skewness (a)
This function computes the so-called skewness of the array a.
mean, stddev, kurtosis
Compute the kurtosis of an array of values
s = kurtosis (a)
This function computes the so-called kurtosis of the array a.
This function is defined such that the kurtosis of the normal distribution is 0, and is also known as the ``excess-kurtosis''.
mean, stddev, skewness
Compute binomial coefficients
c = binomial (n [,m])
This function computes the binomial coefficients (n m), where (n m) is given by n!/(m!(n-m)!). If m is not provided, then an array of coefficients for m=0 to n will be returned.
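For example:
c = binomial (5, 2);   % 10, since 5!/(2!*3!) = 10
c = binomial (5);      % the array [1, 5, 10, 10, 5, 1]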
Compute the Chisqr CDF
cdf = chisqr_cdf (Int_Type n, Double_Type d)
This function returns the probability that a random number distributed according to the chi-squared distribution for n degrees of freedom will be less than the non-negative value d.
The importance of this distribution arises from the fact that if n independent random variables X_1,...,X_n are distributed according to a gaussian distribution with a mean of 0 and a variance of 1, then the sum X_1^2 + X_2^2 + ... + X_n^2 follows the chi-squared distribution with n degrees of freedom.
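For example, since the 95th percentile of the chi-squared distribution with 3 degrees of freedom is approximately 7.81:
cdf = chisqr_cdf (3, 7.81);   % approximately 0.95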
chisqr_test, poisson_cdf
Compute the Poisson CDF
cdf = poisson_cdf (Double_Type m, Int_Type k)
This function computes the CDF for the Poisson probability distribution parameterized by the value m. For values of m>100 and abs(m-k)<sqrt(m), the Wilson and Hilferty asymptotic approximation is used.
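For example, the probability of observing 2 or fewer events when the mean is 4 is:
cdf = poisson_cdf (4.0, 2);   % exp(-4)*(1 + 4 + 8), about 0.24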
chisqr_cdf
Compute the Kolmogorov CDF using Smirnov's asymptotic form
cdf = smirnov_cdf (x)
This function computes the CDF for the Kolmogorov distribution using Smirnov's asymptotic form. In particular, the implementation is based upon equation 1.4 from W. Feller, "On the Kolmogorov-Smirnov limit theorems for empirical distributions", Annals of Math. Stat, Vol 19 (1948), pp. 177-190.
ks_test, ks_test2, normal_cdf
Compute the CDF for the Normal distribution
cdf = normal_cdf (x)
This function computes the CDF (integrated probability) for the normal distribution.
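For example, assuming that a mean and standard deviation may be passed as additional arguments, as in the ks_test example elsewhere in this document:
cdf = normal_cdf (0.0);           % 0.5 for the standard normal, by symmetry
cdf = normal_cdf (23.0, 20, 3);   % P(X<23) for mean 20 and stddev 3, about 0.84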
smirnov_cdf, mann_whitney_cdf, poisson_cdf
Compute the Mann-Whitney CDF
cdf = mann_whitney_cdf (Int_Type m, Int_Type n, Int_Type s)
This function computes the exact CDF P(X<=s) for the Mann-Whitney distribution. It is used by the mw_test function to compute p-values for small values of m and n.
mw_test, ks_test, normal_cdf
Compute the 2-sample KS CDF using the Kim-Jennrich Algorithm
p = kim_jennrich_cdf (UInt_Type m, UInt_Type n, UInt_Type c)
This function returns the exact two-sample Kolmogorov-Smirnov probability that D_mn <= c/(mn), where D_mn is the two-sample Kolmogorov-Smirnov statistic computed from samples of sizes m and n.
The algorithm used is that of Kim and Jennrich. The run-time scales as m*n. As such, it is recommended that the asymptotic form given by the smirnov_cdf function be used for large values of m*n.
For more information about the Kim-Jennrich algorithm, see: Kim, P.J., and R.I. Jennrich (1973), Tables of the exact sampling distribution of the two sample Kolmogorov-Smirnov criterion Dmn(m<n), in Selected Tables in Mathematical Statistics, Volume 1, (edited by H. L. Harter and D.B. Owen), American Mathematical Society, Providence, Rhode Island.
smirnov_cdf, ks_test2
Compute the CDF for the F distribution
cdf = f_cdf (t, nu1, nu2)
This function computes the CDF for the F distribution and returns its value.
f_test2
One sample Kolmogorov test
p = ks_test (CDF [,&D])
This function applies the Kolmogorov test to the data represented by CDF and returns the p-value representing the probability that the data values are ``consistent'' with the underlying distribution function. If the optional parameter is passed to the function, then it must be a reference to a variable that, upon return, will be set to the value of the Kolmogorov statistic.
The CDF array that is passed to this function must be computed from the assumed probability distribution function. For example, if the data are constrained to lie between 0 and 1, and the null hypothesis is that they follow a uniform distribution, then the CDF will be equal to the data. If the data are assumed to be normally (Gaussian) distributed, then the normal_cdf function can be used to compute the CDF.
Suppose that X is an array of values obtained from repeated measurements of some quantity. The values are assumed to follow a normal distribution with a mean of 20 and a standard deviation of 3. The ks_test function may be used to test this hypothesis using:
pval = ks_test (normal_cdf(X, 20, 3));
ks_test2, kuiper_test, t_test, z_test
Two-Sample Kolmogorov-Smirnov test
prob = ks_test2 (X, Y [,&d])
This function applies the 2-sample Kolmogorov-Smirnov test to two datasets X and Y and returns the p-value for the null hypothesis that they share the same underlying distribution.
If the optional parameter is passed to the function, then
it must be a reference to a variable that, upon return, will be
set to the value of the statistic.
If length(X)*length(Y)<=10000, the kim_jennrich_cdf function will be used to compute the exact probability. Otherwise an asymptotic form will be used.
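A minimal sketch using two small hypothetical samples:
X = [1.2, 0.8, 1.5, 2.1, 0.9, 1.7];
Y = [1.0, 1.9, 2.4, 1.1, 2.0, 2.6];
prob = ks_test2 (X, Y, &d);   % d is set to the KS statistic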
ks_test, kuiper_test, kim_jennrich_cdf
Perform a 1-sample Kuiper test
pval = kuiper_test (CDF [,&D])
This function applies the Kuiper test to the data represented by CDF and returns the p-value representing the probability that the data values are ``consistent'' with the underlying distribution function. If the optional parameter is passed to the function, then it must be a reference to a variable that, upon return, will be set to the value of the Kuiper statistic.
The CDF array that is passed to this function must be computed from the assumed probability distribution function. For example, if the data are constrained to lie between 0 and 1, and the null hypothesis is that they follow a uniform distribution, then the CDF will be equal to the data. If the data are assumed to be normally (Gaussian) distributed, then the normal_cdf function can be used to compute the CDF.
Suppose that X is an array of values obtained from repeated measurements of some quantity. The values are assumed to follow a normal distribution with a mean of 20 and a standard deviation of 3. The kuiper_test function may be used to test this hypothesis using:
pval = kuiper_test (normal_cdf(X, 20, 3));
kuiper_test2, ks_test, t_test
Perform a 2-sample Kuiper test
pval = kuiper_test2 (X, Y [,&D])
This function applies the 2-sample Kuiper test to two datasets X and Y and returns the p-value for the null hypothesis that they share the same underlying distribution.
If the optional parameter is passed to the function, then
it must be a reference to a variable that, upon return, will be
set to the value of the Kuiper statistic.
The p-value is computed from an asymptotic formula suggested by Stephens, M.A., Journal of the American Statistical Association, Vol 69, No 347, 1974, pp 730-737.
ks_test2, kuiper_test
Apply the Chi-square test to two or more datasets
prob = chisqr_test (X_1[], X_2[], ..., X_N [,&t])
This function applies the Chi-square test to the N datasets X_1, X_2, ..., X_N, and returns the probability that the datasets were all drawn from the same underlying distribution. Each of the arrays X_k must be the same length.
If the last parameter is a reference to a variable, then upon return
the variable will be set to the value of the statistic.
chisqr_cdf, ks_test2, mw_test
Apply the Two-sample Wilcoxon-Mann-Whitney test
p = mw_test(X, Y [,&w])
This function performs a Wilcoxon-Mann-Whitney test and returns the p-value for the null hypothesis that there is no difference between the distributions represented by the datasets X and Y. If a third argument is given, it must be a reference to a variable whose value upon return will be set to the rank-sum of X.
The function makes use of the following qualifiers:
side=">" : H0: P(X<Y) >= 1/2 (right-tail)
side="<" : H0: P(X<Y) <= 1/2 (left-tail)
The default null hypothesis is that P(X<Y)=1/2.
There are a number of definitions of this test. While the exact definition of the statistic varies, the p-values are the same.
If length(X)<50, length(Y)<50, and ties are not present, then the exact p-value is computed using the mann_whitney_cdf function. Otherwise a normal distribution is used.
This test is often referred to as the non-parametric generalization of the Student t-test.
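For example, a left-tailed test using two small hypothetical samples, with the side qualifier passed in the usual S-Lang fashion:
X = [1.3, 2.2, 1.8, 2.9, 2.0];
Y = [2.5, 3.1, 2.8, 3.6, 2.7];
p = mw_test (X, Y, &w; side="<");   % w is set to the rank-sum of X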
mann_whitney_cdf, ks_test2, chisqr_test, t_test
Apply the Two-sample F test
p = f_test2 (X, Y [,&F])
This function computes the two-sample F statistic and its p-value for the data in the X and Y arrays. This test is used to compare the variances of two normally-distributed data sets, with the null hypothesis that the variances are equal. The return value is the p-value, which is computed using the module's f_cdf function.
The function makes use of the following qualifiers:
side=">" : H0: Var[X] >= Var[Y] (right-tail)
side="<" : H0: Var[X] <= Var[Y] (left-tail)
f_cdf, ks_test2, chisqr_test
Perform a Student t-test
pval = t_test (X, mu [,&t])
This function computes Student's t-statistic and returns the p-value that the data X are consistent with a Gaussian distribution with a mean of mu. If the optional parameter is passed to the function, then it must be a reference to a variable that, upon return, will be set to the value of the statistic.
The following qualifiers may be used to specify a 1-sided test:
side="<" Perform a left-tailed test
side=">" Perform a right-tailed test
Strictly speaking, this test should only be used if the variance of the data is equal to that of the assumed parent distribution. Use the Mann-Whitney-Wilcoxon test (mw_test) if the underlying distribution is non-normal.
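For example, to test whether some hypothetical measurements are consistent with a mean of 20:
X = [19.5, 21.2, 18.7, 20.4, 22.1];
pval = t_test (X, 20.0, &t);   % t is set to the t-statistic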
mw_test, t_test2
Perform a 2-sample Student t-test
pval = t_test2 (X, Y [,&t])
This function compares two data sets X and Y using the Student t-statistic. It is assumed that the parent populations are normally distributed with equal variance, but with possibly different means. The test looks for differences in the means.
The welch_t_test2 function may be used if it is not known that the parent populations have the same variance.
t_test, welch_t_test2, mw_test
Perform Welch's t-test
pval = welch_t_test2 (X, Y [,&t])
This function applies Welch's t-test to the two datasets X and Y and returns the p-value that the underlying populations have the same mean. The parent populations are assumed to be normally distributed, but need not have the same variance.
If the optional parameter is passed to the function, then
it must be a reference to a variable that, upon return, will be
set to the value of the statistic.
The following qualifiers may be used to specify a 1-sided test:
side="<" Perform a left-tailed test
side=">" Perform a right-tailed test
t_test2
Perform a Z test
pval = z_test (X, mu, sigma [,&z])
This function applies a Z test to the data X and returns the p-value that the data are consistent with a normally-distributed parent population with a mean of mu and a standard deviation of sigma. If the optional parameter is passed to the function, then it must be a reference to a variable that, upon return, will be set to the value of the Z statistic.
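For example, to test some hypothetical data against a parent population with a mean of 20 and a standard deviation of 3:
X = [19.5, 21.2, 18.7, 20.4, 22.1];
pval = z_test (X, 20.0, 3.0, &z);   % z is set to the Z statistic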
t_test, mw_test
Kendall's tau Correlation Test
pval = kendall_tau (x, y [,&tau])
This function computes Kendall's tau statistic for the paired data values (x,y). It returns the p-value associated with the statistic.
The current version of this function uses an asymptotic formula based upon the normal distribution to compute the p-value.
The following qualifiers may be used to specify a 1-sided test:
side="<" Perform a left-tailed test
side=">" Perform a right-tailed test
spearman_r, pearson_r
Compute Pearson's Correlation Coefficient
pval = pearson_r (X, Y [,&r])
This function computes Pearson's r correlation coefficient of the two datasets X and Y. It returns the p-value that X and Y are mutually independent.
If the optional parameter is passed to the function, then
it must be a reference to a variable that, upon return, will be
set to the value of the correlation coefficient.
The following qualifiers may be used to specify a 1-sided test:
side="<" Perform a left-tailed test
side=">" Perform a right-tailed test
kendall_tau, spearman_r
Spearman's Rank Correlation test
pval = spearman_r(x, y [,&r])
This function computes the Spearman rank correlation coefficient (r) and returns the p-value that x and y are mutually independent.
If the optional parameter is passed to the function, then
it must be a reference to a variable that, upon return, will be
set to the value of the correlation coefficient.
The following qualifiers may be used to specify a 1-sided test:
side="<" Perform a left-tailed test
side=">" Perform a right-tailed test
kendall_tau, pearson_r
Compute the sample correlation between two datasets
c = correlation (x, y)
This function computes Pearson's sample correlation coefficient between 2 arrays. It is assumed that the standard deviation of each array is finite and non-zero. The returned value falls in the range -1 to 1, with -1 indicating that the data are anti-correlated, and +1 indicating that the data are completely correlated.
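For example, using hypothetical data in which y decreases linearly with x:
x = [1.0, 2.0, 3.0, 4.0, 5.0];
y = [10.0, 8.0, 6.0, 4.0, 2.0];
c = correlation (x, y);   % -1, since the data are perfectly anti-correlated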
covariance, stddev