Mathgnostics
Data treatment tools of mathematical gnostics
The preliminary decision to use the GNU mathematical-statistical environment of the R-project (http://www.r-project.org/) has been supported by a recent development called RExcel (http://rcom.univie.ac.at/), which makes communication and data transfer between Microsoft Excel and programs run in the R-project environment very easy. This also applies to the gnostic data treatment tools being developed within the 2-FUN project.
1. Gnostic data treatment tools: definition
Gnostic software (GSW) is a package of functions for optimum data treatment based on the gnostic theory of individual data and small data samples (http://www.2-fun.org/download/newsletter4_en.pdf).
The GSW was originally developed in the S language within the commercial environment S-PLUS® (S-PLUS is a trademark of the Insightful Corp., Seattle, WA, USA). Its transfer to the GNU R-project was made possible by the similarity of the R language to the S language. However, the two systems differ in the mathematical support they offer to user programs, which makes additional programming effort necessary. This work on the GSW is in progress at the Institute of Public Health, Ostrava (Czech Republic).
2. “Classical” data treatment
Processing of uncertain data based on statistical theory and its extension, robust statistics, still prevails in practice, although the fast development of mathematics in the post-war period brought a large number of alternative non-statistical models of uncertainty. However, most of the alternative approaches are of a purely academic nature and of limited applicability, and they have not resulted in sufficiently universal software that could solve the broad range of problems covered by statistical models. A further factor securing the stable position of statistics is its intuitive appeal and the long tradition of teaching statistics at universities in all branches of science and technology. Nevertheless, there are serious limitations to the practical applicability of statistical methods:
- The most powerful theoretical foundation of statistics (the Law of Large Numbers) imposes requirements on the number of data available for treatment. These requirements cannot always be satisfied because of the high cost of data and/or their uniqueness and non-repeatability.
- Many statistical methods are based on a priori assumptions about the statistical data model. These assumptions are not easily verified and can make the results subjective rather than determined by the data. Methods of robust statistics rely on broader assumptions that are valid for whole classes of models, but the subjectivity remains in the a priori assumption about the data class.
- A simplified and trivialized “statistical” way of thinking has put down roots deep enough to affect everyday activities such as the quality assessment of production, and even official quality standards. This is reflected in standards that require quality to be measured by the first two statistical moments (mean and variance), whose application is meaningful only for a small number of probability distributions (especially the normal, or Gaussian, distribution).
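The last limitation can be illustrated with a small sketch. It is not part of the GSW; the sample values and the helper name `summarize` are assumptions chosen for illustration only. It compares the first two moments with two standard robust summaries (median and median absolute deviation) on a sample before and after one gross error is introduced.

```python
import statistics


def summarize(sample):
    """Return the first two moments and two robust summaries of a sample."""
    median = statistics.median(sample)
    # Median absolute deviation: median of distances from the sample median.
    mad = statistics.median([abs(x - median) for x in sample])
    return {
        "mean": statistics.mean(sample),
        "variance": statistics.variance(sample),  # sample variance (n - 1)
        "median": median,
        "mad": mad,
    }


# Hypothetical measurements of a quantity close to 10.
clean = [9.8, 10.1, 10.0, 9.9, 10.2]
# The same sample with the last value replaced by a gross error.
contaminated = clean[:-1] + [100.0]

clean_summary = summarize(clean)
contaminated_summary = summarize(contaminated)

for name in ("mean", "variance", "median", "mad"):
    print(name, clean_summary[name], contaminated_summary[name])
```

A single outlying value moves the mean far from the bulk of the data and inflates the variance by several orders of magnitude, while the median and the median absolute deviation stay essentially unchanged.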
3. Gnostic data treatment tools
The specific features of the gnostic methodology are the following:
- Theoretically proved validity for individual uncertain data and their small samples.
- No a priori assumptions on special models of data. Data models are derived from data only.
- Two (optional) kinds of inherent robustness of results (with respect to outliers or inliers).
- Maximization of information in results as the optimality criterion of the process.
- Applicability to data contaminated by strong uncertainty of a general nature.
- Adherence to the laws of nature as reflected in mathematics, the theory of real measurements, physics and modern geometry.
- Coherence with statistics in special cases of small data uncertainty.
There are two groups of tasks solved by the gnostic software: marginal (one-dimensional) analysis and multi-dimensional analysis.
3.1 Gnostic marginal analysis
Instead of point estimates (like statistical moments estimated directly from data samples), gnostic marginal analysis is based on kernel estimates of probability and probability density functions, using special theoretically supported kernels and both their additive and non-additive composition. The following groups of functions are available:
- Estimation of four kinds of distribution functions differing in their robustness, flexibility, composition and applicability to different analytical tasks. Three kinds of censored data (left-censored, right-censored and interval data) are considered along with non-censored additive or multiplicative data from open, semi-closed and closed data ranges.
- Functions that apply the distribution functions to estimate the bounds of the data support, the probability of a given quantile and the quantile for a given probability; to predict the probabilities of events; to test data homogeneity robustly and objectively; to split non-homogeneous data samples into clusters of homogeneous sub-samples; to estimate objectively the unique bounds of the membership intervals of a homogeneous data sample; to establish typical data intervals and classify them; to summarize the estimated parameters of a distribution function; to certify a distribution function; and to perform both one- and two-sample hypothesis tests, including the classification and probabilistic evaluation of relations between data samples and their distributions.
- Functions operating on a list of gnostic distribution functions, which make it possible to compare distribution functions, to summarize their parameters and thus review the features of whole data matrices, and to estimate correlation matrices robustly and evaluate the significance of the correlations.
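The general idea of a kernel estimate of a distribution function, and of the two inverse tasks "probability of a given quantile" and "quantile for a given probability", can be sketched as follows. This is only an illustration under stated assumptions: it uses an ordinary Gaussian kernel with a fixed, hand-picked bandwidth and a purely additive composition, not the gnostic kernels of the GSW, and the function names and sample values are hypothetical.

```python
import math


def kernel_cdf(x, sample, bandwidth):
    """Kernel estimate of P(X <= x): the mean of Gaussian CDFs centred on the data."""
    total = 0.0
    for xi in sample:
        total += 0.5 * (1.0 + math.erf((x - xi) / (bandwidth * math.sqrt(2.0))))
    return total / len(sample)


def kernel_pdf(x, sample, bandwidth):
    """Kernel estimate of the density: the mean of Gaussian densities centred on the data."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm


def kernel_quantile(p, sample, bandwidth, tol=1e-9):
    """Quantile for a given probability, found by bisection on the monotone kernel_cdf."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Ten bandwidths beyond the data extremes bracket all but extreme probabilities.
    lo = min(sample) - 10.0 * bandwidth
    hi = max(sample) + 10.0 * bandwidth
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kernel_cdf(mid, sample, bandwidth) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical small data sample and an assumed bandwidth.
sample = [4.1, 4.8, 5.0, 5.3, 6.2]
bandwidth = 0.5

probability_below_5 = kernel_cdf(5.0, sample, bandwidth)
estimated_median = kernel_quantile(0.5, sample, bandwidth)
print(probability_below_5, estimated_median)
```

Because the whole distribution function is estimated, probabilities and quantiles are read from one smooth curve instead of being derived from moments, which is the point made above about kernel estimates replacing point estimates.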
3.2 The multi-dimensional (MD-) analysis
Gnostic MD-analysis is based on the gnostic solution of a generalized regression problem (Kovanic, P.: A New Theoretical and Algorithmical Base for Estimation, Identification and Control, Automatica, Vol.22, No.6, pp.657-674 (1986)), programmed in a way similar to the well-known statistical method of weighted least squares (WLSQ) or to the M-estimators of robust statistics. However, gnostic theory provides original weights for the WLSQ method (or “the influence function” in M-estimators). Gnostic weights ensure not only high robustness of the regression model, but also the maximum of information obtained from the data. The following functions are available:
- Identification and evaluation of both explicit (usual) and implicit robust regression models of relations between data and/or their logarithms (the usual task) as well as between probabilities of data. Implicit models are especially suitable when the subdivision of variables into “only dependent” and “only explanatory” ones is unknown or impossible because all variables depend on each other.
- Applications of the MD-models: objective ordering of samples in an MD-space, homogenization of MD-samples, MD-cluster analysis, monitoring, analysis and prediction of MD-time series, and decision making on MD-samples.
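The programming scheme mentioned at the start of this section, a weighted least squares fit whose weights are recomputed from the residuals, can be sketched for a straight line. The gnostic weights themselves are not given here, so the sketch substitutes the standard Huber weights of robust statistics as a stand-in; the data, the tuning constant 1.345 and all function names are assumptions for illustration and not the GSW implementation.

```python
import statistics


def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def huber_weight(residual, scale, k=1.345):
    """Huber weight (stand-in for gnostic weights): 1 for small residuals, k/|u| beyond k."""
    u = abs(residual) / scale
    return 1.0 if u <= k else k / u


def irls_line_fit(xs, ys, iterations=50):
    """Iteratively reweighted least squares: refit, rescale residuals, reweight."""
    ws = [1.0] * len(xs)
    intercept, slope = weighted_line_fit(xs, ys, ws)
    for _ in range(iterations):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Robust residual scale: median absolute residual, normal-consistent factor.
        scale = 1.4826 * statistics.median([abs(r) for r in residuals])
        if scale == 0.0:
            break  # more than half of the points are fitted exactly
        ws = [huber_weight(r, scale) for r in residuals]
        intercept, slope = weighted_line_fit(xs, ys, ws)
    return intercept, slope, ws


# Hypothetical data: y = 2 + 3x with small deterministic noise and one gross error.
xs = [float(i) for i in range(10)]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
ys = [2.0 + 3.0 * x + e for x, e in zip(xs, noise)]
ys[4] += 30.0  # outlier

ols_intercept, ols_slope = weighted_line_fit(xs, ys, [1.0] * len(xs))
robust_intercept, robust_slope, weights = irls_line_fit(xs, ys)
print(ols_intercept, ols_slope)
print(robust_intercept, robust_slope)
```

The ordinary least squares line is pulled toward the outlier, whereas the reweighted fit assigns it a small weight and stays close to the line followed by the remaining points. Replacing `huber_weight` with another weighting function changes the estimator without changing the rest of the loop.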
Examples: Presentation from the IPSW09, Prague, 26.-29. May 2009
For assistance in these topics contact Pavel Kovanic