COST Action IC0702 - SoftStat




Work Group C  

Statistics with Incomplete and Imperfect Data

Statistics with Fuzzy Data

In many applications we meet data that appears to be crisp, but is actually derived from a human perception or estimation of some quantity. This is usually the case if an objective measurement device is not available (for example, because there is no simple underlying physical quantity or no appropriate physical measurement process is known) or carrying out a precise physical measurement would be too costly (at least w.r.t. the benefit that can be expected from the higher precision of the result), so that a human estimates the quantity or only provides a rating on a simple ordinal scale. Treating such data as if it were crisp is obviously inappropriate, and it can be shown that doing so loses information. However, classical statistics offers few other possibilities if one does not want to invoke families of distribution functions, which may not even be applicable if there is no underlying measurable property. Therefore specialized statistical techniques are needed that take the incomplete and imperfect character of such data properly into account. The existing techniques, which transfer statistical notions like expected values and linear regression to the fuzzy setting, are far from sufficient and need expansion and improvement.
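As a minimal illustration of what transferring a statistical notion like the expected value to the fuzzy setting can look like, the sketch below represents perception-based data as triangular fuzzy numbers and computes their sample mean componentwise (which follows from Zadeh's extension principle for this shape). The class and function names are chosen for this example only; it is a sketch, not part of any particular fuzzy-statistics library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TriangularFuzzyNumber:
    """A triangular fuzzy number (left, peak, right) with left <= peak <= right."""
    left: float
    peak: float
    right: float

    def __add__(self, other):
        # For triangular fuzzy numbers, the extension principle
        # makes addition componentwise.
        return TriangularFuzzyNumber(self.left + other.left,
                                     self.peak + other.peak,
                                     self.right + other.right)

    def scale(self, k):
        # Scaling by a nonnegative constant is also componentwise.
        return TriangularFuzzyNumber(k * self.left, k * self.peak, k * self.right)

def fuzzy_mean(data):
    """Sample mean of triangular fuzzy data, computed componentwise."""
    total = data[0]
    for x in data[1:]:
        total = total + x
    return total.scale(1.0 / len(data))

# Three perception-based ratings of the same quantity,
# each with its own imprecision.
ratings = [TriangularFuzzyNumber(2.0, 3.0, 4.0),
           TriangularFuzzyNumber(2.5, 3.5, 5.0),
           TriangularFuzzyNumber(1.5, 2.5, 3.0)]
m = fuzzy_mean(ratings)
# The spread (m.right - m.left) of the result preserves the imprecision
# that collapsing each rating to its peak value would discard.
```

Note that the mean here is itself a fuzzy number: exactly the information that is lost when such data is treated as if it were crisp.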

Psychological versus Statistical Complexity

Soft computing usually emphasizes the "psychological" simplicity of a model, that is, how intuitive and easy to understand a model is for a human who is not a data analysis expert. For this type of simplicity it is more important that the qualitative characteristics of, for instance, a functional relationship are captured than that the correct function class and the best estimates of the parameters are chosen. Two functions of considerably differing "statistical" complexity (in terms of free parameters or the description length of the model) can be equally simple "psychologically" if they exhibit the same qualitative behaviour. Furthermore, a large number of simple models, each describing a certain subspace, may be preferable to a complex overall model, even if the latter has fewer parameters. Finally, existing background knowledge, due to which only deviations from an expected behaviour need to be described instead of a full model being specified, leads to considerable differences from "statistical" complexity. Hence measures for the "statistical" complexity of a model are not suitable for measuring "psychological" complexity. Since "psychological" complexity is what really counts in applications, but existing measures are essentially heuristic in nature, a better understanding of this type of complexity and its difference from "statistical" complexity is needed.
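The claim that functions of differing "statistical" complexity can be equally simple "psychologically" can be made concrete with a toy sketch. The two model functions below are invented for illustration: one has two free parameters, the other four, yet both rise monotonically towards a plateau, so a crude qualitative summary (here, the signs of first differences on a grid) cannot tell them apart.

```python
import math

# Two hypothetical fitted models of differing "statistical" complexity.
def f_exp(x, a=1.0, b=1.0):
    """Saturating exponential: 2 free parameters."""
    return a * (1.0 - math.exp(-b * x))

def f_rat(x, a=1.0, b=1.0, c=0.1, d=1.0):
    """Sum of rational terms: 4 free parameters."""
    return a * x / (b + x) + c * x / (d + x**2)

def qualitative_signature(f, xs):
    """Crude 'psychological' summary: signs of first differences,
    i.e. where the function rises (+1) or falls (-1)."""
    ys = [f(x) for x in xs]
    return tuple(1 if ys[i + 1] > ys[i] else -1 for i in range(len(ys) - 1))

xs = [0.1 * i for i in range(1, 50)]
same_shape = qualitative_signature(f_exp, xs) == qualitative_signature(f_rat, xs)
# Parameter counts differ (2 vs. 4), yet both signatures consist
# entirely of +1: the same qualitative behaviour.
```

A parameter count or description length separates these two models clearly, while a human inspecting their plots would judge them equally simple, which is precisely the mismatch between the two notions of complexity described above.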