New Features of Categorical Principal Components Analysis for Complicated Data Sets, Including Data Mining




2Eucid Inc. & SPSS Inc, Chicago, U.S.A.

Abstract: This paper focuses on a technique for performing principal components analysis with nonlinear scaling of variables and with correspondence analysis features. Special attention is given to particular properties that make the technique suited for data mining. In addition to the fitting of points for individual objects or subjects, additional points may be fitted to identify groups among them. There is a strong emphasis on graphical display of the results in biplots (with variables and objects) and triplots (with variables, objects, and groups). The information contained in the biplots and triplots is used to draw special graphs that identify particular groups in the data that stand out on selected variables. Supplementary variables and objects may be used to link different data sets in a single representation. When a fixed configuration of points is given, the technique may be used for property fitting, i.e., fitting external information into the space. The method can be used to analyze very large data sets by assuming that the variables are categorical; when continuous variables are available as well, however, these can be made discrete by various optimal procedures. Ordered (ordinal) and non-ordered (nominal) data can be handled by the use of monotonic or non-monotonic (spline) transformations. A state-of-the-art computer program (called CATPCA) is available from SPSS Categories 10.0 onwards.

1 Introduction

A prevalent type of multivariate categorical data consists of a small number of variables, each with a limited number of categories, obtained for a very large number of objects (subjects) and presented in the form of a multiway contingency table. Models are fitted to the cell counts, with respect to the margins of the table. Currently, the most popular method for analyzing this type of categorical data is loglinear analysis. The information in a multiway contingency table can be efficiently transformed into a profile frequency matrix. Here, a weight is attached to each profile, corresponding to the number of occurrences of that row profile in the data matrix.

* Invited lecture

W. Gaul et al. (eds.), Classification, Automation, and New Media © Springer-Verlag Berlin, Heidelberg 2002
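
The transformation of raw categorical records into a profile frequency matrix, as described in the Introduction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `profile_frequencies` and the small data set are hypothetical, chosen only to show how each distinct row profile receives a weight equal to the number of times it occurs.

```python
from collections import Counter


def profile_frequencies(rows):
    """Collapse raw categorical records into unique profiles with weights.

    `rows` is an iterable of records, one per object; each record lists the
    category observed on every variable. The result is a sorted list of
    distinct profiles and a parallel list of their occurrence counts.
    (Hypothetical helper, for illustration only.)
    """
    counts = Counter(tuple(r) for r in rows)
    profiles = sorted(counts)
    weights = [counts[p] for p in profiles]
    return profiles, weights


# Hypothetical data: three categorical variables observed for eight objects.
data = [
    ("a", "x", "low"),
    ("a", "x", "low"),
    ("b", "y", "high"),
    ("a", "y", "low"),
    ("b", "y", "high"),
    ("a", "x", "low"),
    ("b", "x", "high"),
    ("a", "y", "low"),
]

profiles, weights = profile_frequencies(data)
for profile, weight in zip(profiles, weights):
    print(profile, weight)
```

The weights sum to the number of objects, so no information about the cell counts is lost; only the ordering of the original records is discarded.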

An alternative to loglinear analysis is a multiple correspondence analysis of this profile frequency matrix (Meulman & Heiser, 1997). Multiple correspondence analysis aims at a simple representation of the profiles as a set of profile points in a low-dimensional space; Meulman and Heiser (1997) have shown that this graphical display also contains possible higher-order interactions. When the number of categories and/or the number of variables is large, however, the number of profiles becomes large as well. In this case, it is useful to see whether the observed profiles can be classified into a limited number of clusters in the representation. Multiple correspondence analysis takes only the nominal (categorical) information into account. In many cases, we would like to maintain the ordinal i