ABSTRACT
Many methods have been developed for detecting multiple outliers in a single multivariate sample, but very few for the case where there may be groups in the data set. We propose a method of simultaneously determining groups (as in cluster analysis) and detecting outliers, which are points that are distant from every group. Our method is an adaptation of the BACON algorithm proposed by Billor, Hadi and Velleman for the robust detection of multiple outliers in a single group of multivariate data. There are two versions of our method, depending on whether or not the groups can be assumed to have equal covariance matrices. The effectiveness of the method is illustrated by its application to two real data sets and further shown by a simulation study for different sample sizes and dimensions for 2 and 3 groups, with and without planted outliers in the data. When the number of groups is not known in advance, the algorithm could be used as a robust method of cluster analysis, by running it for various numbers of groups and choosing the best solution.
CHAPTER ONE
Introduction
The enormous extent of the literature on cluster analysis shows how important it is in many different fields of science to be able to find groups of objects on the basis of multidimensional measurements. To take just one example, it is common in archaeology to come across papers dealing with the clustering of a set of objects (for example, pieces of pottery or glass) on the basis of their chemical composition, in order to identify those with a common place of origin or manufacture. According to Baxter (1999), outliers are typically present in data of this type and tend to cause problems in the application of standard statistical procedures. Hence, it is desirable to identify outliers. However, ‘much statistical methodology dealing with the detection of such outliers is not well suited to archaeometric data that, in the event, consists of two or more groups’ (Baxter, 1999, p. 321). Our purpose in the present study is to develop a method of detecting multivariate outliers that can be applied to data that are expected to have a group structure, although the details of this grouping are not known beforehand. Outliers in this context are points that are remote from every group.
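The working definition above (an outlier is a point remote from every group) can be sketched in a few lines. The sketch below is illustrative only and is not the method proposed in this study: it assumes bivariate data, assumes that each group's mean and covariance matrix are already known, and uses the squared Mahalanobis distance with a chi-square cutoff as the measure of remoteness. The function names and the default `alpha` are hypothetical.

```python
import math


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point from a group.

    `cov` is a 2x2 covariance matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # The inverse of a 2x2 matrix is [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def remote_from_every_group(x, groups, alpha=0.01):
    """True if x lies beyond the cutoff for all groups (assumed rule).

    `groups` is a list of (mean, cov) pairs, assumed known here.
    For 2 degrees of freedom the chi-square quantile with upper-tail
    probability alpha has the closed form -2 * ln(alpha).
    """
    cutoff = -2.0 * math.log(alpha)
    nearest = min(mahalanobis_sq(x, m, s) for m, s in groups)
    return nearest > cutoff
```

A point midway between two well-separated groups is flagged, while a point close to either group centre is not. In practice the group parameters are unknown and must themselves be estimated robustly, which is the difficulty the rest of this study addresses.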
Many methods of detecting outliers in multivariate data have been proposed in the literature, almost all of them applicable to a single sample (Barnett & Lewis, 1994; Rocke & Woodruff, 1996; Penny & Jolliffe, 2001). Caroni (1998) extended the application of Wilks’ well-known test statistic for a single outlier to the case of samples from several subpopulations with a common covariance matrix, but she assumed that the number of populations was known. Thus, her results are not applicable to the situation in which it is also necessary to carry out a cluster analysis, with an unknown number of groups. A method given by Wang et al. (1997) for detecting multiple outliers from a mixture distribution was also restricted, since it required the existence of a set of points whose group membership was known, although Sain et al. (1999) removed this restriction. Recently, Hardin & Rocke (2004) have proposed another method for the problem we are considering here. The only other methods in the literature that appear to have the same purpose as the one we introduce here are those of Glascock (1992) and Beier & Mommsen (1994).
It is important that the method should be able to detect any number of outliers, not just a prespecified number. A procedure for doing this in a single group, using Wilks’ statistic, was given by Caroni & Prescott (1992) but it is not obvious how it can be extended to the case examined here. Instead, we propose a method based on a computationally efficient robust method of outlier detection in a single group, given the name BACON by Billor et al. (2000). We extend the BACON algorithm to grouped multivariate data below. The application of the proposed method is illustrated using two published real data sets and a simulation study of its performance for various configurations of data is carried out.
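To give a feel for the single-group idea that the method builds on, the following is a simplified, BACON-style sketch and not the published algorithm of Billor et al. (2000). It starts from a small subset of points presumed to be free of outliers, then repeatedly re-estimates the mean and covariance from the current subset and re-admits every point whose squared Mahalanobis distance falls below a cutoff, stopping when the subset no longer changes. The restriction to bivariate data, the initial subset size `m`, the choice of initial subset (points nearest the coordinatewise median), the plain chi-square cutoff at level `alpha / n` with no correction factor, and all function names are assumptions made for illustration.

```python
import math
import statistics


def mean_and_cov(points):
    """Sample mean and covariance (sxx, sxy, syy) of bivariate points."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    sxx = sum((p[0] - mx) ** 2 for p in points) / (m - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (m - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (m - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(p, mean, cov):
    """Squared Mahalanobis distance using the explicit 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


def bacon_like_outliers(points, alpha=0.05, m=8, max_iter=100):
    """Return indices of points left outside the final subset.

    Simplified sketch: degenerate subsets (fewer than 3 points or a
    singular covariance matrix) are not handled.
    """
    n = len(points)
    # Chi-square quantile for 2 d.f. at upper-tail level alpha / n.
    cutoff = -2.0 * math.log(alpha / n)
    med = (statistics.median(p[0] for p in points),
           statistics.median(p[1] for p in points))
    # Assumed initial subset: the m points nearest the coordinatewise median.
    order = sorted(range(n), key=lambda i: (points[i][0] - med[0]) ** 2
                   + (points[i][1] - med[1]) ** 2)
    subset = set(order[:m])
    for _ in range(max_iter):
        mean, cov = mean_and_cov([points[i] for i in sorted(subset)])
        new_subset = {i for i in range(n)
                      if mahalanobis_sq(points[i], mean, cov) < cutoff}
        if new_subset == subset:  # subset is stable, so stop
            break
        subset = new_subset
    return [i for i in range(n) if i not in subset]
```

Because the estimates are computed only from the current subset, gross outliers that are excluded at the start cannot inflate the covariance and mask one another, and the number of outliers does not have to be specified in advance.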