====== Outlier Detection for PCA ======

Outlier detection encompasses a variety of statistical methods that look for data which is not representative of the dataset to which it belongs. These outliers (or anomalies) may then be further analyzed, or simply discarded. There are a variety of methods to do this, including supervised and unsupervised methods. Here we describe some common detection methods, all of which have been implemented in Sift. Specifically, these methods are used to detect outliers using [[Sift:Application:Analyse_Page#Workspace_Scores|PCA Workspace Scores]], the scores obtained by applying the PCA loading vectors to the original waveforms. This is computationally much cheaper than using the original waveforms, as PCA significantly reduces the dimensionality of the data while retaining much of the variance. More information about PCA can be found on the [[Sift:Principal_Component_Analysis:Using_Principal_Component_Analysis_in_Biomechanics|PCA page]].
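As a rough illustration of what these workspace scores are, the sketch below (assumed array names, not Sift's code) projects mean-centred waveforms onto a handful of PCA loading vectors obtained from an SVD:

<code python>
# Illustrative sketch only (assumed array names, not Sift's implementation):
# PCA workspace scores as the projection of mean-centred waveforms onto the
# PCA loading vectors.
import numpy as np

waveforms = np.random.default_rng(0).normal(size=(50, 101))  # e.g. 50 trials x 101 frames
centred = waveforms - waveforms.mean(axis=0)

# Loading vectors from the singular value decomposition of the centred data.
_, _, vt = np.linalg.svd(centred, full_matrices=False)
loadings = vt[:5].T                # keep the first 5 principal components (101 x 5)

scores = centred @ loadings        # workspace scores: one 5-dimensional point per trial
</code>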
  
===== Unsupervised Methods =====
  
==== Local Outlier Factor ====
  
Local Outlier Factor (LOF) is an unsupervised method of finding outliers through a data point's local density, introduced in 2000 by Breunig et al. [[https://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf|[1]]]. It compares the local density of each point to that of its local neighbours, and calculates a Local Outlier Factor as the average ratio of the neighbours' densities to its own. By doing so, it can find local outliers that a global method might not find.
  
{{:Sift_LOF_example.png}}
The LOF is a statistic calculated for each point in the dataset. Values below 1 represent inliers (the point sits in a denser neighbourhood than its k neighbours), while outliers are determined by a threshold above 1, which should be tuned depending on the data being used. For more information on the implementation and guidelines for using this method, see the [[Sift:Tutorials:Outlier_Detection_with_PCA#Local_Outlier_Factor|LOF section of our PCA Outlier Detection Tutorial]].
  
=== Algorithm ===
  
  - Calculate the k-nearest neighbours: \\ Using a distance metric (such as Euclidean distance), calculate the k nearest neighbours for each data point.
  - Calculate reachability distance: \\ For each point, calculate the reachability distance between itself and each of its k neighbours, defined as the maximum of:
    - The distance between the 2 points.
    - The distance from the neighbouring point to **its own** k-th nearest neighbour (known as the k-distance of the neighbouring point). \\ //Note that the use of reachability distance helps to normalize or smooth the output, by reducing the effect of random fluctuations (i.e. random points that are extremely close together).// \\ The reachability distance from p1 to o, and from p2 to o, is shown below: \\ {{:Sift_LOF__reach-dist.png}}
  - Calculate local reachability density: \\ For each point, the local reachability density is 1 / (the average reachability distance to its k nearest neighbours), i.e.: \\ {{:Sift_LOF__reach-density.png}}
  - Calculate the local outlier factor: \\ For each point, the local outlier factor is the average ratio between the local reachability density of each of its k nearest neighbours and its own local reachability density, i.e.: \\ {{:Sift_LOF_lof.png}}
  - Threshold to find outliers: \\ Identify a threshold above which points are considered outliers. This threshold should be above 1 (as a LOF > 1 indicates lower density than the point's neighbours), and should be chosen on a case-by-case basis. An automated way to choose a threshold is to run an iterated one-sided Grubbs' outlier test on the LOF values, identifying outliers at a significance level alpha. A NumPy sketch of the full LOF calculation is shown after this list.
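The sketch below walks through these steps with NumPy on an assumed array of PCA workspace scores; it is illustrative only, not Sift's implementation:

<code python>
# Minimal LOF sketch following the steps above. `scores` is assumed to be an
# (n_samples, n_components) array of PCA workspace scores. Illustrative only.
import numpy as np

def local_outlier_factor(scores, k=20):
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(scores[:, None, :] - scores[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbour
    knn = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours
    k_dist = np.take_along_axis(dists, knn, axis=1)[:, -1]   # k-distance of each point

    # Reachability distance to each neighbour o: max(distance(p, o), k-distance(o)).
    reach = np.maximum(np.take_along_axis(dists, knn, axis=1), k_dist[knn])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: average ratio of the neighbours' densities to the point's own density.
    return (lrd[knn] / lrd[:, None]).mean(axis=1)

# Example: flag points whose LOF exceeds a hand-picked threshold of 1.5.
rng = np.random.default_rng(0)
scores = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])
lof = local_outlier_factor(scores, k=10)
print(np.where(lof > 1.5)[0])   # indices of suspected outliers
</code>

For real use, scikit-learn's ''sklearn.neighbors.LocalOutlierFactor'' provides an optimized equivalent of this calculation.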
  
==== Grubbs' Outlier Test ====
  
The Grubbs' outlier test is a method of finding a single outlier in univariate, normally distributed data. The test statistic represents the largest deviation from the sample mean (in terms of the standard deviation), and is compared against a critical value derived from the t-distribution at the stated significance level. The statistic is calculated as follows (for a one-sided, maximal-value test):
If we successfully reject the null hypothesis, we remove the outlier from the data and recalculate the statistic on the remaining data. We can continue this until no outliers are detected, or stop after X outliers have been identified (X being up to the user).
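A sketch of this iterated, one-sided Grubbs' test is shown below (illustrative only, not Sift's implementation; the critical value uses the standard t-distribution-based formula for the Grubbs' test):

<code python>
# Iterated one-sided (maximal-value) Grubbs' test, e.g. for thresholding LOF values.
# Illustrative sketch only, not Sift's implementation.
import numpy as np
from scipy import stats

def iterated_grubbs(values, alpha=0.05, max_outliers=None):
    data = np.asarray(values, dtype=float)
    idx = np.arange(data.size)               # original indices of the remaining points
    outliers = []
    while data.size > 2:
        n = data.size
        g = (data.max() - data.mean()) / data.std(ddof=1)    # Grubbs' statistic
        # One-sided critical value at significance level alpha.
        t_crit = stats.t.ppf(1 - alpha / n, n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
        if g <= g_crit:
            break                             # no further outliers detected
        i = int(data.argmax())
        outliers.append(int(idx[i]))          # record and drop the detected outlier
        data, idx = np.delete(data, i), np.delete(idx, i)
        if max_outliers is not None and len(outliers) >= max_outliers:
            break
    return outliers
</code>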
  
==== Mahalanobis Distance Test ====
  
The Mahalanobis distance is a distance measure between a point and a distribution which accounts for the covariance between the dimensions of the distribution; it essentially measures the distance while accounting for dependencies between dimensions and for the variance along each dimension. This can be useful for finding data points which are close to the inlier data, but do not follow the general trend observed.
Where X is the vector difference of the point from the distribution mean, and S<sup>-1</sup> is the inverse of the variance-covariance matrix of the distribution.
  
The square of the Mahalanobis Distance (d<sup>2</sup>) follows a chi-square distribution, and as such we can use this as a test statistic: the Mahalanobis Distance Test.
  
We produce this statistic for each data point being observed, and choose the null hypothesis to be that the point was drawn from the specified multivariate normal distribution. We can reject this null hypothesis at significance level alpha (i.e. conclude the point is an outlier) if:
{{:Sift_Mahalanobis_chi_square.png}}
  
Where X<sup>2</sup> (chi<sup>2</sup>) is the chi-square distribution with n degrees of freedom (n being the number of dimensions), at significance level alpha.
  
If we successfully reject the null hypothesis, we remove the outlier from the data and recalculate the statistic, continuing until we have tested each data point. We can then recalculate the covariance on the remaining data, and repeat this until no outliers are detected, or stop after X iterations have occurred (X being up to the user).
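A minimal sketch of this test on a set of PCA workspace scores is shown below (assumed array names, illustrative only, not Sift's implementation):

<code python>
# Mahalanobis distance test on PCA workspace scores (assumed shape:
# n_samples x n_components). Illustrative sketch only, not Sift's implementation.
import numpy as np
from scipy import stats

def mahalanobis_outliers(scores, alpha=0.05):
    mean = scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    diff = scores - mean
    # Squared Mahalanobis distance d^2 = (x - mu)^T S^-1 (x - mu) for each sample.
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    # Critical value of the chi-square distribution with n_components degrees of freedom.
    threshold = stats.chi2.ppf(1 - alpha, df=scores.shape[1])
    return np.where(d2 > threshold)[0]        # indices of suspected outliers
</code>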
  
==== Squared Prediction Error (SPE) ====
  
SPE is a distance measure between the true measurement of a data point and its predicted measurement. Unlike the Mahalanobis Distance, it does not account for the variance along each dimension, but measures the actual (squared) Euclidean distance. This is useful for finding data that is spatially far from its predicted value, even if it is well within a specified trend.
{{:Sift_SPE_Example.png}}
  
The example above shows an extended version of the Mahalanobis Distance example from above (in 3 dimensions), with a new data point "Outlier 3" in teal, and its projection into the original 2 dimensions in yellow.
  
In the context of a PCA analysis, the SPE is a measurement of the PCA reconstruction error (i.e. how far a PCA reconstruction from a lower-dimensional space is from the ground truth). This is equivalent to the squared distance from the original data to its projection onto the reduced PCA hyperplane spanned by the retained principal components, and is calculated as follows:
  
{{:Sift_SPE.png}}
  
Where Q represents the SPE, r<sub>i</sub> is the ith residual value (between the projection and the sample), x<sub>i</sub> is the sample, I is the identity matrix, and P<sub>l</sub> is the projection matrix onto l principal components.
  
The SPE complements the Mahalanobis Distance, and is often used in conjunction with it to determine outliers.
  
The SPE follows a non-central chi-square distribution, and as such we can use this as a test statistic: the SPE Test.

We produce this statistic for each data point being observed, and choose the null hypothesis to be that the point was drawn from the specified non-central chi-square distribution. We can reject this null hypothesis at significance level alpha (i.e. conclude the point is an outlier) if Q is greater than Q<sub>a</sub>:

{{:Sift_Mahalanobis_noncentral_chi_square.png}}

[[https://hrcak.srce.hr/file/117623|[2]]]

If we successfully reject the null hypothesis, we remove the outlier from the data and recalculate the statistic, continuing until we have tested each data point. We can then recalculate the covariance on the remaining data, and repeat this until no outliers are detected, or stop after X iterations have occurred (X being up to the user).
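A minimal sketch of computing the SPE as a PCA reconstruction error is shown below (assumed array names, illustrative only, not Sift's implementation); the thresholding against the non-central chi-square limit is omitted:

<code python>
# SPE (Q statistic) as the PCA reconstruction error of each sample.
# Illustrative sketch only (assumed array names), not Sift's implementation.
import numpy as np

def squared_prediction_error(data, n_components):
    centred = data - data.mean(axis=0)
    # Loading vectors from the SVD; columns of P span the retained subspace.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    P = vt[:n_components].T                   # (n_features, n_components)
    residual = centred - centred @ P @ P.T    # part not captured by the l components
    return np.sum(residual**2, axis=1)        # Q_i = ||r_i||^2 for each sample

# Example: inspect the samples with the largest reconstruction error.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 101))             # e.g. 50 waveforms with 101 frames each
q = squared_prediction_error(data, n_components=3)
print(np.argsort(q)[-3:])                     # indices of the 3 largest SPE values
</code>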
===== Reference =====
  
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. SIGMOD Rec. 29, 2 (June 2000), 93–104. https://doi.org/10.1145/335191.335388

**Abstract**

For many KDD applications, such as detecting criminal activities in E-commerce, finding the rare instances or the outliers, can be more interesting than finding the common patterns. Existing work in outlier detection regards being an outlier as a binary property. In this paper, we contend that for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. This degree is called the local outlier factor (LOF) of an object. It is local in that the degree depends on how isolated the object is with respect to the surrounding neighborhood. We give a detailed formal analysis showing that LOF enjoys many desirable properties. Using real-world datasets, we demonstrate that LOF can be used to find outliers which appear to be meaningful, but can otherwise not be identified with existing approaches. Finally, a careful performance evaluation of our algorithm confirms that our approach of finding local outliers can be practical.

----

Slišković, Dražen, Ratko Grbić, and Željko Hocenski. "Multivariate statistical process monitoring." Tehnički vjesnik 19.1 (2012): 33-41.

**Abstract**

Demands regarding production efficiency, product quality, safety levels and environment protection are continuously increasing in the process industry. The way to accomplish these demands is to introduce ever more complex automatic control systems which require more process variables to be measured and more advanced measurement systems. Quality and reliable measurements of process variables are the basis for the quality process control. Process equipment failures can significantly deteriorate production process and even cause production outage, resulting in high additional costs. This paper analyzes automatic fault detection and identification of process measurement equipment, i.e. sensors. Different statistical methods can be used for this purpose in a way that continuously acquired measurements are analyzed by these methods. In this paper, PCA and ICA methods are used for relationship modelling which exists between process variables while Hotelling's (T<sup>2</sup>) and SPE statistics are used for fault detection because they provide an indication of unusual variability within and outside normal process workspace. Contribution plots are used for fault identification. The algorithms for the statistical process monitoring based on PCA and ICA methods are derived and applied to the two processes of different complexity. Apart from that, their fault detection ability is mutually compared.
  
  