====== Outlier Detection for PCA ======
==== Local Outlier Factor ====
Local Outlier Factor (LOF) is an unsupervised method of finding outliers through a data point's local density, introduced in 2000 by Breunig et al. [[https://doi.org/10.1145/335191.335388|LOF: identifying density-based local outliers]].
The algorithm proceeds as follows (a Python sketch of the full procedure follows this list):

- Calculate the k-nearest neighbours: Using a distance metric (such as Euclidean Distance), calculate the k nearest neighbours for all data points.
- Calculate reachability distance: For each point, calculate the reachability distance between itself and each of its k neighbours, defined as the maximum of:
  - The distance between the 2 points.
  - The distance from the neighbouring point to **its own** k-th nearest neighbour (known as the k-distance of the neighbouring point).

//Note that the use of reachability distance helps to normalize or smooth the output, by reducing the effects of random fluctuations (i.e. random points that are extremely close together).//

The reachability distance from a point p to a neighbour o is therefore:

  reach-dist<sub>k</sub>(p, o) = max( k-distance(o), d(p, o) )

- Calculate local reachability density: For each point, the local reachability density is 1 / (the average reachability distance to its k nearest neighbours), i.e.

  lrd<sub>k</sub>(p) = 1 / ( Σ<sub>o</sub> reach-dist<sub>k</sub>(p, o) / |N<sub>k</sub>(p)| )

where N<sub>k</sub>(p) is the set of k nearest neighbours of p, and the sum runs over the points o in N<sub>k</sub>(p).

- Calculate local outlier factor: For each point, the local outlier factor is the average ratio between the local reachability density of each of its k nearest neighbours and its own local reachability density, i.e.

  LOF<sub>k</sub>(p) = ( Σ<sub>o</sub> lrd<sub>k</sub>(o) / lrd<sub>k</sub>(p) ) / |N<sub>k</sub>(p)|

- Threshold to find outliers: Identify a threshold above which points are treated as outliers. This threshold should sit somewhat above 1, as a LOF greater than 1 indicates that a point has a lower density than its neighbours.
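The sketch below implements these steps with scikit-learn's ''LocalOutlierFactor''. The generated data, the ''n_neighbors'' value, and the decision threshold are illustrative assumptions, not part of the method description:

<code python>
# A minimal LOF sketch; data, n_neighbors, and threshold are illustrative.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # dense inlier cloud
X = np.vstack([X, [[5.0, 5.0]]])       # one clear low-density outlier

# Steps 1-4 (neighbours, reachability distance, lrd, LOF) are computed
# internally by LocalOutlierFactor.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scikit-learn stores the *negated* LOF values, so flip the sign back.
scores = -lof.negative_outlier_factor_

# Step 5: threshold somewhat above 1 (LOF > 1 means lower local density
# than the point's neighbours).
threshold = 1.5
outliers = np.where(scores > threshold)[0]
print(outliers, scores[outliers])
</code>

Note that scikit-learn exposes the scores through ''negative_outlier_factor_'' (negated so that larger values mean more normal), which is why the sign is flipped before thresholding.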
==== Grubbs' Test ====
==== Mahalanobis Distance ====

The Mahalanobis Distance d of a point from a multi-variate normal distribution is:

  d = ( X<sup>T</sup>S<sup>-1</sup>X )<sup>1/2</sup>
Where X is the vector difference of the point from the distribution mean, and S<sup>-1</sup> is the inverse of the variance-covariance matrix of the distribution.
The square of the Mahalanobis Distance (d<sup>2</sup>) follows a chi-square distribution with n degrees of freedom, where n is the number of dimensions of the data.
We produce this statistic for each data point being observed, and choose the null hypothesis to be that the point was drawn from the specified multi-variate normal distribution. We can reject this null hypothesis (at significance level alpha), i.e. conclude that the point is an outlier, if:
  d<sup>2</sup> > X<sup>2</sup><sub>n</sub>(alpha)
- | Where X^2(chi^2) is the chi-square distribution with n dimensions, at significance level alpha. | + | Where X<sup>2</ |
If we successfully reject the null hypothesis, we remove the outlier from the data and calculate the statistic again, continuing until it has been calculated for each data point. We can then recalculate the covariance and repeat the procedure, either until no outliers are detected or until a user-chosen number of iterations has occurred. A sketch of this iterative test is shown below.
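The following is a minimal Python sketch of the iterative test; the ''alpha'' level and the ''max_iter'' cap are illustrative assumptions:

<code python>
# Iterative Mahalanobis-distance outlier removal; alpha and max_iter are
# illustrative assumptions.
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.01, max_iter=10):
    X = np.asarray(X, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    crit = chi2.ppf(1 - alpha, df=X.shape[1])  # critical value X^2_n(alpha)
    for _ in range(max_iter):
        mu = X[keep].mean(axis=0)
        S_inv = np.linalg.inv(np.cov(X[keep], rowvar=False))
        diff = X - mu
        d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # squared distances
        new = keep & (d2 > crit)      # points rejected on this pass
        if not new.any():             # no outliers detected: stop early
            break
        keep[new] = False             # remove outliers, then recompute
    return ~keep                      # boolean mask of detected outliers
</code>

Each pass recomputes the mean and covariance from the surviving points, mirroring the recalculation step described above.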
==== Squared Prediction Error (SPE) ====
//(Figure: an extended version of the Mahalanobis Distance example from above, shown in 3 dimensions.)//
In the context of a PCA analysis, the SPE is a measurement of the PCA reconstruction error (i.e. how far off a PCA reconstruction from a lower-dimensional space is from the ground truth). This is equivalent to the squared distance from the original data point to its projection onto the PCA-reduced k-dimensional hyperplane, and is calculated as follows:
  Q = Σ<sub>i</sub> r<sub>i</sub><sup>2</sup>
Where Q represents the SPE, and r<sub>i</sub> is the i-th component of the residual vector r = x − x̂, the difference between the original data point x and its reconstruction x̂ from the PCA-reduced space.
The SPE complements the Mahalanobis Distance, and is often used in conjunction with it to determine outliers.
The SPE follows a non-central chi-square distribution, which gives an analogous hypothesis test to the Mahalanobis Distance test above.

We produce this statistic for each data point being observed, and choose the null hypothesis to be that the point was drawn from the specified non-central chi-square distribution. We can reject this null hypothesis (at significance level alpha), i.e. conclude that the point is an outlier, if Q is greater than the critical value Q<sub>alpha</sub>:

  Q > Q<sub>alpha</sub>

[[https://…]]

If we successfully reject the null hypothesis, we remove the outlier from the data and calculate the statistic again, continuing until it has been calculated for each data point. We can then recalculate the covariance (and refit the PCA model) and repeat the procedure, either until no outliers are detected or until a user-chosen number of iterations has occurred. A Python sketch of this screening is shown below.
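The sketch below screens points by their SPE using scikit-learn's ''PCA''. The data, the number of retained components, and the use of a simple empirical percentile in place of the analytic Q<sub>alpha</sub> limit are illustrative assumptions:

<code python>
# SPE (Q statistic) screening via PCA reconstruction; the data set,
# component count, and empirical threshold are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X[:3] += rng.normal(scale=8.0, size=(3, 5))      # inject a few gross outliers

pca = PCA(n_components=2).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))  # reconstruction from 2-D scores
Q = np.sum((X - X_hat) ** 2, axis=1)             # SPE: squared reconstruction error

# Simple empirical stand-in for the analytic Q_alpha critical value.
threshold = np.percentile(Q, 99)
print(np.where(Q > threshold)[0])
</code>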
===== Reference =====

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. SIGMOD Rec. 29, 2 (June 2000), 93–104. https://doi.org/10.1145/335191.335388

**Abstract**

For many KDD applications, …

----

Slišković, …

**Abstract**

Demands regarding production efficiency, product quality, safety levels and environment protection are continuously increasing in the process industry. The way to accomplish these demands is to introduce ever more complex automatic control systems which require more process variables to be measured and more advanced measurement systems. Quality and reliable measurements of process variables are the basis for quality process control. Process equipment failures can significantly deteriorate the production process and even cause production outage, resulting in high additional costs. This paper analyzes automatic fault detection and identification of process measurement equipment, i.e. sensors. Different statistical methods can be used for this purpose, in a way that continuously acquired measurements are analyzed by these methods. In this paper, PCA and ICA methods are used for modelling the relationships which exist between process variables, while Hotelling's T<sup>2</sup> and Q statistics are used for fault detection.