sift:tutorials:outlier_detection_with_pca

sift:tutorials:outlier_detection_with_pca [2024/08/28 17:28] – [Mahalanobis Distance and SPE Tests] wikisysop
sift:tutorials:outlier_detection_with_pca [2024/08/28 17:54] (current) – [Mahalanobis Distance and SPE Tests] wikisysop
Line 81:
 We can clearly see an issue with using Euclidean Distance: while the dimensions do not covary (as PCA produces uncorrelated components), PC1 clearly has a larger variance that should be accounted for. We want to remove, automatically and algorithmically, the outliers that don't follow our general observations, and will use the Mahalanobis Distance Test and SPE to do so. Open the {{:sift_outlier_detection.png?20}} **Outlier Detection Using PCA** dropdown on the toolbar, and select "Mahalanobis Distance Test and SPE".
  
-A pop-up window will appear, allowing you to customize some features for both test calculations. For this first test, we should set the parameters as below (in the image):
+A pop-up window will appear, allowing you to customize some features for both test calculations. For this test, we should set the parameters as below (in the image):
  
 {{:Sift_Mahalanobis_Tutorial_MahDialog1.png | Mahalanobis Distance Test Dialog}}
Line 89:
 We choose to use 2 PCs, as the workspace scores are presented in 2 dimensions.
  
-The Outlier alpha value is a tuneable parameter within the Mahalanobis Distance Calculation, which will determine the distance threshold for a data point to be classified as an inlier/outlier. The alpha value corresponds to the statistical significance of the chi-square distribution being compared to (i.e. a lower alpha value corresponds to a higher threshold needed to be an outlier, but more certainty in the results).
+The Outlier alpha value is a tuneable parameter within the Mahalanobis Distance Calculation, which will determine the distance threshold for a data point to be classified as an inlier/outlier. The alpha value corresponds to the statistical significance of the chi-square distribution being compared to (i.e. a lower alpha value corresponds to a higher threshold needed to be an outlier, but more certainty in the results). Within SPE, the alpha parameter is similarly used, but corresponds to a non-central chi-square distribution instead.
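To make the role of alpha concrete, here is a minimal sketch of how a significance level can be turned into a distance threshold. It assumes the squared Mahalanobis distance is compared against the upper-tail critical value of a chi-square distribution whose degrees of freedom equal the number of PCs; Sift's exact implementation is not shown on this page, and the function name `chi2_threshold_2dof` is hypothetical. With 2 PCs the critical value has a closed form, so no statistics library is needed.

```python
import math


def chi2_threshold_2dof(alpha: float) -> float:
    """Upper-tail critical value of a chi-square distribution with 2 degrees of freedom.

    For 2 degrees of freedom the survival function is exp(-x / 2),
    so solving exp(-x / 2) = alpha gives x = -2 * ln(alpha).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return -2.0 * math.log(alpha)


# A lower alpha gives a higher threshold, so fewer points are flagged.
for alpha in (0.10, 0.05, 0.01):
    threshold = chi2_threshold_2dof(alpha)
    print(f"alpha={alpha:.2f} -> squared-distance threshold {threshold:.3f}")
```

Under this assumption, a point would be flagged as an outlier when its squared Mahalanobis distance exceeds the printed threshold for the chosen alpha.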
  
 Upon running the Mahalanobis Distance Test, you should see the following datapoints highlighted (note that we rescaled the PC1 axis to have equivalent spacing to the PC2 axis):
Line 97:
 We can immediately see that there are outliers whose (rough) Euclidean distance from the origin is much smaller than that of some inliers (e.g. points at roughly +/-500 on the PC2 axis are closer to the origin than points at roughly +/-1000 on the PC1 axis). A plain Euclidean distance measure would classify these points differently, but because we used the Mahalanobis Distance Test, we were able to identify those that do not follow the general trend of the data.
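The contrast between the two distance measures can be sketched in a few lines. Because PC scores are uncorrelated, the covariance matrix in PC space is diagonal and the Mahalanobis distance reduces to a variance-scaled Euclidean distance. The standard deviations and the two score pairs below are illustrative values chosen to mimic the plot, not numbers taken from the tutorial data, and the function names are hypothetical.

```python
import math


def euclidean(score):
    """Plain Euclidean distance of a 2D PC score from the origin."""
    return math.hypot(score[0], score[1])


def mahalanobis_pc(score, variances):
    """Mahalanobis distance in PC space.

    PC scores are uncorrelated, so the covariance matrix is diagonal and
    each coordinate is simply scaled by the variance of its component.
    """
    return math.sqrt(sum(s * s / v for s, v in zip(score, variances)))


# Illustrative (assumed) spreads: PC1 varies much more than PC2.
variances = (600.0 ** 2, 150.0 ** 2)

near_point = (0.0, 500.0)   # small Euclidean distance, but far out along PC2
far_point = (1000.0, 0.0)   # large Euclidean distance, but typical along PC1

for name, point in (("near_point", near_point), ("far_point", far_point)):
    print(
        f"{name}: euclidean={euclidean(point):.1f}, "
        f"mahalanobis={mahalanobis_pc(point, variances):.2f}"
    )
```

With these assumed spreads, the point that is closer to the origin in Euclidean terms ends up with the larger Mahalanobis distance, which is the behaviour described above.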
  
 +Upon running the SPE Test, you should see the following datapoints highlighted:
  
 +{{:Sift_Mahalanobis_Tutorial_Results2.png | SPE Test Combined Results}}
  
 +We notice that the outliers are seemingly random: some are far away from the origin, while others are well within the 2D cluster. This is because SPE measures the reconstruction error, not the error within the lower-dimensional representation. We can clearly see this through the PC Reconstruction page, if we plot one of the identified outliers (Sub04::OG_LA_run01.c3d::Frames(419,529)):
 +
 +{{:sift_mahalanobis_tutorial_reconstruction.png | PCA Reconstruction Results}}
 +
 +We can see there are some very clear errors in comparison to the ground truth! If you plot an inlier, you will see significantly fewer errors in this projection.
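The idea behind SPE (commonly expanded as squared prediction error) can be sketched as follows: project a mean-centered sample onto the retained PCs, reconstruct it from those scores, and sum the squared differences to the original. This is a toy example under stated assumptions (3D data, two orthonormal components, already mean-centered samples, hypothetical function names); it is not Sift's implementation, and real inputs would be full waveforms rather than 3-element vectors.

```python
import math


def project(x, components):
    """Scores of a mean-centered sample on each retained (orthonormal) component."""
    return [sum(xi * ci for xi, ci in zip(x, comp)) for comp in components]


def reconstruct(scores, components):
    """Rebuild the sample from its scores using only the retained components."""
    n = len(components[0])
    return [sum(s * comp[i] for s, comp in zip(scores, components)) for i in range(n)]


def spe(x, components):
    """Sum of squared differences between a sample and its reconstruction."""
    x_hat = reconstruct(project(x, components), components)
    return sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))


r = 1.0 / math.sqrt(2.0)
components = [(r, r, 0.0), (0.0, 0.0, 1.0)]  # two retained, orthonormal PCs

well_modelled = (3.0, 3.0, 1.0)    # lies in the retained subspace
poorly_modelled = (1.0, -1.0, 0.1)  # mostly along the discarded direction

for name, sample in (("well_modelled", well_modelled), ("poorly_modelled", poorly_modelled)):
    scores = project(sample, components)
    print(f"{name}: scores={[round(s, 3) for s in scores]}, SPE={spe(sample, components):.3f}")
```

In this toy case the poorly modelled sample has scores close to the origin of the 2D score plot yet a large SPE, which mirrors why SPE outliers can sit well inside the cluster.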
sift/tutorials/outlier_detection_with_pca.1724866106.txt.gz · Last modified: 2024/08/28 17:28 by wikisysop