====== Outlier Detection with PCA ======
Here we notice that (among others) several points in the bottom-left are selected as outliers, and several points that were initially identified as outliers in the top right no longer are. The reason for both of these is simply that they are, or are no longer, far from the mean of the data being considered.
==== Mahalanobis Distance and SPE Tests ====
Automatically finding outliers saves time and effort, and can lead to more objective results, provided it is done according to some criterion or test. Very simple methods could be employed, such as excluding some portion of the data farthest from the mean, or excluding everything beyond some fixed distance. As we will see below, such plain distance measures ignore the fact that variance can differ greatly between dimensions; the Mahalanobis distance accounts for this by measuring distance in units of the data's covariance.
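To make the idea concrete, here is a minimal NumPy sketch of the distance itself; the ''scores'' matrix and function name are our own illustration, not Sift's internals.

<code python>
# Minimal sketch: Mahalanobis distance of each row of `scores` from the mean.
import numpy as np

def mahalanobis_distances(scores):
    """d_i = sqrt((x_i - mean)^T S^-1 (x_i - mean)), with S the covariance."""
    mean = scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    centered = scores - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))

# Toy usage: 50 points whose first dimension varies four times more.
points = np.random.default_rng(0).normal(size=(50, 2)) * [600.0, 150.0]
print(mahalanobis_distances(points)[:5])
</code>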
In Sift, both the Mahalanobis Distance Test and the SPE Test are built on top of the PCA module, finding outliers in the PC workspace scores. As such, we will need to create a PCA analysis (a sketch of the equivalent computation follows the list below). To show the benefits of these tests, we will again be using the same group, with:
  * HipPower_X selected (and all workspaces)
  * 4 PCs calculated
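For intuition only, the equivalent computation outside Sift might look like the sketch below; the data is a random placeholder, and Sift performs this step for you when you run the PCA analysis.

<code python>
# Hypothetical stand-in for the PCA step; Sift performs this internally.
# `traces` is placeholder random data, not real HipPower_X waveforms.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
traces = rng.normal(size=(200, 101))   # one row per trace, 101 samples each

pca = PCA(n_components=4)              # "4 PCs calculated", as above
scores = pca.fit_transform(traces)     # PC workspace scores, shape (200, 4)
print(pca.explained_variance_ratio_)   # variance captured by each PC
</code>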
//(Figure: PC workspace scores, PC1 vs PC2)//
We can clearly see an issue with using Euclidean distance: while the dimensions do not covary (as PCA creates an independent basis), PC1 clearly has a larger variance that should be accounted for. We are interested in automatically and algorithmically removing the outliers that do not follow our general observations, which is exactly the correction the Mahalanobis Distance Test applies.
When you launch the test, a pop-up window will appear, allowing you to customize some of its options:
//(Figure: Mahalanobis Distance Test options dialog)//
We choose to use 2 PCs, as the workspace scores are presented in 2 dimensions.
The Outlier alpha value is a tuneable parameter within the Mahalanobis distance calculation that controls how strict the test is: loosely speaking, it acts as a significance level, so a smaller alpha pushes the cutoff distance further out and flags fewer points as outliers.
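A common construction for this kind of test, and a reasonable guess at what alpha controls here (the tutorial does not spell out Sift's exact formula), compares squared Mahalanobis distances against a chi-squared quantile:

<code python>
# Assumed mechanics (not Sift's documented formula): flag a point when its
# squared Mahalanobis distance exceeds the chi-squared quantile at 1 - alpha,
# with degrees of freedom equal to the number of PCs used.
import numpy as np
from scipy.stats import chi2

alpha = 0.05                             # the tuneable Outlier alpha
n_pcs = 2                                # 2 PCs, as chosen above
cutoff = chi2.ppf(1 - alpha, df=n_pcs)   # squared-distance threshold

d2 = np.array([1.2, 8.4, 0.7, 6.1])      # example squared distances
print(cutoff, d2 > cutoff)               # smaller alpha -> larger cutoff -> fewer outliers
</code>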
Upon running the Mahalanobis Distance Test, you should see the following datapoints highlighted (note that we rescaled the PC1 axis to have equivalent spacing to the PC2 axis):

//(Figure: Mahalanobis Distance Test outliers highlighted in the PC1 vs PC2 workspace)//
We can immediately see that there are outliers whose (rough) Euclidean distance is much smaller than that of some of the inliers: for example, points at roughly +/-500 along the PC2 axis are closer to the origin than points at roughly +/-1000 along the PC1 axis. A plain Euclidean measure would therefore rank them the other way around; the Mahalanobis distance flags the right ones because it accounts for PC1's much larger variance.
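A toy calculation illustrates the point; the variances used are invented stand-ins, not values read from Sift:

<code python>
# Toy version of the effect above. The variances are made-up stand-ins
# chosen so PC1 varies far more than PC2; they are not values from Sift.
import numpy as np

cov_inv = np.linalg.inv(np.diag([600.0**2, 150.0**2]))

def mahal(p):
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(p @ cov_inv @ p))

for p in ([1000, 0], [0, 500]):
    print(p, "Euclidean:", np.linalg.norm(p), "Mahalanobis:", round(mahal(p), 2))
# [1000, 0] is farther in Euclidean terms but milder in Mahalanobis terms,
# because PC1's variance is much larger.
</code>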
Upon running the SPE Test, you should see the following datapoints highlighted:
//(Figure: SPE Test outliers highlighted in the PC1 vs PC2 workspace)//
We notice that the identified outliers appear scattered at random: some are far from the origin, while others sit well within the 2D cluster. This is because SPE (squared prediction error) measures the reconstruction error, not the error within the lower-dimensional representation. We can see this clearly through the PC Reconstruction page if we plot one of the identified outliers (Sub04::…):
+ | |||
//(Figure: PC reconstruction of the outlier trace compared with the original)//
+ | |||
We can see there are some very clear errors in comparison to the ground truth! If you plot an inlier, you will see significantly fewer errors in this projection.
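For reference, here is a sketch of what the SPE Test plausibly computes, assuming the standard squared-reconstruction-residual definition (all names and data below are placeholders):

<code python>
# Sketch of an SPE computation, assuming the standard definition: the squared
# residual between each trace and its reconstruction from the retained PCs.
# `traces` is placeholder data, as in the earlier PCA sketch.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
traces = rng.normal(size=(200, 101))

pca = PCA(n_components=2).fit(traces)
reconstructed = pca.inverse_transform(pca.transform(traces))

spe = np.sum((traces - reconstructed) ** 2, axis=1)   # one SPE value per trace
# A trace can sit near the origin of the 2D workspace yet have a large SPE:
# SPE measures what the 2-PC projection failed to capture, not position in it.
print(spe.argmax(), spe.max())
</code>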