sift:tutorials:outlier_detection_with_pca — last modified 2024/08/28 17:54 by wikisysop
[[Sift:Principal_Component_Analysis:Outlier_Detection_for_PCA#Local_Outlier_Factor|Local Outlier Factor (LOF)]] is an outlier detection method that uses the local density around data points to determine whether a point is an outlier. In this sense it can find outliers that global detection methods would not, as it identifies outliers within local areas.
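The idea can be sketched outside of Sift with scikit-learn's LocalOutlierFactor, which implements the same underlying scoring scheme. This is an illustrative toy example, not Sift's implementation: the clusters and parameter values below are assumptions chosen to show the density effect.

```python
# Illustrative sketch of the LOF idea using scikit-learn (not Sift's code).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(50, 2))   # tight cluster at the origin
sparse = rng.normal(loc=5.0, scale=1.0, size=(50, 2))  # loose cluster at (5, 5)
lone = np.array([[2.5, 2.5]])                          # sits between the two clusters
X = np.vstack([dense, sparse, lone])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_  # LOF score; ~1 means inlier

# Points inside either cluster score near 1; the in-between point's local
# density is far below that of its neighbours, so its score is much higher.
print(scores[:50].mean(), scores[-1])
```

A global distance-to-mean test could rank the in-between point as unremarkable, since both clusters pull the global mean towards it; LOF flags it because no local neighbourhood supports it.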
  
In Sift, LOF is built upon the [[Sift:Principal_Component_Analysis:Using_Principal_Component_Analysis_in_Biomechanics|PCA module]] to find outliers in the PC workspace scores. As such, we will need to create a PCA analysis. To show the benefits of Local Outlier Factor, we will be using the group "HipAngle_Z", as its shape (multiple clusters of varying density) demonstrates the effectiveness of LOF. Specifically, create a PCA with:
  * HipAngle_Z selected (and all workspaces)
  * 4 PCs calculated
  * "Use Workspace Mean" unchecked
  * Named "PCA_HipZ"
  
After calculating the PCA Results, the Workspace Scores on the analyse page should look as follows (note that the points are coloured by group. If they were coloured by workspace, you would see many of these clusters correspond to workspaces):
We can see several distinct clusters in varying positions (and none of which are located at the origin of the plot!). This could cause issues if we were to use "global" outlier detection methods: a data point may appear to be an inlier globally but actually be a local outlier (e.g. if it lies between, but not within, several clusters), or vice versa: a point (or cluster of points) may sit at the edge of the global distribution but well within a local cluster's distribution.
  
Whichever the situation may be, we are interested in finding outliers in our data, and making a decision upon finding them. We will start by running a LOF calculation. To do so, open the {{:sift_outlier_detection.png?20}} **Outlier Detection Using PCA** dropdown on the toolbar, and select "Local Outlier Factor".
  
A pop-up window will appear, allowing you to customize some features of the LOF calculation. For this first test, we should set the parameters as below (in the image):
Here we notice that (among others) several points on the bottom left side are selected as outliers, and several points that were initially identified as outliers in the top right no longer are. The reason for both is simply that they are or aren't outliers in different contexts. The new outliers belong to a workspace for which most data points are not in the immediate local area, even though data points from other workspaces are. Similarly, the new inliers in the top right are determined to be inliers because, when workspaces were combined, multiple workspaces overlapped in that local area, increasing the density for their neighbours. Some of these points were only marginally determined to be outliers (close to a score of 2), which could suggest that the threshold value of 2.0 was a little low.
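The effect of the threshold choice is easy to see in a standalone sketch, again using scikit-learn's LOF as a stand-in for Sift's calculation (the data and parameter values are assumptions for illustration):

```python
# Counting outliers at different LOF score thresholds: raising the
# threshold can only keep or shrink the outlier set, never grow it.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(6.0, 1.0, size=(20, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)
scores = -lof.negative_outlier_factor_

counts = {t: int((scores > t).sum()) for t in (1.5, 2.0, 2.5)}
print(counts)
```

Marginal points (scores just above 2) are exactly the ones that flip between inlier and outlier as the threshold moves, which is why a borderline result is worth a second look.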
  
==== Mahalanobis Distance and SPE Tests ====
  
Automatically finding outliers saves time and effort, and can lead to more objective results if it is done based on some criterion or test. There are very simple methods that could be employed, like finding the data that is farthest from the mean and excluding some portion of it, or excluding data beyond some fixed distance. As we see in the [[Sift:Principal_Component_Analysis:Outlier_Detection_for_PCA#Mahalanobis_Distance_Test|Mahalanobis Distance section of our Outlier Detection methods]], the distance measure we use can be very important: data may be located close to the mean but be outlying in comparison to the general trend of the data, or it may be located far from the mean but in line with the trend, just towards an upper limit along an axis. The Mahalanobis Distance Test is an outlier detection method that uses the underlying covariance relationship of the data to determine outliers which don't follow its general trend.

In conjunction with the Mahalanobis Distance Test, we also have the [[Sift:Principal_Component_Analysis:Outlier_Detection_for_PCA#Squared_Prediction_Error_(SPE)|Squared Prediction Error]]. This is a test which measures how far a prediction is from the real data, or in this case, how far our PC-decomposed representation is from the real data. These two methods work well together: one measures how closely a single observation follows the general trend, and the other measures how its reconstruction compares to the ground truth. This is why we have bundled them together in Sift.
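The complementary roles of the two tests can be sketched with a toy PCA in plain NumPy. This is an illustrative sketch, not Sift's internals: the Mahalanobis distance lives inside the retained score space, while SPE lives in the residual left out of it.

```python
# Toy PCA via SVD, then the two complementary outlier measures.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
X = X - X.mean(axis=0)                 # centre the data, as PCA assumes

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                  # number of retained PCs
scores = X @ Vt[:k].T                  # low-dimensional representation
recon = scores @ Vt[:k]                # back-projection to the original space

# Mahalanobis distance within the score space. PCs are uncorrelated, so
# the covariance reduces to the per-PC variances.
mahal = np.sqrt(((scores ** 2) / scores.var(axis=0)).sum(axis=1))

# SPE: squared error between the reconstruction and the real data.
spe = ((X - recon) ** 2).sum(axis=1)
```

Note that the residual `X - recon` is orthogonal to the retained PCs, so a point can have a small Mahalanobis distance and a large SPE at the same time, or vice versa.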
  
In Sift, both methods are also built upon the PCA module, to find outliers in the PC workspace scores. As such, we will need to create a PCA analysis. To show the benefits of the Mahalanobis Distance Test and SPE, we will be using the group "HipPower_X", as its shape (most data following a specific trend, with some nearby outliers not following that trend) demonstrates the effectiveness of these tests. Specifically, create a PCA with:
  * HipPower_X selected (and all workspaces)
  * 4 PCs calculated
  * "Use Workspace Mean" unchecked
  * Named "PCA_HipX"
  
After calculating the PCA Results, the Workspace Scores on the analyse page should look as follows (note that the points are coloured by group. If they were coloured by workspace, you would see many of these clusters correspond to workspaces):
{{:Sift_Mahalanobis_Tutorial_PCAHipX.png | Hip Power X PCA Results}}
  
We can clearly see an issue with using Euclidean distance: while the dimensions do not covary (as PCA creates an independent basis), PC1 clearly has a larger variance that should be accounted for. We are interested in automatically and algorithmically removing the outliers that don't follow our general observations, and will use the Mahalanobis Distance Test and SPE to do so. To begin, open the {{:sift_outlier_detection.png?20}} **Outlier Detection Using PCA** dropdown on the toolbar, and select "Mahalanobis Distance Test and SPE".
  
A pop-up window will appear, allowing you to customize some features for both test calculations. For this test, we should set the parameters as below (in the image):
  
{{:Sift_Mahalanobis_Tutorial_MahDialog1.png | Mahalanobis Distance Test Dialog}}
We choose to use 2 PCs, as the workspace scores are presented in 2 dimensions.
  
The Outlier alpha value is a tuneable parameter within the Mahalanobis Distance calculation, which determines the distance threshold for a data point to be classified as an inlier/outlier. The alpha value corresponds to the statistical significance of the chi-square distribution being compared against (i.e. a lower alpha value corresponds to a higher threshold needed to be an outlier, but more certainty in the results). Within SPE, the alpha parameter is used similarly, but corresponds to a non-central chi-square distribution instead.
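For intuition, here is how an alpha value translates into a squared-distance cutoff under a chi-square assumption. This is a hypothetical standalone sketch using SciPy; Sift's exact parameterisation may differ, and the non-central chi-square used for SPE needs an additional non-centrality parameter not shown here.

```python
# Mapping alpha to a squared Mahalanobis distance cutoff, assuming one
# chi-square degree of freedom per PC used in the test.
from scipy.stats import chi2

n_pcs = 2
for alpha in (0.05, 0.01, 0.001):
    cutoff = chi2.ppf(1.0 - alpha, df=n_pcs)
    print(f"alpha={alpha}: squared-distance cutoff {cutoff:.2f}")
```

With 2 degrees of freedom the cutoffs are roughly 5.99, 9.21, and 13.82: as described above, lowering alpha raises the bar a point must clear to be called an outlier.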
  
Upon running the Mahalanobis Distance Test, you should see the following datapoints highlighted (note that we rescaled the PC1 axis to have equivalent spacing to the PC2 axis):
We can immediately see that there are outliers whose (rough) Euclidean distance is much smaller than some of the inliers' (e.g. points near the PC1 axis at roughly +/-500 on the PC2 axis are closer than those along the PC2 axis at roughly +/-1000 on the PC1 axis). Using a normal distance measure would classify these differently, but because we used the Mahalanobis Distance Test, we were able to identify the points that do not follow the general trend of the data.
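Plugging in rough numbers from the plot makes the contrast concrete. The standard deviations below are assumed values for illustration, not Sift's fitted covariance:

```python
import numpy as np

# Assumed axis spreads: PC1 varies much more widely than PC2.
sigma = np.array([500.0, 150.0])
cov_inv = np.diag(1.0 / sigma ** 2)

def mahalanobis(p):
    """Mahalanobis distance from the origin, in standard deviations."""
    return float(np.sqrt(p @ cov_inv @ p))

inlier = np.array([1000.0, 0.0])    # far out, but along the wide PC1 trend
outlier = np.array([0.0, 500.0])    # Euclidean-closer, yet off-trend

print(np.linalg.norm(inlier), mahalanobis(inlier))    # 1000.0, 2.0 std devs
print(np.linalg.norm(outlier), mahalanobis(outlier))  # 500.0, ~3.33 std devs
```

The Euclidean ordering (1000 vs. 500) and the Mahalanobis ordering (2.0 vs. 3.33 standard deviations) disagree, which is exactly the behaviour seen in the highlighted plot.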
  
Upon running the SPE Test, you should see the following datapoints highlighted:
  
{{:Sift_Mahalanobis_Tutorial_Results2.png | SPE Test Combined Results}}
  
We notice that the outliers appear scattered at random: some are far from the origin, while others are well within the 2D cluster. This is because SPE measures the reconstruction error, not the error within the lower-dimensional representation. We can see this clearly through the PC Reconstruction page if we plot one of the identified outliers (Sub04::OG_LA_run01.c3d::Frames(419,529)):

{{:sift_mahalanobis_tutorial_reconstruction.png | PCA Reconstruction Results}}

We can see there are some very clear errors in comparison to the ground truth! If you plot an inlier, you will see significantly smaller errors in this projection.
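This behaviour is easy to reproduce in a standalone sketch with plain NumPy (the curves below are synthetic stand-ins for the trace data, not anything from Sift): a curve whose shape falls outside the span of the retained PCs looks ordinary in the score plot, yet has a large reconstruction error.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 101)
rng = np.random.default_rng(3)
base = np.sin(2 * np.pi * t)

# 50 ordinary curves: the same shape with varying amplitude.
curves = [base * rng.normal(1.0, 0.2) for _ in range(50)]
# One odd curve: typical amplitude, plus a higher-frequency wobble that a
# single retained PC cannot represent.
odd = base + 0.5 * np.sin(6 * np.pi * t)
X = np.vstack(curves + [odd])
X = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 1                                   # retain only the dominant PC
scores = X @ Vt[:k].T
spe = ((X - scores @ Vt[:k]) ** 2).sum(axis=1)

# The odd curve's PC1 score is unremarkable, but its SPE dwarfs the rest.
print(abs(scores[-1, 0]), spe[-1], spe[:-1].max())
```

In other words, the SPE test catches shape deviations that the score-space tests are blind to, which is why points deep inside the 2D cluster can still be flagged.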