====== Outlier Detection with PCA ======
==== Data ====
For this tutorial, we will be examining some [[https:// ]] data.

{{: }}
The Load Page

{{: }}
The Explore Page
If you are having trouble with the above instructions, refer to the earlier tutorials.
- | |||
- | \\ | ||
- | |||
==== Local Outlier Factor ====
[[Sift: |Local Outlier Factor (LOF)]] is a density-based outlier detection method: each point's local density is compared to the local densities of its k nearest neighbours, and points that are substantially less dense than their neighbours are flagged as outliers.
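The idea can be sketched outside of Sift with scikit-learn's ''LocalOutlierFactor'' (an illustration only, not Sift's internal implementation; the cluster locations and sizes below are made up):

```python
# Sketch of LOF on 2-D "score" points (illustrative data, not the tutorial's).
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2))   # tight cluster
sparse = rng.normal(loc=(5.0, 5.0), scale=1.5, size=(40, 2))   # loose cluster
isolated = np.array([[10.0, -10.0], [-8.0, 9.0]])              # far from both
points = np.vstack([dense, sparse, isolated])

# k = 20 neighbours, matching the dialog settings used later in this tutorial.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(points)   # -1 = outlier, 1 = inlier
print("outlier indices:", np.where(labels == -1)[0])
```

Points in the loose cluster generally remain inliers: LOF compares each point's density to its neighbours' densities, not to one global threshold.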
In Sift, LOF is built upon the [[Sift: |PCA]] module, finding outliers among the PC workspace scores. As such, we will need to create a PCA analysis with:
  * HipAngle_Z selected (and all workspaces)
  * 4 PCs calculated
  * "Use Workspace Mean" unchecked
  * Named "Hip Angle Z"
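The setup above can be sketched numerically; this uses scikit-learn on made-up traces rather than Sift's HipAngle_Z data:

```python
# Each row is one 101-point trace; "workspace scores" are its PC projections.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
phase = np.linspace(0.0, 2.0 * np.pi, 101)
# 60 made-up traces: a shared curve shape plus per-trace noise.
traces = np.sin(phase) + 0.1 * rng.normal(size=(60, 101))

pca = PCA(n_components=4)            # "4 PCs calculated"
scores = pca.fit_transform(traces)   # centred on the grand mean, i.e. the
                                     # "Use Workspace Mean" box left unchecked
print(scores.shape)                  # (60, 4): one 4-D score per trace
```

Each trace collapses to a single point in score space, which is what the Workspace Scores plot displays (typically PC1 against PC2).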
After calculating the PCA Results, the Workspace Scores on the analyse page should look as follows (note that the points are coloured by group; if they were coloured by workspace, you would see that many of these clusters correspond to workspaces):
{{: }}
Hip Angle Z PCA Results
We can see several distinct clusters, in varying positions (and none of which are located at the origin of the plot!). This could cause issues if we were to use an outlier test that measures distance from a single centre.
Whatever the situation may be, we are interested in finding outliers in our data, and in making a decision upon finding them. We will start by running a LOF calculation. To do so, open the {{: }} LOF dialog.
A pop-up window will appear, allowing you to customize some features of the LOF calculation. For this first test, we should set the parameters as below (in the image):
{{: }}
LOF Dialog
For now, we are looking for outliers from within the entire PCA analysis. We could partition it such that points only calculate their LOF against points in their own group or workspace (which we will show later), but for now we want to look at the whole graph as one, and as such we have selected "Combined".
Upon running the LOF analysis, you should see several datapoints highlighted:
{{: }}
LOF Combined Results
The highlighted points generally follow what we were hoping to see: any points outside of local clusters are identified as outliers. We see that inliers are determined as such regardless of their cluster density (e.g. the cluster in the top left is much less dense than the others, but is still a valid cluster). We can observe that many of these traces (in the explore page) do not necessarily follow the same trends as the rest of the data, and conclude that they are outliers.
{{: }}
Explore Page Results
Because Sift has automatically selected the values identified to be outliers, we can easily exclude them on the explore page (if we didn't already do so through the LOF dialog). Simply right-click on the explore page graph and select the exclude option.
Another way we might want to examine the data is by comparing traces only to their own workspace. Maybe one of the traces was abnormal for that person (and should be removed), but it looks similar to traces from another workspace; with combined groups this might not be identified as an outlier. We thus alter the LOF dialog to be as so:
{{: }}
LOF Dialog
Note that we have left the other parameters the same, as we still think k=20 is a reasonable number of neighbors. Upon running this analysis we see a slightly different set of datapoints chosen (note that the graph is coloured by workspace now, as opposed to group):
{{: }}
LOF Workspace Results
Here we notice that (among others) several points in the bottom left are now selected as outliers, and several points that were initially identified as outliers in the top right no longer are. The reason for both of these is simply that the points are (or are not) outliers relative to their own workspace, even though the opposite held when comparing against the combined data.
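Partitioning by workspace, as described above, amounts to running LOF within each workspace's points separately. A rough sketch with made-up scores (again not Sift's implementation):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# Made-up scores: three "workspaces" of 30 points, each in its own cluster.
workspace = np.repeat([0, 1, 2], 30)
scores = rng.normal(size=(90, 2)) + 5.0 * workspace[:, None]

labels = np.ones(len(scores), dtype=int)
for ws in np.unique(workspace):
    mask = workspace == ws
    # Each point is judged only against the nearest points of its own workspace.
    labels[mask] = LocalOutlierFactor(n_neighbors=20).fit_predict(scores[mask])
print("per-workspace outlier counts:",
      [int((labels[workspace == ws] == -1).sum()) for ws in (0, 1, 2)])
```

A point that sits comfortably inside another workspace's cluster can still be flagged here, because only its own workspace's density matters.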
==== Mahalanobis Distance and SPE Tests ====
Automatically finding outliers saves time and effort, and can lead to more objective results if it is done based on some criterion or test. There are very simple methods that could be employed, like finding the data that are farthest away from the mean and excluding some portion of those, or excluding those that are further than some distance away. As we see in the [[Sift: ]] scores below, however, a plain Euclidean distance ignores the fact that the variance can differ greatly between dimensions; the Mahalanobis distance accounts for this by scaling each dimension by its variance (and covariance).
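The difference between the two distance measures can be sketched directly (illustrative numbers chosen to mimic a high-variance PC1, not the tutorial's data):

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up scores: no covariance, but PC1 spreads ten times wider than PC2.
scores = rng.normal(size=(500, 2)) * np.array([1000.0, 100.0])
mean = scores.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))

def mahalanobis(point):
    """Distance measured in 'standard deviations' of the cloud, per direction."""
    d = point - mean
    return float(np.sqrt(d @ cov_inv @ d))

a = np.array([1000.0, 0.0])  # far in Euclidean terms, but typical for PC1
b = np.array([0.0, 300.0])   # nearer in Euclidean terms, but extreme for PC2
print(mahalanobis(a), mahalanobis(b))  # roughly 1 vs roughly 3
```

Euclidean distance ranks ''a'' as the more extreme point (1000 vs 300 units from the centre); the Mahalanobis distance reverses that ranking, matching the intuition that ''b'' is the real outlier.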
In Sift, both tests are also built upon the PCA module, finding outliers among the PC workspace scores. As such, we will need to create another PCA analysis, this time with:
  * HipPower_X selected (and all workspaces)
  * 4 PCs calculated
  * "Use Workspace Mean" unchecked
  * Named "Hip Power X"
After calculating the PCA Results, the Workspace Scores on the analyse page should look as follows (note that the points are coloured by group; if they were coloured by workspace, you would see that many of these clusters correspond to workspaces):
{{: }}
Hip Power X PCA Results
We can clearly see an issue with using Euclidean distance: while the dimensions do not covary (as PCA creates an independent basis), PC1 clearly has a larger variance that should be accounted for. We are interested in automatically and algorithmically removing the outliers that don't follow our general observations, so we will run the Mahalanobis Distance Test.

A pop-up window will appear, allowing you to customize some features of the test. We should set the parameters as below (in the image):

{{: }}
Mahalanobis Distance Test Dialog
For now, we are looking for outliers from within the entire PCA analysis. We could partition it such that points only calculate their distances against points in their own group or workspace, but for now we want to look at the whole graph as one, and as such we have selected "Combined".
We choose to use 2 PCs, as the workspace scores are presented in 2 dimensions.
The Outlier alpha value is a tuneable parameter within the Mahalanobis Distance calculation; it sets the significance level used to decide how extreme a distance must be before a point is flagged, so smaller alpha values flag fewer points.
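One common way such an alpha maps to a cutoff (assuming multivariate-normal scores; whether Sift uses exactly this rule is not stated here) is via the chi-squared distribution: squared Mahalanobis distances of d-dimensional normal data follow a chi-squared distribution with d degrees of freedom, so points beyond its (1 - alpha) quantile are flagged.

```python
import numpy as np
from scipy.stats import chi2

alpha, n_dims = 0.05, 2                    # 2 PCs, as chosen above
cutoff = chi2.ppf(1.0 - alpha, df=n_dims)  # threshold on squared distance
print(round(cutoff, 2))                    # 5.99

rng = np.random.default_rng(4)
# Squared distances for standard-normal scores (identity covariance).
d2 = np.sum(rng.normal(size=(2000, 2)) ** 2, axis=1)
print(round((d2 > cutoff).mean(), 3))      # close to alpha, i.e. ~5% flagged
```

Raising alpha lowers the cutoff and flags more points; lowering it flags fewer.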
Upon running the Mahalanobis Distance Test, you should see the following datapoints highlighted (note that we rescaled the PC1 axis to have equivalent spacing to the PC2 axis):
{{: }}
Mahalanobis Distance Test Combined Results
We can immediately see that there are outliers whose (rough) Euclidean distance is much smaller than that of some of the inliers (e.g. points near the PC1 axis at roughly +/-500 in the PC2 axis are closer than those along the PC2 axis at roughly +/-1000 in the PC1 axis). Using a normal distance measure would identify these differently, whereas the Mahalanobis distance correctly accounts for the much larger variance along PC1.
The SPE (Squared Prediction Error) test takes a complementary view: rather than measuring distance within the retained PC space, it measures how much of each trace is left unexplained by the retained PCs. Upon running the SPE Test, you should see the following datapoints highlighted:
{{: }}
We notice that the SPE test flags a different set of points than the Mahalanobis Distance Test. To see why, we can plot one of the flagged traces against its reconstruction from the retained PCs:
{{: }}
We can see there are some very clear errors in comparison to the ground truth! If you plot an inlier, you would see significantly fewer errors in this projection.
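The reconstruction-error idea behind SPE can be sketched as follows (scikit-learn on made-up traces; the flagged trace contains a harmonic the PC basis never saw, so it reconstructs poorly):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
phase = np.linspace(0.0, 2.0 * np.pi, 101)
# Normal traces: one shared shape with varying amplitude plus small noise.
normal = np.sin(phase) * (1.0 + 0.1 * rng.normal(size=(80, 1))) \
         + 0.02 * rng.normal(size=(80, 101))
pca = PCA(n_components=4).fit(normal)

def spe(traces):
    """Squared Prediction Error: squared residual left after the 4-PC model."""
    recon = pca.inverse_transform(pca.transform(traces))
    return np.sum((traces - recon) ** 2, axis=1)

odd = (np.sin(phase) + np.sin(3.0 * phase))[None, :]  # shape outside the model
print(spe(normal).max(), spe(odd)[0])  # the odd trace's SPE is far larger
```

A point can have a small Mahalanobis distance in score space yet a large SPE (or vice versa), which is why the two tests flag different traces.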
sift/tutorials/outlier_detection_with_pca.txt · Last modified: 2024/08/28 17:54 by wikisysop