Disable Seek & Read error rate attribute analysis. Causes issues with Seagate Ironwolf drives.

Added documentation.
3 years ago · ecf7a447a7
parent f8e61af2f9
commit ecf7a447a7
2 changed files with 40 additions and 101 deletions
--- a/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md
+++ b/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md
@@ -182,9 +182,16 @@ If you hover over the "failed" label beside an attribute, Scrutiny will tell you

 ### Device failed but Smart & Scrutiny passed

-Device SMART results are the source of truth for Scrutiny, however we don't just take into account the current SMART results, but also histrical analysis of a disk.
+Device SMART results are the source of truth for Scrutiny, however we don't just take into account the current SMART results, but also historical analysis of a disk.
 This means that if a device is marked as failed at any point in its history, it will continue to be stored in the database as failed until the device is removed (or status is reset -- see below).

+In some cases, this historical failure may have been due to attribute analysis/thresholds that have since been relaxed:
+
+- NVME - Numb Error Log Entries (v0.4.7)
+- ATA - Power Cycle Count (v0.4.7)
+- ATA - Read Error Rate (v0.4.13)
+- ATA - Seek Error Rate (v0.4.13)
+
 If you'd like to reset the status of a disk (to healthy) and allow the next run of the collector to determine the actual status, you can run the following command:

 ```bash
@@ -204,9 +211,41 @@ UPDATE devices SET device_status = null
 .exit
 ```

+### Seagate Drives Failing
+
+As thoroughly discussed in #255, Seagate (Ironwolf & others) drives are almost always marked as failed by Scrutiny. 
+
+> The `Seek Error Rate` & `Read Error Rate` attribute raw values are typically very high, and the 
+> normalised values (Current / Worst / Threshold) are usually quite low. Despite this, the numbers in most cases are perfectly OK.
+> 
+> The anxiety arises because we intuitively expect that the normalised values should reflect a "health" score, with 
+> 100 being the ideal value. Similarly, we would expect that the raw values should reflect an error count, in 
+> which case a value of 0 would be most desirable. However, Seagate calculates and applies these attribute values 
+> in a counterintuitive way.
+> 
+> http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
+
+Analysis shows that Seagate drives break common SMART conventions, which in turn causes Scrutiny's
+comparison against BackBlaze data to flag these drives as failed.
+
+**So what's the Solution?**
+
+After taking a look at the BackBlaze data for the relevant attributes (`Seek Error Rate` & `Read Error Rate`), I've decided
+to disable Scrutiny analysis for them. Both are non-critical and have low correlation with failure.
+
+> Please note: SMART failures for these attributes will still cause the drive to be marked as failed. Only BackBlaze analysis has been disabled.
+
+If this is affecting your drives, you'll need to do the following:

+1. Upgrade to v0.4.13+
+2. Reset your drive status using the SQLite script in [#device-failed-but-smart--scrutiny-passed](https://github.com/AnalogJ/scrutiny/blob/master/docs/TROUBLESHOOTING_DEVICE_COLLECTOR.md#device-failed-but-smart--scrutiny-passed)
+3. Wait for (or manually start) the collector.

+If you'd like to learn more about how the Seagate Ironwolf SMART attributes work under the hood, and how they differ from
+other drives, please read the following:

+- http://www.users.on.net/~fzabkar/HDD/Seagate_SER_RRER_HEC.html
+- https://www.truenas.com/community/threads/seagate-ironwolf-smart-test-raw_read_error_rate-seek_error_rate.68634/

 ## Hub & Spoke model, with multiple Hosts.

--- a/webapp/backend/pkg/thresholds/ata_attribute_metadata.go
+++ b/webapp/backend/pkg/thresholds/ata_attribute_metadata.go
@@ -36,56 +36,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
 		Ideal:       ObservedThresholdIdealLow,
 		Critical:    false,
 		Description: "(Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
-		ObservedThresholds: []ObservedThreshold{
-			{
-				Low:               80,
-				High:              95,
-				AnnualFailureRate: 0.8879749768303985,
-				ErrorInterval:     []float64{0.682344353388663, 1.136105732920724},
-			},
-			{
-				Low:               95,
-				High:              110,
-				AnnualFailureRate: 0.034155719633986996,
-				ErrorInterval:     []float64{0.030188482024981093, 0.038499386872354435},
-			},
-			{
-				Low:               110,
-				High:              125,
-				AnnualFailureRate: 0.06390002135229157,
-				ErrorInterval:     []float64{0.05852004676110847, 0.06964160930553712},
-			},
-			{
-				Low:               125,
-				High:              140,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               140,
-				High:              155,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               155,
-				High:              170,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               170,
-				High:              185,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               185,
-				High:              200,
-				AnnualFailureRate: 0.044823775021490854,
-				ErrorInterval:     []float64{0.032022762038723306, 0.06103725943096589},
-			},
-		},
 	},
 	2: {
 		ID:          2,
@@ -290,56 +240,6 @@ var AtaMetadata = map[int]AtaAttributeMetadata{
 		Ideal:       "",
 		Critical:    false,
 		Description: "(Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.",
-		ObservedThresholds: []ObservedThreshold{
-			{
-				Low:               58,
-				High:              76,
-				AnnualFailureRate: 0.2040131025936549,
-				ErrorInterval:     []float64{0.17032852883286412, 0.2424096283327138},
-			},
-			{
-				Low:               76,
-				High:              94,
-				AnnualFailureRate: 0.08725919610118257,
-				ErrorInterval:     []float64{0.08077138510999876, 0.09412943212007528},
-			},
-			{
-				Low:               94,
-				High:              112,
-				AnnualFailureRate: 0.01087335627722523,
-				ErrorInterval:     []float64{0.008732197944943352, 0.013380600544561905},
-			},
-			{
-				Low:               112,
-				High:              130,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               130,
-				High:              148,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               148,
-				High:              166,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               166,
-				High:              184,
-				AnnualFailureRate: 0,
-				ErrorInterval:     []float64{0, 0},
-			},
-			{
-				Low:               184,
-				High:              202,
-				AnnualFailureRate: 0.05316285755900475,
-				ErrorInterval:     []float64{0.03370069132942804, 0.07977038905848267},
-			},
-		},
 	},
 	8: {
 		ID:          8,
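
To illustrate the effect of removing the `ObservedThresholds` entries in the two hunks above, here is a minimal Go sketch. It is not Scrutiny's actual evaluation code: the `evaluate` function, the `hypotheticalCutoff` constant, the half-open bucket matching, and the sample values (70, 25, and a manufacturer threshold of 30) are assumptions made for illustration. Only the 58 to 76 bucket and its `AnnualFailureRate` are copied from the removed `Seek Error Rate` data. The sketch shows the behaviour the documentation describes: with no observed buckets the BackBlaze-based check has nothing to match, while the manufacturer SMART threshold still applies.

```go
package main

import "fmt"

// ObservedThreshold mirrors only the fields visible in this diff.
// The real Scrutiny type may carry additional fields.
type ObservedThreshold struct {
	Low               int64
	High              int64
	AnnualFailureRate float64
}

// hypotheticalCutoff is an illustrative value, not taken from Scrutiny's source.
const hypotheticalCutoff = 0.20

// evaluate is a hypothetical stand-in for attribute analysis.
// It applies the manufacturer SMART threshold first, then looks for an
// observed bucket whose annual failure rate reaches the cutoff.
func evaluate(value, smartThreshold int64, observed []ObservedThreshold) string {
	// The manufacturer check does not depend on observed buckets.
	if value <= smartThreshold {
		return "failed (SMART)"
	}
	// Assumption: buckets are half-open, [Low, High).
	for _, b := range observed {
		if value >= b.Low && value < b.High && b.AnnualFailureRate >= hypotheticalCutoff {
			return "failed (observed)"
		}
	}
	return "passed"
}

func main() {
	// Bucket copied from the removed Seek Error Rate data in this commit.
	before := []ObservedThreshold{
		{Low: 58, High: 76, AnnualFailureRate: 0.2040131025936549},
	}
	// After this commit the attribute has no observed buckets.
	var after []ObservedThreshold

	fmt.Println(evaluate(70, 30, before)) // bucket matches and rate reaches the cutoff
	fmt.Println(evaluate(70, 30, after))  // nothing to match
	fmt.Println(evaluate(25, 30, after))  // manufacturer threshold still applies
}
```

Under these assumptions, a normalised value of 70 is flagged only while the 58 to 76 bucket exists, which is consistent with the note that SMART failures for these attributes are still reported after the change.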