Lagspike saga

by **dageir** » Sat Mar 02, 2019 7:04 am

Did the problems appear or get worse after the introduction of world 11? Is this world somehow distinct from the others? (I know it is distinct, but is it distinct in any way affecting performance?)

by **Granger** » Sat Mar 02, 2019 8:17 am

I have some ideas. To avoid too many red herrings, I kindly ask loftar to post the exact kernel version, which schedulers are active for the drives, and the output of smartctl -a for the drives.
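For anyone following along, here is a rough sketch of how those three pieces of information could be collected on a typical Linux machine. `uname -r`, the sysfs scheduler files and `smartctl -a` are standard; the helper name `active_scheduler` is made up for this example, and nothing here assumes which devices actually exist on the server.

```shell
#!/bin/sh
# Print the active I/O scheduler given the contents of a
# /sys/block/<dev>/queue/scheduler file on stdin. The kernel marks the
# active scheduler with square brackets, e.g. "mq-deadline kyber [none]".
# If there are no brackets, the line is passed through unchanged.
# (active_scheduler is a hypothetical helper name.)
active_scheduler() {
    sed -e 's/.*\[\([^]]*\)].*/\1/'
}

# Exact kernel version.
uname -r

# Active scheduler for every block device that exposes one.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    printf '%s: %s\n' "$dev" "$(active_scheduler < "$f")"
done

# SMART data has to be collected per drive, as root, for example:
#   smartctl -a /dev/nvme0
```

The bracket parsing is kept in its own function so it can be tried on a pasted line without touching a real system.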

by **Vassteel** » Sat Mar 02, 2019 8:25 am

Not an IT guy, but I've heard turning it off and turning it back on works.

by **MagicManICT** » Sat Mar 02, 2019 11:34 am

Vassteel wrote: Not an IT guy, but I've heard turning it off and turning it back on works.

I think they did that Thursday?

by **eliminator** » Sat Mar 02, 2019 12:42 pm

loftar wrote:Rebooting in two minutes.

Good thing my steelbox fueling script triggered just before the reboot.

by **The_Blode** » Sat Mar 02, 2019 4:35 pm

I have the sinking feeling this is going to be one of those problems that gets 'solved' with no idea as to what fixed it, and then it'll return in a year or so for no reason.
This is what my experience with programming and network nonsense tells me. Maybe that sort of thing only happens to me.

by **loftar** » Sat Mar 02, 2019 5:08 pm

Granger wrote: I have some ideas. To avoid too many red herrings, I kindly ask loftar to post the exact kernel version, which schedulers are active for the drives, and the output of smartctl -a for the drives.

The previous kernel was Debian's 4.9.0-8, the new one is their 4.19.0-0.bpo.2. The NVMe drives are using the noop scheduler ("none", that is), and the HDDs are using mq-deadline. smartctl -a is as follows:

Code:

    $ sudo smartctl -a /dev/nvme0
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Number:                       THNSN5512GPU7 TOSHIBA
    Serial Number:                      273S10WSTUHV
    Firmware Version:                   57GA4103
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Sat Mar 2 17:06:13 2019 CET
    Firmware Updates (0x02):            1 Slot
    Optional Admin Commands (0x0007):   Security Format Frmw_DL
    Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
    Warning Comp. Temp. Threshold:      78 Celsius
    Critical Comp. Temp. Threshold:     82 Celsius

    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.00W       -        -    0  0  0  0        0       0
     1 +     2.40W       -        -    1  1  1  1        0       0
     2 +     1.90W       -        -    2  2  2  2        0       0
     3 -   0.1600W       -        -    3  3  3  3     1000    1000
     4 -   0.0120W       -        -    4  4  4  4     5000   35000
     5 -   0.0060W       -        -    5  5  5  5   100000  110000

    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         2
     1 -    4096       0         1

    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
    Critical Warning:                   0x00
    Temperature:                        57 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    13%
    Data Units Read:                    130,882,588 [67.0 TB]
    Data Units Written:                 71,648,646 [36.6 TB]
    Host Read Commands:                 5,958,767,568
    Host Write Commands:                687,281,425
    Controller Busy Time:               20,611
    Power Cycles:                       23
    Power On Hours:                     13,592
    Unsafe Shutdowns:                   6
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning Comp. Temperature Time:     11
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               57 Celsius

    Error Information (NVMe Log 0x01, max 128 entries)
    No Errors Logged

    $ sudo smartctl -a /dev/nvme1
    smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Number:                       THNSN5512GPU7 TOSHIBA
    Serial Number:                      273S10WDTUHV
    Firmware Version:                   57GA4103
    PCI Vendor/Subsystem ID:            0x1179
    IEEE OUI Identifier:                0x00080d
    Controller ID:                      0
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
    Namespace 1 Formatted LBA Size:     512
    Local Time is:                      Sat Mar 2 17:06:15 2019 CET
    Firmware Updates (0x02):            1 Slot
    Optional Admin Commands (0x0007):   Security Format Frmw_DL
    Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
    Warning Comp. Temp. Threshold:      78 Celsius
    Critical Comp. Temp. Threshold:     82 Celsius

    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.00W       -        -    0  0  0  0        0       0
     1 +     2.40W       -        -    1  1  1  1        0       0
     2 +     1.90W       -        -    2  2  2  2        0       0
     3 -   0.1600W       -        -    3  3  3  3     1000    1000
     4 -   0.0120W       -        -    4  4  4  4     5000   35000
     5 -   0.0060W       -        -    5  5  5  5   100000  110000

    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         2
     1 -    4096       0         1

    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
    Critical Warning:                   0x00
    Temperature:                        51 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    17%
    Data Units Read:                    120,388,671 [61.6 TB]
    Data Units Written:                 72,558,074 [37.1 TB]
    Host Read Commands:                 5,928,728,026
    Host Write Commands:                662,031,292
    Controller Busy Time:               20,684
    Power Cycles:                       22
    Power On Hours:                     13,581
    Unsafe Shutdowns:                   9
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning Comp. Temperature Time:     37
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               51 Celsius

    Error Information (NVMe Log 0x01, max 128 entries)
    No Errors Logged

by **Granger** » Sat Mar 02, 2019 6:00 pm

I would start by setting the noop scheduler for the HDDs as a quick, on-the-fly test, just to make sure the problem isn't somewhere at that level of the IO stack, with the stall you see being caused by bcache waiting for an HDD.
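A minimal sketch of what such an on-the-fly switch could look like. On a multi-queue kernel like the 4.19 one mentioned above, the noop equivalent is called `none`. The function name `set_scheduler` and the `SYSFS_ROOT` variable are hypothetical (the latter only exists so the function can be tried against a scratch directory), and `sda` is a placeholder device name, not a claim about the actual server.

```shell
#!/bin/sh
# Switch the I/O scheduler of one block device at runtime by writing the
# scheduler name to its sysfs file. Needs root on a real system, and the
# change does not survive a reboot.
# set_scheduler and SYSFS_ROOT are hypothetical names for this sketch.
set_scheduler() {
    dev=$1
    sched=$2
    file="${SYSFS_ROOT:-/sys}/block/$dev/queue/scheduler"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong device name, or not root?)" >&2
        return 1
    fi
    echo "before: $(cat "$file")"
    echo "$sched" > "$file" || return 1
    echo "after:  $(cat "$file")"
}

# Example (sda is a placeholder device name):
#   set_scheduler sda none
# and to switch back after the test:
#   set_scheduler sda mq-deadline
```

Printing the file before and after makes it easy to confirm that the kernel accepted the new scheduler and to see what to restore afterwards.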

I would also check whether the temperature reading in the SMART output is correct; 57°C seems a bit high to me.
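As a rough way to put that reading in context, the sketch below pulls the current temperature and the drive's own warning threshold out of `smartctl -a` output and prints the margin between them. The helper name `temp_margin` is made up, and the field labels are assumed to match the smartctl 6.6 NVMe output posted above; other versions or drive types may label things differently.

```shell
#!/bin/sh
# Read `smartctl -a` output for an NVMe drive on stdin and print the current
# temperature, the drive's warning threshold, and the difference.
# (temp_margin is a hypothetical helper name; the field labels are those
# seen in the smartctl 6.6 output posted earlier in the thread.)
temp_margin() {
    awk -F: '
        /^Temperature:/                      { temp = $2 + 0 }
        /^Warning +Comp\. Temp\. Threshold:/ { warn = $2 + 0 }
        END {
            if (temp == "" || warn == "") {
                print "fields not found"
                exit 1
            }
            printf "temp=%dC warn=%dC margin=%dC\n", temp, warn, warn - temp
        }'
}

# Example (needs root and a real device):
#   smartctl -a /dev/nvme0 | temp_margin
```

With the values posted for /dev/nvme0 (57 Celsius against a 78 Celsius warning threshold) this would print `temp=57C warn=78C margin=21C`.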

by **loftar** » Sat Mar 02, 2019 6:12 pm

Granger wrote: I would start by setting the noop scheduler for the HDDs as a quick, on-the-fly test, just to make sure the problem isn't somewhere at that level of the IO stack, with the stall you see being caused by bcache waiting for an HDD.

I doubt the correlation between lagspikes and finding the SSD latencies described in the OP is a coincidence, but sure, it certainly can't hurt to try.

Granger wrote: I would also check whether the temperature reading in the SMART output is correct; 57°C seems a bit high to me.

I thought so too, and asked Hetzner about it. They said it's normal. >.>

by **Granger** » Sat Mar 02, 2019 6:16 pm

loftar wrote:
Granger wrote: I would also check whether the temperature reading in the SMART output is correct; 57°C seems a bit high to me.

I thought so too, and asked Hetzner about it. They said it's normal. >.>

Well, as long as the HDDs don't report something similar...

Did I read it right that the issue has moved from the one NVMe (where you noticed it) to the other and is still there at the moment?
