Lagspike saga

Announcements about major changes in Haven & Hearth.

Re: Lagspike saga

Postby dageir » Sat Mar 02, 2019 7:04 am

Did the problems appear or get worse after the introduction of world 11? Is this world somehow distinct from the others? (I know it is distinct, but is it distinct in any way affecting performance?)
dageir
 
Posts: 1961
Joined: Sat Mar 31, 2012 12:37 pm

Re: Lagspike saga

Postby Granger » Sat Mar 02, 2019 8:17 am

I have some ideas. To avoid too many red herrings, I kindly ask loftar to post the exact kernel version, which schedulers are active for the drives, and the output of smartctl -a for the drives.
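For anyone following along, the three items asked for above could be gathered roughly as sketched below. This is only a sketch: the device names in the comments are assumptions, and `SYSFS_ROOT` is a hypothetical override added here so the function can be tried against a scratch directory instead of the real /sys.

```shell
# Print the active I/O scheduler for one block device.
# The kernel lists the available schedulers in
# /sys/block/<dev>/queue/scheduler and wraps the active one in brackets,
# e.g. "[mq-deadline] none".
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
active_scheduler() {
    dev=$1
    file="${SYSFS_ROOT:-/sys}/block/$dev/queue/scheduler"
    # Keep only the text between the brackets.
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$file"
}

# Example (device names are assumptions; adjust to the actual machine):
#   uname -r                      # exact kernel version
#   active_scheduler nvme0n1      # scheduler of the first NVMe namespace
#   active_scheduler sda          # scheduler of the first HDD
#   sudo smartctl -a /dev/nvme0   # SMART data, as requested
```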
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020
Granger
 
Posts: 9263
Joined: Mon Mar 22, 2010 2:00 pm

Re: Lagspike saga

Postby Vassteel » Sat Mar 02, 2019 8:25 am

Not an IT guy, but I've heard turning it off and turning it back on works.
jorb wrote:Stop shitposting.
Vassteel
 
Posts: 701
Joined: Thu Aug 15, 2013 12:38 pm

Re: Lagspike saga

Postby MagicManICT » Sat Mar 02, 2019 11:34 am

Vassteel wrote:Not an IT guy, but I've heard turning it off and turning it back on works.

I think they did that Thursday?
Opinions expressed in this statement are the authors alone and in no way reflect on the game development values of the actual developers.
MagicManICT
 
Posts: 18437
Joined: Tue Aug 17, 2010 1:47 am

Re: Lagspike saga

Postby eliminator » Sat Mar 02, 2019 12:42 pm

loftar wrote:Rebooting in two minutes.

Good thing my steelbox fueling script triggered just before the reboot.
eliminator
 
Posts: 85
Joined: Fri Mar 25, 2011 8:42 pm

Re: Lagspike saga

Postby The_Blode » Sat Mar 02, 2019 4:35 pm

i have the sinking feeling this is going to be one of those problems that gets 'solved' with no idea as to what fixed it, and then it'll return in a year or so for no reason
this is what my experience with programming and network nonsense tells me. maybe that sort of thing only happens to me
The_Blode
 
Posts: 433
Joined: Sat Oct 08, 2011 7:51 am
Location: Location: Location

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 5:08 pm

Granger wrote:I have some ideas. To avoid too many red herrings, I kindly ask loftar to post the exact kernel version, which schedulers are active for the drives, and the output of smartctl -a for the drives.

The previous kernel was Debian's 4.9.0-8, the new one is their 4.19.0-0.bpo.2. The NVMe drives are using the noop scheduler ("none", that is), and the HDDs are using mq-deadline. smartctl -a is as follows:
$ sudo smartctl -a /dev/nvme0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       THNSN5512GPU7 TOSHIBA
Serial Number:                      273S10WSTUHV
Firmware Version:                   57GA4103
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sat Mar  2 17:06:13 2019 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.1600W       -        -    3  3  3  3     1000    1000
 4 -   0.0120W       -        -    4  4  4  4     5000   35000
 5 -   0.0060W       -        -    5  5  5  5   100000  110000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        57 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    13%
Data Units Read:                    130,882,588 [67.0 TB]
Data Units Written:                 71,648,646 [36.6 TB]
Host Read Commands:                 5,958,767,568
Host Write Commands:                687,281,425
Controller Busy Time:               20,611
Power Cycles:                       23
Power On Hours:                     13,592
Unsafe Shutdowns:                   6
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    11
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               57 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

$ sudo smartctl -a /dev/nvme1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.2-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       THNSN5512GPU7 TOSHIBA
Serial Number:                      273S10WDTUHV
Firmware Version:                   57GA4103
PCI Vendor/Subsystem ID:            0x1179
IEEE OUI Identifier:                0x00080d
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Sat Mar  2 17:06:15 2019 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.1600W       -        -    3  3  3  3     1000    1000
 4 -   0.0120W       -        -    4  4  4  4     5000   35000
 5 -   0.0060W       -        -    5  5  5  5   100000  110000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        51 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    17%
Data Units Read:                    120,388,671 [61.6 TB]
Data Units Written:                 72,558,074 [37.1 TB]
Host Read Commands:                 5,928,728,026
Host Write Commands:                662,031,292
Controller Busy Time:               20,684
Power Cycles:                       22
Power On Hours:                     13,581
Unsafe Shutdowns:                   9
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    37
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               51 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
 
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby Granger » Sat Mar 02, 2019 6:00 pm

I would start by setting the noop scheduler for the HDDs as a quick, on-the-fly test, just to rule out that the problem sits at that level of the I/O stack and that the stall you see is caused by bcache waiting for an HDD.
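A minimal sketch of such an on-the-fly switch follows. It assumes a blk-mq kernel (the HDDs already using mq-deadline suggests this), where the no-op choice is spelled "none" rather than "noop". The device name is an assumption, and `SYSFS_ROOT` is a hypothetical override for dry runs.

```shell
# Switch the I/O scheduler of one block device at runtime.
# The change is not persistent and is lost on reboot.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
set_scheduler() {
    dev=$1
    sched=$2
    file="${SYSFS_ROOT:-/sys}/block/$dev/queue/scheduler"
    # Refuse names the kernel does not offer for this device:
    # strip the brackets, put one name per line, and look for an exact match.
    if ! sed 's/[][]//g' "$file" | tr ' ' '\n' | grep -qxF -- "$sched"; then
        echo "scheduler '$sched' not available for $dev" >&2
        return 1
    fi
    printf '%s\n' "$sched" > "$file"
}

# Example, run as root (sda is an assumed device name):
#   set_scheduler sda none          # quick test
#   set_scheduler sda mq-deadline   # revert afterwards
```

Checking the offered list first means a typo such as "noop" on a blk-mq kernel fails with a readable message before anything is written.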

I would also check if the temperature reading in the smart output is correct, 57°C seems a bit high to me.
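One rough way to put the reading in context is to compare it with the warning threshold the drive itself reports. The sketch below parses `smartctl -a` output from stdin; the field labels are taken from the smartctl 6.6 NVMe output quoted in this thread, and other versions or drive types may label them differently.

```shell
# Read `smartctl -a` output for an NVMe drive on stdin and print the
# reported temperature, the drive's own warning threshold, and the margin
# between the two. Exits non-zero if either field is missing.
temp_margin() {
    awk -F: '
        /^Temperature:/                      { temp = $2 + 0 }
        /^Warning +Comp\. Temp\. Threshold:/ { warn = $2 + 0 }
        END {
            if (temp == "" || warn == "") exit 1
            printf "temp=%dC warn=%dC margin=%dC\n", temp, warn, warn - temp
        }'
}

# Example (device path is an assumption):
#   sudo smartctl -a /dev/nvme0 | temp_margin
```

A positive margin only says the drive is below its own warning level. It does not by itself show whether the sensor reading is accurate.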
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020
Granger
 
Posts: 9263
Joined: Mon Mar 22, 2010 2:00 pm

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 6:12 pm

Granger wrote:I would start by setting the noop scheduler for the HDDs as a quick, on-the-fly test, just to rule out that the problem sits at that level of the I/O stack and that the stall you see is caused by bcache waiting for an HDD.

I doubt the correlation between lagspikes and finding the SSD latencies described in the OP is a coincidence, but sure, it certainly can't hurt to try.

Granger wrote:I would also check if the temperature reading in the smart output is correct, 57°C seems a bit high to me.

I thought so too, and asked Hetzner about it. They said it's normal. >.>
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
 
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby Granger » Sat Mar 02, 2019 6:16 pm

loftar wrote:
Granger wrote:I would also check if the temperature reading in the smart output is correct, 57°C seems a bit high to me.

I thought so too, and asked Hetzner about it. They said it's normal. >.>

Well, as long as the HDDs don't report something similar...

Did I read it right that the issue has moved from the one NVMe drive (where you noticed it) to the other and is still there at the moment?
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020
Granger
 
Posts: 9263
Joined: Mon Mar 22, 2010 2:00 pm
