Lagspike saga

Announcements about major changes in Haven & Hearth.

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 6:16 pm

Granger wrote:Well, as long as the HDDs don't report similar...

No, they're quite normal, at around 35 °C.

Granger wrote:Did I read it right that the issue has moved from the one NVMe drive (where you noticed it) to the other, and is still there at the moment?

Indeed.
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby Granger » Sat Mar 02, 2019 6:19 pm

Have you looked at them with the nvme tool available from the nvme-cli package? Could you post the output for both?
Interesting stuff could be in 'Thermal Throttle Status'.

Also, according to https://www.percona.com/blog/2017/02/09 ... sh-health/:
"Warning Temperature Time/Critical Temperature Time. The time in minutes a device operated above a warning or critical temperature. It should be zeroes."
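A minimal sketch of how those two counters could be checked automatically, assuming the drive's SMART log has been dumped as JSON (for example with `nvme smart-log <dev> -o json`). The field names `warning_temp_time` and `critical_comp_time`, and the temperature being reported in Kelvin, are assumptions about the nvme-cli JSON output and should be verified against the installed version. The sample values are made up for illustration.

```python
import json

# Assumed JSON field names for "Warning Temperature Time" and
# "Critical Composite Temperature Time"; check your nvme-cli version.
THERMAL_FIELDS = ("warning_temp_time", "critical_comp_time")


def thermal_findings(smart_log: dict) -> dict:
    """Return the thermal time counters that are present and non-zero."""
    return {k: smart_log[k] for k in THERMAL_FIELDS if smart_log.get(k, 0) != 0}


def kelvin_to_celsius(kelvin: int) -> int:
    """Convert an integer Kelvin reading to whole degrees Celsius."""
    return kelvin - 273


# Hypothetical sample, not real output from the drives in this thread.
sample = json.loads(
    '{"temperature": 330, "warning_temp_time": 0, "critical_comp_time": 12}'
)
print(thermal_findings(sample))
print(kelvin_to_celsius(sample["temperature"]))
```

An empty result from `thermal_findings` would mean both counters are zero, which is what the quoted article says to expect from a healthy drive.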
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020
Granger
Posts: 9263
Joined: Mon Mar 22, 2010 2:00 pm

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 6:22 pm

Granger wrote:Have you looked at them with the nvme tool available from the nvme-cli package? Could you post the output for both?

I have, but I haven't found anything that obviously stands out. Is there any particular subcommand you have in mind? The error log is empty, if that's what you have in mind.
"Object-oriented design is the roman numerals of computing." -- Rob Pike

Re: Lagspike saga

Postby Granger » Sat Mar 02, 2019 6:23 pm

Basically NVMe SMART attributes (smartctl -A, for smartmontools>=6.5).
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 6:26 pm

Granger wrote:Basically NVMe SMART attributes (smartctl -A, for smartmontools>=6.5).

I don't see any real difference at all between smartctl -a and nvme smart-log for these drives; they contain the exact same values. The drives don't seem to support nvme smart-log-add.
"Object-oriented design is the roman numerals of computing." -- Rob Pike

Re: Lagspike saga

Postby shubla » Sat Mar 02, 2019 8:03 pm

loftar wrote:

Granger wrote:I would also check if the temperature reading in the smart output is correct, 57°C seems a bit high to me.

I thought so too, and asked Hetzner about it. They said it's normal. >.>

Should've put your servers in Finland, it's a lot cooler up here!
I'm not sure that I have a strong argument against sketch colors - Jorb, November 2019
http://i.imgur.com/CRrirds.png?1
Join the moderated unofficial discord for the game! https://discord.gg/2TAbGj2
Purus Pasta, The Best Client
shubla
Posts: 13043
Joined: Sun Nov 03, 2013 11:26 am
Location: Finland

Re: Lagspike saga

Postby Granger » Mon Mar 04, 2019 12:43 am

As a stab in the dark: do you have a powersave CPU governor running, or any tool that adjusts CPU clock speed?
If so, it could be worth disabling these and manually setting the CPU to its maximum clock.
The background is that I once encountered a system that stalled I/O whenever the clock speed changed.

Also, if you have mounted the filesystems with TRIM/DISCARD support, you could try turning that off (and replacing it with a daily/weekly fstrim from cron), in case the online trim causes the stalls.

In the end it would be beneficial to locate the source of the I/O floods that, if I understood you correctly, started happening with world 11.
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020

Re: Lagspike saga

Postby Grog » Mon Mar 04, 2019 1:30 am

Cave decay was implemented in w11, right?
Favourite thread: viewtopic.php?f=9&t=3388
Grog
Posts: 2730
Joined: Mon Feb 08, 2010 11:42 pm
Location: Germany

Re: Lagspike saga

Postby loftar » Mon Mar 04, 2019 1:42 am

Granger wrote:As a stab in the dark: do you have a powersave CPU governor running, or any tool that adjusts CPU clock speed?
If so, it could be worth disabling these and manually setting the CPU to its maximum clock.
The background is that I once encountered a system that stalled I/O whenever the clock speed changed.

It used the powersave governor by default, but I've already set that to performance, since apparently on Skylake the powersave governor caused the CPU to turbo much less. (On the previous server, with a Haswell Xeon, it turboed almost constantly regardless of the cpufreq governor.) That being said, even the performance governor doesn't keep the CPU at full frequency at all times. Still, I find it unlikely that frequency changes are stalling I/O, because the frequency changes literally constantly (I can check the CPU frequency as often as I want, and it still changes on basically every check), so if that were the case, it seems I would never get any I/O done.
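For anyone wanting to repeat that check, here is a rough sketch of reading the governor and current frequency per CPU from the cpufreq sysfs files (`scaling_governor` and `scaling_cur_freq`). The `sysfs_root` parameter and both function names are only illustrative; the parameter exists so the code can be pointed at a copy of the tree. This is not necessarily how the check was done on the actual server.

```python
from pathlib import Path


def read_cpufreq(sysfs_root="/sys"):
    """Map each CPU name to (governor, current frequency in kHz or None)."""
    result = {}
    cpu_dir = Path(sysfs_root) / "devices/system/cpu"
    # cpu[0-9]* skips sibling directories such as cpufreq/ and cpuidle/.
    for gov_file in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        cpu = gov_file.parent.parent.name
        freq_file = gov_file.with_name("scaling_cur_freq")
        freq = int(freq_file.read_text()) if freq_file.exists() else None
        result[cpu] = (gov_file.read_text().strip(), freq)
    return result


def not_on_performance(readings):
    """List the CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, (gov, _) in readings.items() if gov != "performance")
```

Calling `read_cpufreq()` with no argument reads the live `/sys` tree. Calling it repeatedly in a loop is one way to see how often the reported frequency changes between samples.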

Granger wrote:Also, if you have mounted the filesystems with TRIM/DISCARD support, you could try turning that off (and replacing it with a daily/weekly fstrim from cron), in case the online trim causes the stalls.

I have in fact mounted them without TRIM/DISCARD, because to my understanding, using TRIM doesn't make a lot of sense on bcache (as the cache device is constantly kept full anyway).
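As a quick way to double-check which filesystems are mounted with online discard, the mount table can be scanned for a `discard` option. The sketch below assumes the usual `/proc/mounts` layout (device, mount point, filesystem type, options) and takes the text as an argument; the function name is illustrative.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint) pairs whose options include discard."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, _fstype, options = fields[:4]
        opts = options.split(",")
        # Match plain "discard" as well as forms like "discard=async".
        if any(o == "discard" or o.startswith("discard=") for o in opts):
            hits.append((device, mountpoint))
    return hits
```

On a live system this could be called as `mounts_with_discard(open("/proc/mounts").read())`; an empty list means no mounted filesystem has online discard enabled.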

EDIT: If anything, that makes me start to wonder if that could actually be the problem, since the FTL might have to spend time garbage collecting. Perhaps it would have worked better if I had reserved 100 GB or so of permanently unused space on the drives. Just testing whether that's the case would take a fair bit of downtime in order to recreate the RAID device and all, though. Hmm. Does anyone have enough experience with NVMe drives to say how likely that is to be the case (and, if so, how much space should be reserved)? My own knowledge of them is mostly theoretical.
"Object-oriented design is the roman numerals of computing." -- Rob Pike

Re: Lagspike saga

Postby Granger » Mon Mar 04, 2019 7:46 am

flash_vol_create in /sys/fs/bcache/<cset-uuid> could, according to the bcache documentation, be a way to emulate overprovisioning ex post facto.

Thinking a bit further, online TRIM on the backed filesystem could also be an answer, as it would tell bcache which blocks got freed. Without it, bcache might see the cache running full and purge multiple erase blocks all over the cache; the latencies of a bunch of small discards in parallel might add up to the stalls you see... That depends on the discard setting in /sys/block/<cdev>/bcache, though.
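A small sketch for reading that discard setting, assuming the flag is exposed as a file named `discard` under `/sys/block/<cdev>/bcache` containing `0` or `1`. The file's content format is an assumption, and the function name and `sysfs_root` parameter are illustrative. It only reads the flag and changes nothing.

```python
from pathlib import Path


def bcache_discard_enabled(cache_dev, sysfs_root="/sys"):
    """Return True/False for the cache device's discard flag, None if absent."""
    flag = Path(sysfs_root) / "block" / cache_dev / "bcache" / "discard"
    if not flag.exists():
        return None  # not a bcache cache device, or the layout differs
    # Assumption: the kernel prints the flag as "0" or "1".
    return flag.read_text().strip() == "1"
```

Here `cache_dev` is the block device name of the cache device as it appears under `/sys/block`, the `<cdev>` placeholder above.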
⁎ Mon Mar 22, 2010 ✝ Thu Jan 23, 2020
