Lagspike saga

Postby loftar » Sat Mar 02, 2019 3:01 am

Since the current "occasional lag-spike" situation seems less than completely trivial to debug and fix, I figured I'd use this thread to provide updates.

Background for those who are not aware: We moved the server to new hardware the other day to remedy the previous lag problem by introducing SSDs into the system. That did indeed seem to fix the previous lag problem (which was characterized by very frequent but fairly short lagspikes), but has introduced a new one in its stead (characterized instead by far more infrequent, but longer, lagspikes). I still haven't found the root cause of the new issue, but for those interested, here is a somewhat technical explanation of what I've found so far, as of this post:

There are two SSDs in the system, which are RAIDed together in a mirror, for redundancy. One of these is acting strangely: Monitoring the I/O to it, it exhibits periods where the drive just seems to be "idle", often with just one, two or three, but sometimes a hundred or more, I/O requests in its queue that inexplicably just don't progress for several seconds. Most of these events pass by unnoticed, but sometimes some synchronous event happens that causes the game server to have to wait until the congestion clears up. Performing more I/O causes these events to occur more often.
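For the curious, here's a rough sketch of the kind of userspace check I mean. It's Python rather than what I actually run, the device name is made up, and "stuck" is approximated as "the in-flight counter stays non-zero for a long time", but it illustrates the idea:
Code: Select all
#!/usr/bin/env python3
# Rough illustration (not my actual monitoring): poll the in_flight counter
# in /sys/block/<device>/stat and complain when requests appear to sit in
# the queue for more than a couple of seconds.  The device name is made up.
import time

DEVICE = "nvme0n1"              # example device name only
STAT = f"/sys/block/{DEVICE}/stat"
POLL = 0.1                      # seconds between samples
STALL = 2.0                     # report when in_flight stays non-zero this long

def in_flight():
    with open(STAT) as f:
        fields = f.read().split()
    # Field 9 (index 8) of the stat file is the number of I/Os currently
    # in flight (queued or on the device).
    return int(fields[8])

stuck_since = None
last_report = 0.0
while True:
    n = in_flight()
    now = time.monotonic()
    if n == 0:
        stuck_since = None
    else:
        if stuck_since is None:
            stuck_since = now
        stalled = now - stuck_since
        if stalled >= STALL and now - last_report >= STALL:
            print(f"{time.strftime('%H:%M:%S')}: {n} request(s) in flight "
                  f"for {stalled:.1f}s on {DEVICE}")
            last_report = now
    time.sleep(POLL)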

The question, then, is what causes this, and I'm currently trying to figure out if it's a hardware or software issue. When I say that the SSD isn't processing I/O requests, what I'm actually looking at is the software I/O queue, rather than the queue of the device itself (I don't know of a way to examine that directly), as seen through /sys/block/<device>/stat on Linux. In order to get a bit closer to the problem, I've been looking at I/O traces on the device, as captured by blktrace. Let's have a look at a snippet of one of these events:
Code: Select all
259,1    5      295     4.847660114   211  A FWFS 67110928 + 3 <- (259,9) 16
259,1    5      296     4.847660726   211  Q FWFS 67110928 + 3 [md2_raid1]
259,1    5      297     4.847661121   211  G FWFS 67110928 + 3 [md2_raid1]
259,1    5      298     5.414686200     0  C WFSM 67443871 + 1 [0]
259,1    5      299     5.414687398     0  C WFSM 67443871 [0]
259,1    5      300     5.986030998   212  I WFS 67110928 + 3 [kworker/5:1H]
259,1    5      301     5.986048919   212  D WFS 67110928 + 3 [kworker/5:1H]
259,1    3      475     6.031590146   262  A   R 153082702 + 178 <- (259,9) 85971790
259,1    3      476     6.031590382   262  Q   R 153082702 + 178 [bcache_writebac]
259,1    3      477     6.031591225   262  G   R 153082702 + 178 [bcache_writebac]
259,1    3      478     6.031594879   262  D  RS 153082702 + 178 [bcache_writebac]
259,1    5      302     6.559476971     0  C WFS 67110928 + 3 [0]
259,1    5      303     6.559478355     0  C WFS 67110928 [0]

The particular trace of interest here is the write to 67110928+3, most likely a superblock update (either to the mdraid superblock or the bcache superblock; I'm not sure which one, and while I could find out, it probably doesn't matter). A few things stand out about it. First and most importantly, it takes 1.7 seconds to complete, which is quite over the top, especially seeing as the drive is doing virtually no other I/O at the same time -- it's almost alone in the queue. Second, it spends most of its time, a full 1.1 seconds, not even on the hardware, but between the "G" and "I" states. Third, even once it does get to the hardware (the "D" state until the "C" state), it still spends the remaining ~0.6 seconds there, which is also quite over the top.
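If anyone wants to poke at traces like this themselves: the timings above are just deltas between the timestamps of the different events for the same request, so a quick-and-dirty parser of blkparse's default text output gets you most of the way there. A sketch (Python, untested; the action-code legend in the comments is just my reading of the blkparse documentation):
Code: Select all
#!/usr/bin/env python3
# Quick-and-dirty: read blkparse's default text output on stdin and print,
# per sector, how long each request took in total, how long it sat between
# the "G" and "I" events, and how long it spent on the device.
# Action codes, as I understand them: Q = queued, G = get request,
# I = inserted into the scheduler queue, D = dispatched to the driver,
# C = completed.
# Example use (device name is only an example):
#   blkparse -i nvme0n1 | python3 reqtimes.py
import sys
from collections import defaultdict

events = defaultdict(dict)   # sector -> {action: timestamp}

for line in sys.stdin:
    parts = line.split()
    # Default line format: dev cpu seq timestamp pid action rwbs sector [+ n] ...
    if len(parts) < 8:
        continue
    try:
        ts = float(parts[3])
        sector = int(parts[7])
    except ValueError:
        continue
    action = parts[5]
    if action in ("Q", "G", "I", "D", "C"):
        # If a sector shows up in more than one request, later events simply
        # overwrite earlier ones; good enough for eyeballing a short trace.
        events[sector][action] = ts

for sector, ev in sorted(events.items()):
    if "Q" not in ev or "C" not in ev:
        continue
    total = ev["C"] - ev["Q"]
    g_to_i = ev["I"] - ev["G"] if "G" in ev and "I" in ev else float("nan")
    on_hw = ev["C"] - ev["D"] if "D" in ev else float("nan")
    print(f"sector {sector}: total {total:.3f}s, G->I {g_to_i:.3f}s, D->C {on_hw:.3f}s")

Run over the snippet above, it should reproduce roughly the 1.7 s total, 1.1 s between "G" and "I", and 0.6 s on the hardware that I mentioned.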

Now, I'm not so intimately familiar with Linux's I/O system that I can say for sure what the "Q" and "G" states really mean, so I don't know exactly what it means for a request to remain in the "G" state for so long. It could mean this is a pure software problem (perhaps the code is waiting on some totally unnecessary lock, or whatever), or it could be that whatever function put the request in the "G" state is waiting for some other hardware event to complete before it proceeds to the "I" state (for example, waiting for a prior I/O operation to complete before this one can safely be issued, for ordering reasons), in which case it might still be a hardware problem. If it's the latter, however, then it's not clear from this trace what it's waiting on, since the only other I/O operation that completes on this device in the meantime (the write to 67443871+1) completes over half a second prior, with no activity whatsoever on the queue in between.

So currently, there are things that speak both for this being a software problem and for it being a hardware problem. On the software side: the request spends most of its time purely on the software side of things, and there's no obvious hardware event whose completion it could have been waiting for. On the hardware side: the write still takes quite long once it does reach the hardware, and it's surrounded by other requests that also take long on the hardware (both the write to 67443871+1 and the later read of 153082702+178). Also pointing towards a hardware problem is the fact that only one of the SSDs is exhibiting this behavior, while I can't see anything like it on the other one. Then again, even that might just be because mdraid treats its two legs slightly differently in some subtle manner.

However...

One thing I have noticed is some kind of weirdness with the kernel which makes it a bit unwieldy to make certain kinds of observations, so my first plan of action is to upgrade the kernel from the current Debian 9 stock kernel to the backport of the Debian 10 stock kernel (which allegedly fixes said weirdness). Perhaps, if the whole issue actually is a software issue, then with a little bit of luck that in itself might "accidentally" solve the problem.

Of course, to do that, I need to reboot the server. It should be a quick operation; just don't be surprised if the website and forums are down for a minute or two (or possibly longer if something goes wrong).
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
 
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby Sevenless » Sat Mar 02, 2019 3:02 am

Neat
Lucky: haven is so quirky
Lucky: can be so ugly, can be so heartwarming
Sevenless: it is life

The Art of Herding
W15 Casting Rod Cheatsheet
Explanation of the logic behind the cooking system
Sevenless
 
Posts: 7292
Joined: Fri Mar 04, 2011 3:55 am
Location: Canada

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 3:03 am

Oh, and since someone is going to ask, the SSDs in the system are Toshiba drives, model number "THNSN5512GPU7", both of them.
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
 
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby TheNater » Sat Mar 02, 2019 3:10 am

Good luck!

I know you mentioned it's only happening on one drive in the RAID, but maybe you could physically swap the drives (AB to BA); that would help figure out whether it's hardware or software. That is, of course, if the kernel update doesn't help.
Last edited by TheNater on Sat Mar 02, 2019 3:11 am, edited 1 time in total.
TheNater
 
Posts: 44
Joined: Mon Jan 24, 2011 12:10 am

Re: Lagspike saga

Postby Kaios » Sat Mar 02, 2019 3:10 am

loftar wrote:model number "THNSN5512GPU7", both of them.


that's racist
Kaios
 
Posts: 8703
Joined: Fri Jul 30, 2010 2:14 am

Re: Lagspike saga

Postby Ysh » Sat Mar 02, 2019 3:13 am

Kaios wrote:
loftar wrote:model number "THNSN5512GPU7", both of them.


that's racist

What?
Kaios wrote:Spice Girls are integral to understanding Ysh's thought process when communicating, duly noted.

I have become victory of very nice Jordan Coles Contest! Enjoy my winning submit here if it pleasures you.
Ysh
 
Posts: 5953
Joined: Sun Jan 31, 2010 4:43 am
Location: Chatting some friends on forum

Re: Lagspike saga

Postby loftar » Sat Mar 02, 2019 3:14 am

Great shitposting, guys. A+.

Rebooting in two minutes.
"Object-oriented design is the roman numerals of computing." -- Rob Pike
loftar
 
Posts: 8926
Joined: Fri Apr 03, 2009 7:05 am

Re: Lagspike saga

Postby Sprymazi » Sat Mar 02, 2019 3:14 am

"Mandatory comment on how we NEED 3-4 months free premium for 5 hours of downtime"
Sprymazi
 
Posts: 45
Joined: Mon Jan 22, 2018 11:11 pm

Re: Lagspike saga

Postby kirion » Sat Mar 02, 2019 3:27 am

Hopefully it fixes the lagspikes. Everything else is butter smooth.
kirion
 
Posts: 373
Joined: Sat Jul 31, 2010 11:45 pm

Re: Lagspike saga

Postby Ejnekor » Sat Mar 02, 2019 3:29 am

Good luck, guys! I hope it will work. The lagspikes have been kind of annoying lately...
Ejnekor
 
Posts: 133
Joined: Thu Mar 11, 2010 4:41 pm
