There are two SSDs in the system, RAIDed together in a mirror for redundancy. One of them is acting strangely: monitoring the I/O to it shows periods where the drive just seems to go "idle", often with just one, two or three, but sometimes a hundred or more, I/O requests sitting in its queue that inexplicably make no progress for several seconds. Most of these events pass by unnoticed, but sometimes a synchronous event forces the game server to wait until the congestion clears up. Performing more I/O makes these events occur more often.
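(For concreteness, the way I watch this is roughly the following; the device name here is hypothetical, and the in-flight count is field 9 of the sysfs stat file:)
- Code: Select all
# Poll the in_flight counter (field 9 of /sys/block/<device>/stat) once a
# second; a stall shows up as a nonzero count that refuses to drop.
while sleep 1; do
    printf '%s ' "$(date +%T)"
    awk '{print $9}' /sys/block/nvme1n1/stat
done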
The question, then, is what causes this, and I'm currently trying to figure out if it's a hardware or software issue. When I say that the SSD isn't processing I/O requests, what I'm actually looking at is the software I/O queue, rather than the queue of the device itself (I don't know of a way to examine that directly), as seen through
/sys/block/<device>/stat on Linux. In order to get a bit closer to the problem, I've been looking at I/O traces on the device, as captured by
blktrace. Let's have a look at a snippet of one of these events:
- Code: Select all
259,1 5 295 4.847660114 211 A FWFS 67110928 + 3 <- (259,9) 16
259,1 5 296 4.847660726 211 Q FWFS 67110928 + 3 [md2_raid1]
259,1 5 297 4.847661121 211 G FWFS 67110928 + 3 [md2_raid1]
259,1 5 298 5.414686200 0 C WFSM 67443871 + 1 [0]
259,1 5 299 5.414687398 0 C WFSM 67443871 [0]
259,1 5 300 5.986030998 212 I WFS 67110928 + 3 [kworker/5:1H]
259,1 5 301 5.986048919 212 D WFS 67110928 + 3 [kworker/5:1H]
259,1 3 475 6.031590146 262 A R 153082702 + 178 <- (259,9) 85971790
259,1 3 476 6.031590382 262 Q R 153082702 + 178 [bcache_writebac]
259,1 3 477 6.031591225 262 G R 153082702 + 178 [bcache_writebac]
259,1 3 478 6.031594879 262 D RS 153082702 + 178 [bcache_writebac]
259,1 5 302 6.559476971 0 C WFS 67110928 + 3 [0]
259,1 5 303 6.559478355 0 C WFS 67110928 [0]
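(For reference, a trace like the one above can be captured live with something along these lines; the device name is again hypothetical:)
- Code: Select all
# Trace all block-layer events on the device and pretty-print them
# as they happen.
blktrace -d /dev/nvme1n1 -o - | blkparse -i -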
The particular request of interest here is the write to
67110928+3, most likely a superblock update (either the mdraid superblock or the bcache superblock; I'm not sure which, and I could find out, but it probably doesn't matter). A couple of things stand out about it. First and most importantly, it takes 1.7 seconds to complete, which is quite over the top, especially seeing as the drive is doing virtually no other I/O at the same time -- the request is almost alone in the queue. Second, it spends most of that time, a full 1.1 seconds, not even on the hardware, but between the "G" and "I" states. Third, even once it does reach the hardware (the "D" state until the "C" state), it still spends the remaining ~0.6 seconds there, which is also far longer than it should take.
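As an aside, the blktrace suite's btt tool can put aggregate numbers on exactly these phases; a minimal sketch, with file names of my own choosing, would be to record a trace, dump it to a single binary file, and summarize it:
- Code: Select all
# Record 30 seconds of events, then report per-phase latencies
# (Q2G, G2I, I2D, D2C, ...) across all requests in the trace.
blktrace -d /dev/nvme1n1 -w 30 -o trace
blkparse -i trace -d trace.bin > /dev/null
btt -i trace.bin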
Now, I'm not so intimately familiar with Linux's I/O system that I can say for sure what the "Q" and "G" states really mean, so I don't know exactly what it means for a request to remain in the "G" state for so long. It could mean this is a pure software problem (perhaps the code is waiting on some totally unnecessary lock, or whatever), or it could be that whatever function put the request into the "G" state is waiting for some other hardware event to complete before it proceeds to the "I" state (for example, waiting for a prior I/O operation to finish before this one can safely be issued, for ordering reasons), in which case it might still be a hardware problem. If it's the latter, however, then it's not clear from this trace what it's waiting on, since the only other I/O operation that completes on this device in the meantime (the write to 67443871+1) completes over half a second earlier, with no activity whatsoever on the queue in between.
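For anyone following along, my reading of the blkparse(1) man page is that the single-letter actions decode roughly as follows (the FWFS-style column next to them is the RWBS field: read/write plus flush, FUA and sync flags):
- Code: Select all
A  remapped to/from another device (here: from the md/bcache device)
Q  queued by the block layer
G  get request: a struct request was allocated for the I/O
I  inserted: handed to the I/O scheduler's internal queue
D  issued: dispatched to the device driver
C  completed
That legend still doesn't tell me what the kernel is actually doing between "G" and "I", though, which is the crux of the matter.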
So currently, there are things that point to this being a software problem and things that point to hardware. On the software side: the request spends most of its time purely in the software half of the stack, and there's no obvious hardware event whose completion it could be waiting on. On the hardware side: the write still takes quite long once it does reach the hardware, and it's surrounded by other requests that take a long time on the hardware too (both the write to 67443871+1 and the later read to 153082702+178). Also pointing toward a hardware problem is the fact that only one of the SSDs exhibits this behavior, while I can't see anything like it on the other one. I nevertheless figure that even that might just be down to mdraid treating its two legs slightly differently in some subtle way.
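(For that comparison I'm doing something like the following; device names are again hypothetical:)
- Code: Select all
# Print both mirror legs' in-flight counts each second, side by side,
# to see whether the stalls really are confined to one device.
while sleep 1; do
    printf '%s  nvme0n1=%s  nvme1n1=%s\n' "$(date +%T)" \
        "$(awk '{print $9}' /sys/block/nvme0n1/stat)" \
        "$(awk '{print $9}' /sys/block/nvme1n1/stat)"
done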
However...