There are two SSDs in the system, RAIDed together in a mirror for redundancy. One of them is acting strangely: monitoring the I/O to it shows periods where the drive just seems to go "idle", often with just one, two or three, but sometimes a hundred or more, I/O requests sitting in its queue that inexplicably make no progress for several seconds. Most of these events pass by unnoticed, but sometimes a synchronous event forces the game server to wait until the congestion clears up. Performing more I/O makes these events occur more often.
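(For concreteness, the way I watch this is roughly the following; the device name here is hypothetical, and the in-flight count is field 9 of the sysfs stat file:)
- Code: Select all
# Poll the in_flight counter (field 9 of /sys/block/<device>/stat) once a
# second; a stall shows up as a nonzero count that refuses to drop.
while sleep 1; do
    printf '%s ' "$(date +%T)"
    awk '{print $9}' /sys/block/nvme1n1/stat
done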
The question, then, is what causes this, and I'm currently trying to figure out if it's a hardware or software issue. When I say that the SSD isn't processing I/O requests, what I'm actually looking at is the software I/O queue, rather than the queue of the device itself (I don't know of a way to examine that directly), as seen through
/sys/block/<device>/stat on Linux. In order to get a bit closer to the problem, I've been looking at I/O traces on the device, as captured by
blktrace. Let's have a look at a snippet of one of these events:
- Code: Select all
259,1 5 295 4.847660114 211 A FWFS 67110928 + 3 <- (259,9) 16
259,1 5 296 4.847660726 211 Q FWFS 67110928 + 3 [md2_raid1]
259,1 5 297 4.847661121 211 G FWFS 67110928 + 3 [md2_raid1]
259,1 5 298 5.414686200 0 C WFSM 67443871 + 1 [0]
259,1 5 299 5.414687398 0 C WFSM 67443871 [0]
259,1 5 300 5.986030998 212 I WFS 67110928 + 3 [kworker/5:1H]
259,1 5 301 5.986048919 212 D WFS 67110928 + 3 [kworker/5:1H]
259,1 3 475 6.031590146 262 A R 153082702 + 178 <- (259,9) 85971790
259,1 3 476 6.031590382 262 Q R 153082702 + 178 [bcache_writebac]
259,1 3 477 6.031591225 262 G R 153082702 + 178 [bcache_writebac]
259,1 3 478 6.031594879 262 D RS 153082702 + 178 [bcache_writebac]
259,1 5 302 6.559476971 0 C WFS 67110928 + 3 [0]
259,1 5 303 6.559478355 0 C WFS 67110928 [0]
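(For reference, a trace like the one above can be captured live with something along these lines; the device name is again hypothetical:)
- Code: Select all
# Trace all block-layer events on the device and pretty-print them
# as they happen.
blktrace -d /dev/nvme1n1 -o - | blkparse -i -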
The particular request of interest here is the write to
67110928+3, most likely a superblock update (either the mdraid superblock or the bcache superblock; I'm not sure which, and I could find out, but it probably doesn't matter). A couple of things stand out about it. First and most importantly, it takes 1.7 seconds to complete, which is quite over the top, especially seeing as the drive is doing virtually no other I/O at the same time -- the request is almost alone in the queue. Second, it spends most of that time, a full 1.1 seconds, not even on the hardware, but between the "G" and "I" states. Third, even once it does reach the hardware (the "D" state until the "C" state), it still spends the remaining ~0.6 seconds there, which is also far longer than it should take.
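As an aside, the blktrace suite's btt tool can put aggregate numbers on exactly these phases; a minimal sketch, with file names of my own choosing, would be to record a trace, dump it to a single binary file, and summarize it:
- Code: Select all
# Record 30 seconds of events, then report per-phase latencies
# (Q2G, G2I, I2D, D2C, ...) across all requests in the trace.
blktrace -d /dev/nvme1n1 -w 30 -o trace
blkparse -i trace -d trace.bin > /dev/null
btt -i trace.bin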
Now, I'm not so intimately familiar with Linux's I/O system that I can say for sure what the "Q" and "G" states really mean, so I don't know exactly what it means for a request to remain in the "G" state for so long. It could mean this is a pure software problem (perhaps the code is waiting on some totally unnecessary lock, or whatever), or it could be that whatever function put the request into the "G" state is waiting for some other hardware event to complete before it proceeds to the "I" state (for example, waiting for a prior I/O operation to finish before this one can safely be issued, for ordering reasons), in which case it might still be a hardware problem. If it's the latter, however, then it's not clear from this trace what it's waiting on, since the only other I/O operation that completes on this device in the meantime (the write to 67443871+1) completes over half a second earlier, with no activity whatsoever on the queue in between.
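For anyone following along, my reading of the blkparse(1) man page is that the single-letter actions decode roughly as follows (the FWFS-style column next to them is the RWBS field: read/write plus flush, FUA and sync flags):
- Code: Select all
A  remapped to/from another device (here: from the md/bcache device)
Q  queued by the block layer
G  get request: a struct request was allocated for the I/O
I  inserted: handed to the I/O scheduler's internal queue
D  issued: dispatched to the device driver
C  completed
That legend still doesn't tell me what the kernel is actually doing between "G" and "I", though, which is the crux of the matter.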
So currently, there are things that point to this being a software problem and things that point to hardware. On the software side: the request spends most of its time purely in the software half of the stack, and there's no obvious hardware event whose completion it could be waiting on. On the hardware side: the write still takes quite long once it does reach the hardware, and it's surrounded by other requests that take a long time on the hardware too (both the write to 67443871+1 and the later read to 153082702+178). Also pointing toward a hardware problem is the fact that only one of the SSDs exhibits this behavior, while I can't see anything like it on the other one. I nevertheless figure that even that might just be down to mdraid treating its two legs slightly differently in some subtle way.
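(For that comparison I'm doing something like the following; device names are again hypothetical:)
- Code: Select all
# Print both mirror legs' in-flight counts each second, side by side,
# to see whether the stalls really are confined to one device.
while sleep 1; do
    printf '%s  nvme0n1=%s  nvme1n1=%s\n' "$(date +%T)" \
        "$(awk '{print $9}' /sys/block/nvme0n1/stat)" \
        "$(awk '{print $9}' /sys/block/nvme1n1/stat)"
done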
However...