borka wrote: k - I tried to sample the important stuff (I quit after 53 pages ... :'( )
short version:
All filesystems are XFS. I suspect the lag spikes happen because of either XFS logging flushing, or by XFS' interaction with sync()-ing, but I have no hard proof. The practical effect, either way, is that some process (like the game server) might have to wait sometimes upwards of 10-20 seconds for some simple I/O operation like an open(), rename() or close() to complete.
Some Loftar quotes:
*From the "Planned downtime: OS upgrade" thread:
I'm having some hopes that the newer kernel might have less lock contentions in the XFS driver, which might have a beneficial impact on I/O latency, but I can't say for sure.

---
Much of it is related to the fact that the hard drives are now RAID-1'd to prevent data loss from disk crashes. In world 3, I had the game data on a separate drive, which helped immensely, but was of course far more vulnerable, and there was in fact a nuke situation caused by it. You can probably find it somewhere around the Announcements forum.
---
Notice, particularly, the absurdly high waiting times on dm-2, which is the LV holding the game data. It, in turn, is backed by md2, which is a md-RAID mirror backed by sda and sdb. dm-0 is the root filesystem, backed by the same VG as dm-2, and dm-1 is on a separate VG (backed by sda and sdb directly, without redundancy) and holds the minimap data. All filesystems are XFS. I suspect the lag spikes happen because of either XFS logging flushing, or by XFS' interaction with sync()-ing, but I have no hard proof. The practical effect, either way, is that some process (like the game server) might have to wait sometimes upwards of 10-20 seconds for some simple I/O operation like an open(), rename() or close() to complete.
---
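The 10-20 second stalls on simple calls like open(), rename() and close() described above are the kind of thing a small probe can watch for by timing each metadata operation separately. A minimal sketch (the probe file names and the use of a temporary directory are my own invention; to be meaningful it would have to run against the affected filesystem, e.g. the one behind dm-2):

```python
import os
import tempfile
import time

def time_metadata_ops(directory):
    """Time one open/write/close/rename cycle in the given directory.

    Returns a dict of step name -> elapsed seconds. A spike in any one
    step (e.g. rename() taking whole seconds) points at the kind of
    metadata stall described in the quote above.
    """
    timings = {}
    path = os.path.join(directory, "latency-probe.tmp")
    final = os.path.join(directory, "latency-probe.dat")

    start = time.monotonic()
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    timings["open"] = time.monotonic() - start

    start = time.monotonic()
    os.write(fd, b"probe")
    timings["write"] = time.monotonic() - start

    start = time.monotonic()
    os.close(fd)
    timings["close"] = time.monotonic() - start

    start = time.monotonic()
    os.rename(path, final)
    timings["rename"] = time.monotonic() - start

    os.unlink(final)
    return timings

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for step, secs in time_metadata_ops(d).items():
            print(f"{step:>7}: {secs * 1000:.3f} ms")
```

Run in a loop and logged, something like this would at least put numbers and timestamps on the spikes, even if it cannot say what they block on.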
No. I guess I wouldn't mind trying, but I doubt fragmentation is the issue, since it isn't a matter of the filesystem just performing steadily badly, but rather of sudden spikes in I/O latency. This is also part of the reason why I suspect that log flushing is part of the problem. I was hoping that Linux 3.2 would solve some of that since it uses delayed logging in XFS by default, but alas...
---
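Whether delayed logging is actually active on a given XFS mount shows up in its mount options (in the kernels of that era it appeared as the `delaylog` option). A small sketch that pulls the option list for XFS filesystems out of text in /proc/mounts format — the device name, mountpoint and option string below are invented sample data, not taken from the server:

```python
def xfs_mounts(mounts_text):
    """Extract (mountpoint, option-list) pairs for XFS filesystems
    from text in /proc/mounts format: each line is
    'device mountpoint fstype options dump pass'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "xfs":
            result.append((fields[1], fields[3].split(",")))
    return result

# Hypothetical sample line; on a live system, read /proc/mounts instead.
sample = "/dev/mapper/vg0-game /srv/game xfs rw,relatime,attr2,delaylog,noquota 0 0"
print(xfs_mounts(sample))
```

On a live box one would feed it `open("/proc/mounts").read()` and check whether `delaylog` (or whatever the running kernel calls it) is present for the game-data mount.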
I've explained it previously both in this thread and others, but the concrete symptoms are that individual VFS operations seemingly randomly take Really Long (sometimes up to 20 seconds) to complete, particularly metadata-heavy ones, such as open(), rename(), close() and the like. The problem is that I don't know what they are blocking on, nor how to find that out. Processes running sync() or fsync() calls do seem to have particularly high probability of making other processes block, but they are far from exclusively responsible.
---
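One standard way to approach the "I don't know what they are blocking on" problem is to catch the affected process in state `D` (uninterruptible sleep, which on Linux usually means blocked on I/O) and then read its kernel stack from /proc/&lt;pid&gt;/stack, which has been available since roughly kernel 2.6.29 and is readable by root. A rough sketch of the D-state scan, with the /proc path parameterised purely so it can be exercised against a fake directory:

```python
import os

def uninterruptible_tasks(proc="/proc"):
    """Scan a /proc-style directory for tasks in state 'D'
    (uninterruptible sleep) and return (pid, comm) pairs."""
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # The comm field is parenthesised and may contain spaces,
        # so locate it by the outermost parentheses.
        comm = stat[stat.index("(") + 1:stat.rindex(")")]
        state = stat[stat.rindex(")") + 2:].split()[0]
        if state == "D":
            blocked.append((int(entry), comm))
    return blocked

if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, comm in uninterruptible_tasks():
        print(f"{pid} ({comm}) is in uninterruptible sleep")
```

While a task sits in D state, `cat /proc/<pid>/stack` (as root) shows the kernel call chain it is stuck in — about the most direct answer available to the question of what the VFS operation is actually waiting on.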
Well, to be fair, the cause of the lag is the same as it has been for the past year or so. It only changed in quantity, not in quality. If it were as easy as "trying stuff", Haven would never have had lag to begin with.
---
No, not really. In that regard, I think it speaks for itself that the increase in lag was caused by software changes, rather than hardware changes.

That's not to exclude the possibility that hardware upgrades could alleviate the problem, however. For instance, more RAM would increase the amount of data fitting into the block cache and therefore alleviate the need for disk I/O, which may or may not fix the problem (though I don't really think so either, because the current block cache size should be more than enough as it is, I think). I can also conclude that the server that Salem is running on has far (far) better I/O than the Haven server has, but I'm not sure whether that's mainly a hardware issue or if it might be related to it not using md-RAID or LVM (that hosting company uses a SAN instead).
---
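The block-cache sizing question in the last quote is at least checkable: on Linux the page cache shows up as the `Cached` field of /proc/meminfo. A tiny helper for pulling one field out of that format — the numbers below are invented sample data, not measurements from the Haven server:

```python
def meminfo_kb(text, field):
    """Return the value in kB of one field from text in
    /proc/meminfo format ('Name:   12345 kB' per line)."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name == field:
            return int(rest.split()[0])
    raise KeyError(field)

# Hypothetical sample; on a live system, read /proc/meminfo instead.
sample = """MemTotal:        8174432 kB
MemFree:          123456 kB
Cached:          5242880 kB
"""
print(meminfo_kb(sample, "Cached") // 1024, "MB in the page cache")
```

Comparing `Cached` over time against the size of the hot game data would show whether the working set already fits — which is what Loftar suspects, and which would mean more RAM changes little.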