April 12, 2005
Server Failure
Mathserv crashed at 1:02 pm on Monday, April 11th. It was back up again at 1:15 am with no data loss.
Most msprime workstations are picking up where they left off, though some may require a reboot.
The cause of the crash was a disk failure on one of the two RAID 5 arrays. "Aren't RAID 5 arrays supposed to be able to handle a single disk failure?" you ask. Yes; yes they are. But this is the second time since January that this server has failed to do so. I think that I have identified and solved a hardware problem that could have precipitated both this crash and the one in March.
Note that the department web site (including course pages) was down only from 1:00 pm to ca. 2:30 pm. I used the 2:00 am backup copy to bring the sites up on our backup server.
Plans were underway to reduce the likelihood of this sort of failure as well as the downtime should it happen anyhow; I am accelerating the project.
For those who are interested: the hardware fault in question is a bad disk. After the March crash, the controller status report showed the bad disk was not in the array; it appears now that the array was in fact rebuilt with the bad disk. That disk has been replaced. In fact, replacing that disk was the point of the shutdown which I had planned for this Friday.
I'm going to be consulting with the RAID manufacturer about the failures. Once the new fail-over server arrives, the current server will go down for firmware upgrades and reconfiguration of the RAID setup.
Hi Ken,
I know you guys are working twice as hard to keep the network up, yet whenever I find the server down, I feel very sad about it. Recently Microsoft claimed that their Windows servers perform better than Linux servers. There are controversies regarding the claim, but still, such frequent failures of mathserv only make me suspicious about its reliability (which I do not want to be).
I feel your pain and even understand your flirtations with the dark side. But even Bill Gates could not save us from hardware failures. In each of the four failures we've experienced this year, the problem has been with the RAID file system, and the RAID software is not part of Linux; it's built into the device.
It is possible, of course, that the hardware-RAID file systems would have failed more gently with Windows, but I doubt it. We have similar hardware on other Linux servers and have seen both proper failures (i.e. a disk dies and the system carries on) and improper ones (i.e. a disk dies and takes the whole file system with it).
We're considering replacing the RAID controllers altogether in order to improve reliability, and we're going to have a second server in fail-over mode ready to take over should a crash happen anyhow.
KM