April 12, 2005
Mathserv crashed at 1:02 pm on Monday, April 11th. It was back up again at 1:15 am with no data loss.
Most msprime workstations are picking up where they left off, though some may require a reboot.
The cause of the crash was a disk failure on one of the two RAID 5 arrays. "Aren't RAID 5 arrays supposed to be able to handle a single disk failure?" you ask. Yes; yes they are. But this is the second time since January that this server has failed to do so. I think that I have identified and solved a hardware problem that could have precipitated both this crash and the one in March.
Note that the department web site (including course pages) was down only from 1:00 pm to ca. 2:30 pm. I used the 2:00 am backup copy to bring the sites up on our backup server.
Plans were underway to reduce the likelihood of this sort of failure as well as the downtime should it happen anyhow; I am accelerating the project.
For those who are interested: the hardware fault in question is a bad disk. After the March crash, the controller status report showed the bad disk was not in the array; it appears now that the array was in fact rebuilt with the bad disk. That disk has been replaced. In fact, replacing that disk was the point of the shutdown which I had planned for this Friday.
I will be consulting with the RAID manufacturer about the failures. Once the new fail-over server arrives, the current server will go down for firmware upgrades and reconfiguration of the RAID setup.