March 17, 2005
Server Crash, Recovery
Mathserv crashed when a disk failed on Wednesday afternoon and came back on-line at 1:00 am on Thursday once the disk array was rebuilt. Again? you ask. Yes and no. Yes, this is the fourth time this year (and the fourth period since we moved to a new server in the fall of 2003) that we've had Web, email and file-server access go down due to a disk problem; but no, since all four crashes have had different causes.
We are part-way through implementing system changes that will allow us to have both secure backups and fast fail-over when a file system fails. We are also trying to determine why we (as well as both Physics and Psychology) have seen so many problems with RAID file systems. By the end of term, we should understand the failures better and will have the servers configured to minimize the impact of the failures that we can't prevent.