February 2009 Archives
February 25, 2009
We will be keeping backups of changed/deleted files for only ten days instead of the usual two weeks (four weeks in the case of mathserv) for the next few days due to a disk-array problem on our backup server.
February 23, 2009
Maple was not working on most servers and workstations due to a problem with the license server; the problem has been corrected and Maple works again.
February 22, 2009
I will be rebooting the primary file server at 9 PM Sunday in order to complete some software upgrades. Web, email and workstation access will be interrupted for ca. 10 minutes.
All services are back on line as of 1:40 PM following the earlier crash. There was no file loss; most workstations will start working again without rebooting. Investigation continues.
The primary file server crashed early Sunday morning. I am going to keep the server off-line while I try to isolate and fix the problem. Mail, ssh and workstation logins will be down for the next few hours. Web access will stay up most of the time.
February 20, 2009
Workstation access is restored. That's the end of today's power-related problems.
I spoke too soon: the mail file server did not come up cleanly. I am working on it.
We did, as expected, lose power to the server room this afternoon at ca. 4:30. After 15 min. our battery backup was drained and I had to shut the servers down. Web and email are back up as of 4:55. Workstation access is failing for as-yet unknown reasons. I am investigating.
Facility Services plans to cut the emergency power this afternoon at ca. 3:00. If this cut affects the server room - which it should not, but that didn't help us this morning - then the compute/group servers will be shut down but the primary servers will stay up and there will be no effect on email, web or workstations.
The power loss in the server room this morning was related to the planned cut to emergency power in Hamilton Hall. But the server room battery-backup unit is meant to be on regular power, so this was not expected. Facility Services is investigating.
The primary servers and several group/compute servers rebooted unexpectedly at 9:10 this morning. The only commonality is that they are all on the same battery backup unit. We are investigating.
February 18, 2009
We discovered at ca. 9 PM that mathserv was not accessible from off-campus (unless using VPN access) and had not been since ca. 2 PM. This was related to the network problem affecting workstations this morning and is fixed as of 9:30 PM.
The primary server is OK again. If your workstation is still weird (no login, no mail, no icons), please reboot. It appears that a latent network problem which had been around for a few days became acute (possibly triggered by the file server crash).
Problems with mathserv persist. I may need to reboot again between now and 2 pm.
I'm going to reboot mathserv at 12:15 today in order to resolve scattered workstation problems related to the file server crash early this morning. Web, email and workstations will be down for 10 - 15 minutes. Reboot your workstation if it doesn't start behaving properly by 12:30.
While all systems were go at 7:30 today, I neglected to turn off the logon block until 7:50. So now you can log in and get mail.
The main file server crashed at 1:15 am today; web sites were up again at 6:30; all services are up again as of 7:30 am. Mail is catching up quickly; we ran for 15 seconds without a spam filter, so expect a burst; most workstations should work without rebooting.
The main file server is down. I don't yet know why, except that it has nothing to do with the upgrades (since they are being done to a different server). But I am going to look shortly.
February 16, 2009
February 10, 2009
The planned server reboot happened at 5:20 rather than 5:00, but was quick and otherwise successful. Thanks for enduring yet another (brief) interruption.
It seems I only post bad news. Here's some good news: I upgraded the capacity of our backup server on the weekend and we now have plenty of room for our growing data. Presumably no one (except my wife) noticed.
A note about the following updates was emailed to all department members 2009/02/10 at 12:30 pm: Server Reboot this Afternoon; Downtime for Upgrades During Reading Week; Server Crash Tuesday Morning; Bluespruce More than Down.
I will be rebooting the primary file server at 5 pm this afternoon in order to implement some OS updates. Email and workstations will be down for ca. ten minutes.
I will be upgrading hardware and software on the primary department server over reading week. Most services will be up through the upgrades, but there will be periods of five to fifteen minutes on Monday afternoon and Tuesday afternoon when email is down and the workstations do not respond; there should be only very brief web interruptions.
Bluespruce, which crashed last week, is not coming up anytime soon: the problem is serious but so far difficult to pinpoint. The system will likely be down for at least a week or two. I am looking into making another compute server available during the interim.
The primary file server crashed at 6 AM today and came back up after file-system repairs were complete at 9:50 AM. The primary web and mail server took collateral damage but was back up at 10:15, at which point all services were running again. Things will be slow for another couple of hours while file-system repairs are completed in the background and the mail queue filters 5000+ messages (mostly spam, of course).
February 3, 2009
Bluespruce has crashed and won't boot back up. We are investigating.
February 2, 2009
The ms workstations went wonky/hangy for five minutes late this afternoon; one of the servers didn't take well to a performance tweak (the tweak is now untwuck).