February 2011 Archives
February 28, 2011
The formatting problem is now mostly fixed and category and archive links are working again.
As will be obvious if you are viewing this blog directly (as opposed to via an RSS feed or the login message-of-the-day), an upgrade to the MovableType blog software broke the formatting completely. I'm aware of the problem and will fix it at some point.
February 25, 2011
SquirrelMail, used for mathmail.mcmaster.ca webmail, is miserably slow even though the server is more or less idling. I'll be looking at this today or Monday.
We lost power for two minutes just past 5 o'clock this morning. The outage was planned and announced by Facilities Services, but the announcement did not mention that Hamilton Hall would be affected.
Most computers will simply have restarted when the power returned; the odds of damage to computers and monitors are slight (though not zero). Note that the outage did not affect the compute servers or ms (the main file/email/web server).
February 24, 2011
We're up and running again as of 5:30 pm - which means that we were down for 90 minutes instead of the announced 30 minutes. While we had the system off-line, we moved to a larger storage system. So we're now running with more than twice the storage, double the RAM and twelve CPUs instead of eight.
While email and workstations were down for the entire period, web sites were up and down a few times - I had them up and running whenever I could safely do so.
February 23, 2011
The web sites hosted on ms.mcmaster.ca (most notably www.math.mcmaster.ca) will be down for up to a few minutes at a time Wednesday, Thursday and Friday. I will be experimenting with some configuration changes which will make it easier for us to keep the web sites up during maintenance and after system problems.
The main server will go down at 4:00 pm on Thursday. The server itself should be up again almost immediately but it may take up to half an hour for all services to resume (mail, web, workstations, etc.).
February 18, 2011
In order to reduce the load on the storage server while it digests the replacement disk, I've turned off IMAP access to mailboxes.
You can read your mail via pine or http://mathmail.mcmaster.ca.
I'll be turning things on periodically to test the ability of the storage server to accept the mail load. Access from on-campus locations will be turned on before access from off campus.
The main server (ms) is still sluggish; consequently, the workstations and web sites are sluggish and mail delivery is slow. The storage array is still working to incorporate the replacement disk and so its performance is degraded - and that affects everything which stores data there.
There may be brief service interruptions (of up to five minutes) if I decide to shift load to another server.
February 17, 2011
The ms server and the workstations have been agonizingly slow (at best) since about 8:15 this morning. A disk on our main storage array failed and the array was hobbled (in "degraded mode", for those who follow these sorts of things). We do not yet know why performance was as miserable as it was - it should have been poor, not horrible.
There were three interruptions of five to ten minutes as I sought the cause of the problem - working on the invalid assumption that it was our server again.
The disk has been replaced and the storage array is rebuilding itself. Performance is going to be poor until the rebuild is complete.
Workstations may need to be rebooted if they have got confused over the state of the links to the home directories (though I have forced a refresh remotely on all systems which were responding).
February 16, 2011
I stated in an earlier post today that "some web sites were partially down". I've had some questions about what that means, precisely.
All web sites hosted on ms.mcmaster.ca were down from 4:30 to 6:15 yesterday evening.
From 6:15 to 9:00 pm, many pages on the main math web site (the official-looking blue pages) were failing; other sites (e.g. iidda.mcmaster.ca, mathmail.mcmaster.ca) were OK, as were personal and course pages on www.math.mcmaster.ca.
From 9:00 pm yesterday to 9:45 am today, all www.math.mcmaster.ca pages were working from on campus and from VPN connections, but not from off campus. As of 9:50 am today, things were back to normal.
You may notice that some mail is arriving later than expected or in the wrong order. That's because mail which could not be delivered earlier when the server was busy or down was held upstream for a few hours before delivery was attempted again.
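For the curious, here is a minimal Python sketch of why held mail can turn up out of order. The three-hour retry interval, the sample messages and the function names are illustrative assumptions, not the actual settings of the upstream mail relay; only the outage window (4:30 to 6:15 pm) comes from the timeline above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    name: str
    sent: float  # hour of day (24-hour clock) at which the message was sent


def delivery_time(sent, down_start, down_end, retry_interval):
    """Return the hour at which a message is finally delivered.

    If the server is up when the message is sent, delivery is immediate.
    Otherwise the (hypothetical) upstream relay tries again every
    `retry_interval` hours until an attempt lands outside the outage.
    """
    t = sent
    while down_start <= t < down_end:
        t += retry_interval
    return t


def arrival_order(messages, down_start, down_end, retry_interval):
    """Return message names sorted by the time they actually arrive."""
    timed = [
        (delivery_time(m.sent, down_start, down_end, retry_interval), m.name)
        for m in messages
    ]
    return [name for _, name in sorted(timed)]


# A and B are sent during the outage; C is sent just after it ends.
messages = [Message("A", 16.75), Message("B", 18.0), Message("C", 18.5)]
order = arrival_order(messages, down_start=16.5, down_end=18.25, retry_interval=3.0)
print(order)
```

In this sketch the message sent last is delivered first, because it was never held, while the two held messages wait for their next retry.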
Ok - that was no fun. My clever-clever hop from one piece of hardware to another yesterday evening went from bad to worse: server performance was periodically horrible and some web sites were partially down.
We're now back to running perfectly well and normally on some borrowed hardware while I get this sorted out ... "this" being "being able to swap server hardware quickly and without significant downtime, frustration and grey hairs".
We will try the switch again in a few days - most likely Saturday afternoon.
Note that there is no worry of data or mail loss.
February 15, 2011
The half-hour of downtime scheduled for 4:30 this afternoon extended to nearly two hours: a theoretically routine hardware switchover wasn't. The upside is that we learned some new things about iSCSI storage arrays. The downside was ... well, two hours of downtime.
I am having a very unexpected problem with the web server: the main www.math.mcmaster.ca is failing, though other sites on the same server (mathmail.mcmaster.ca, wiki.math.mcmaster.ca, iidda.mcmaster.ca), personal sites (www.math.mcmaster.ca/matt etc.) and course sites (e.g. www.math.mcmaster.ca/S1cc3) are all fine.
The server and most of the workstations (which access home directories on the server) are sluggish this afternoon due to an as-yet unidentified cause.
Some services (e.g. web, email) may go down for a few seconds at a time and workstations may freeze for up to half a minute while I do some poking around.
I'm going to take the main server off-line for about half an hour this afternoon starting at 4:30. I've been trying to keep the downtime required for this upgrade to a minimum and to off hours, but as time is pressing, we're going to have this daytime interruption.
Workstation, printing and email access will be shut off during most of this period. I will keep the web sites up for as much of the period as possible.
February 13, 2011
The downtime early Sunday afternoon lasted a little longer than I expected and was a little downer than I expected: all systems served by ms were down from ca. 2:15 pm to 3:00 pm. (Mail and web were intermittently down between noon and 2:00 pm.)
Everything is now back up.
Most workstations will probably need to be rebooted in order to work properly.
February 12, 2011
I still have a little more testing to do before finalizing some server upgrades. I will be taking services off-line between 11 am and 1 pm on Sunday. Web sites will stay up (read-only) with only very brief interruptions. Workstation, email and printer access will be down for five to 30 minutes at a time during this period.
February 11, 2011
I didn't finish the update work during the downtime scheduled for Thursday afternoon - nor was there any downtime to speak of. I will be taking services off-line between 3pm and 5pm on Saturday. Web sites will stay up (read-only) with only very brief interruptions. Workstation, email and printer access will be down for five to 30 minutes at a time during this period.
February 9, 2011
Firefox 3.6.13 and Thunderbird 3.1.7 have been installed on the linux workstations. You will see a slightly confusing question about making these new versions your default when you next run them.
This may seem obvious to some and hardly worth mentioning, but I mention it because it is demonstrably not obvious to others.
Now and again I notice people standing at the HH-303 printer, staring and wondering where their printouts are. In most of those cases, the printouts were in the racks immediately to the left of the printer.
The printout rack may seem like a layer of complication, but it's there to keep the small table area around the printer from turning into a mess of scattered paper in which print jobs become - inevitably - lost.
I'm going to be taking services off-line for about one hour on Thursday and Friday mornings so that I can complete some server work. The Thursday outage will start at 7:30 am and will affect workstations and mail intermittently; web sites will be largely unaffected. The Friday downtime will start at 7:00 am and will affect workstations and mail; web sites will be up most of the time.
February 7, 2011
SAS is temporarily unavailable because the only Sun Solaris system, redpine, has crashed. I am working to bring redpine back on-line. The department is reconsidering a linux license for SAS in the event that redpine isn't repairable (the cost for the linux version is something like twenty times that of our current Solaris license, thus the reluctance to retire venerable redpine).
The plan, at present, is for the department to make SAS available one way or another.