Recently in Servers Category
February 23, 2013
The recent cooling problem in the Hamilton Hall server room persists, I'm afraid.
In order to keep the temperature down, I've shut off the servers which were either idle or did not have any jobs with significant runtime on them:
These servers are still running:
Before I go so far as to shut the above servers down, I may start pausing jobs so as to reduce the heat output. I will let you know if your jobs are affected.
My priority is to keep ms.mcmaster.ca and the storage servers (and their many disks) healthy so that we have web, email, file and workstation services.
I will follow up with Facility Services on Monday morning.
December 17, 2012
The stats compute server, bayes, failed early this term. I've replaced this 2005 physical box (a Xeon with four 32-bit cores and 16Gb RAM) with a virtual server with eight 64-bit Xeon cores and 24Gb RAM. This new server is also called bayes and is basically identical to anatolius, the general-purpose compute server.
SAS will be available on bayes later on this week.
November 20, 2012
Anatolius is back up with eight processors, which is not the full complement but is better than four. I'm still sorting out license problems.
November 19, 2012
I'm taking anatolius down today in order to complete the upgrades originally scheduled for the 5th.
November 5, 2012
I will be taking anatolius down on Thursday morning in order to upgrade the underlying software. When it comes back up, it will have more processors available. I expect anatolius to be back up later on the same day.
If this will be a great problem for you, please let me know as soon as possible.
October 12, 2012
As of 11:00 pm, the storage array (which died this morning) has been recovered and the main file system is once again mounted on ms.mcmaster.ca:
- email should be responding normally;
- files in home directories are editable;
- workstations will allow logins.
A check of the file system revealed no errors, so we don't expect that any files were lost.
Incoming mail was only briefly interrupted; no mail should have been lost though some may have bounced. Web sites were up except for a few periods of a few minutes.
My thanks to Todd Pfaff for diving into the XML config files of the storage array when things got weird.
The disk array from the dead storage server is now running in the fail-over server. A priority over the weekend and early next week will be to bring the dead storage server back to life so that we have a live fail-over server again.
The file-storage array is still down but we are making progress.
I believe that I have identified the hardware which needs to be replaced and I have the spare part, but replacing it will involve disassembling much of the server and will take several hours.
Before beginning that process, I am - very carefully - attempting to bring up the disks from the production storage array in our fail-over system. If this is successful, then everything should be up this evening. If it does not work, I will replace the failed part on Saturday.
The storage-server problem and resulting service outages are described in this earlier message. Here's how things look for recovery.
My first priority will be to recover the primary storage server. My second priority is to prepare the backup storage server to take over for the primary one; unfortunately, that server was already being worked on and will now need a few hours' work to finish some upgrades.
If all goes well, we should have one of the two storage servers on line and all services working by the end of the day. It is possible that the current state (web up; email mostly up; workstations down) may continue through the weekend.
We continue to have server problems: our main storage array is not rebooting, and that array holds all of the home directories and most of the mail. Needless to say, I am working on it.
Here is the status of various services:
- web sites are up (after a ten-minute downtime)
- new mail is being received (for now)
- email can be read or sent via mathmail.mcmaster.ca and read via pop/imap clients, but ...
- you may not be able to delete or move mail from your inbox, depending on your client configuration
- mathmail.mcmaster.ca will allow you to delete mail but not move it
- spam filtering is turned off
- you can get to the contents of your home directory via sftp or Windows file sharing, but you cannot modify anything
- most Linux workstations will not be working properly
- printing from laptops or stand-alone office desktops works (though it didn't work early this morning or for a period mid-morning)
- the compute servers will work, but you'll need to copy any files you wish to work with to /1/home
There will be interruptions now and again today as I work to bring the files back on line.
There were/are two problems with the main server (ms.mcmaster.ca) this morning, both related to file systems. The result was that ...
- mail was not being delivered between ca. 5:00 am and 9:30 am
- messages could not be deleted or moved between mailboxes
- files in home directories could not be modified
The first problem was that the file system used for inbound mail had filled up rather suddenly, for reasons I will investigate shortly. The second is that the main file system (which houses the home directories) is now mounted read-only - I am still looking into that.
Web sites were not affected by either of these problems (most web files and the databases are stored on a separate file system).
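Both conditions described above (a file system that fills up, and one that ends up mounted read-only) can be spotted from a script. What follows is only a minimal sketch, assuming a Unix-like host with Python 3; the function name `filesystem_status`, the 90% threshold and the `/` mount point are illustrative choices, not part of the department's actual monitoring.

```python
import os
import shutil


def filesystem_status(path, full_threshold=0.90):
    """Report how full the file system holding `path` is and whether it is read-only.

    `full_threshold` is a hypothetical cut-off (fraction of capacity in use).
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    fraction_used = usage.used / usage.total if usage.total else 0.0
    mount_flags = os.statvfs(path).f_flag  # mount flags; Unix only
    return {
        "path": path,
        "fraction_used": fraction_used,
        "nearly_full": fraction_used >= full_threshold,
        "read_only": bool(mount_flags & os.ST_RDONLY),
    }


if __name__ == "__main__":
    # "/" is a placeholder; substitute the mount points you care about.
    for mount_point in ("/",):
        print(filesystem_status(mount_point))
```

Run from cron, something like this could give a warning before inbound mail stops being delivered, though it says nothing about why a file system filled up or was remounted read-only.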
August 27, 2012
Anatolius is running with only four CPUs right now, while I sort out a licensing problem with the underlying VMware installation. I expect to have it fixed this week.
July 10, 2012
Bayes is down with a hardware error (two, actually); I am awaiting delivery of replacement parts.
June 5, 2012
We have a new general-purpose computation server for the department, anatolius. Anatolius - named for the same St. Anatolius of Alexandria, the ancient mathematician and philosopher who inspired the suspended statue in Hamilton Hall - is a virtual server whose configuration will change as resources become available. At present, the server has ...
- Processors: eight Intel Xeon X5650 @ 2.67GHz
- RAM: 24Gb
- Scratch disk: 80Gb
Software on anatolius includes ...
- R (with Bioconductor)
- Fortran77, Fortran90
In the next week or two, we will be adding another one to three virtual compute servers - with up to eight processors. By the end of the summer, anatolius should have twelve to sixteen processors.
Bayes - an eight-processor Xeon MP with 16Gb RAM - is still available for stats use, but it is not nearly as fast as anatolius.
February 18, 2012
Bayes is back up and new fans have been ordered. For the time being, bayes will throttle itself down to about 1/4 CPU speed if it gets too hot, but it can still be used.
January 23, 2012
I will be taking bayes off line on Wednesday afternoon from 2:00 pm until at least 4:00 pm in order to diagnose a hardware problem and upgrade the operating system.