Recently in System Announcements Category
April 12, 2012
April 26, 2011
April 18, 2011
Gosset did not come up properly after Friday's power problem; we are investigating. It may be several days before we get it back up. http://www.math.mcmaster.ca/blogs/archives/computing_news/2011/04/serverpower-pro.html
April 5, 2011
Because ms.mcmaster.ca has moved between buildings (from ABB to HH), it has been given a different IP number (i.e. network address). You should remove the old entries for the server from your ssh host-key file in order to avoid dire warnings of "Offending keys".
ssh-keygen -R ms
ssh-keygen -R 220.127.116.11
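For the curious, here is roughly what ssh-keygen -R does for plain (unhashed) entries, sketched in Python. The function name, the sample lines and the 192.0.2.10 address are made up for illustration; hashed entries (lines starting with "|1|"), wildcard patterns and marker lines are simply passed through, so use ssh-keygen -R for the real clean-up.

```python
def drop_host_entries(known_hosts_text, host):
    """Return known_hosts text without the lines whose host field names `host`.

    Illustrative only: handles plain (unhashed) entries; hashed "|1|..." lines
    and "@cert-authority"/"@revoked" marker lines are passed through untouched.
    """
    kept = []
    for line in known_hosts_text.splitlines():
        stripped = line.strip()
        # Keep blank lines, comments, marker lines and hashed entries as they are.
        if not stripped or stripped.startswith(("#", "@", "|")):
            kept.append(line)
            continue
        # The first field is a comma-separated list of host names/addresses.
        hosts = stripped.split()[0].split(",")
        if host not in hosts:
            kept.append(line)
    return "\n".join(kept) + ("\n" if kept else "")


# Hypothetical file contents: one stale entry for ms, one entry for bayes.
sample = (
    "ms,192.0.2.10 ssh-rsa AAAAexamplekey1\n"
    "bayes ssh-rsa AAAAexamplekey2\n"
)
cleaned = drop_host_entries(sample, "ms")
print(cleaned, end="")
```

Only the bayes line survives; the next ssh connection to ms would then prompt you to accept the server's new key.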
February 25, 2011
We lost power for two minutes just past 5 o'clock this morning. The outage was planned and announced by Facilities Services, but the announcement did not mention that Hamilton Hall would be affected.
Most computers will simply have restarted when the power returned; the odds of damage to computers and monitors are slight (though not zero). Note that the outage did not affect the compute servers or ms (the main file/email/web server).
December 16, 2010
Anatolius, the small but highly available compute server, will be unavailable until the start of January :|
November 30, 2010
Things are back to normal; reboot your workstation (alt-ctrl-F1 then alt-ctrl-del) if things are weird for you.
We are having a problem with our server infrastructure today: late this morning this resulted in workstations freezing for 30 seconds or so two or three times; over lunch, the server had to be restarted; and early this afternoon some workstations are unable to access home directories.
We are working on the problem.
Mail delivery, web sites and Windows file sharing are not affected.
November 11, 2010
The flakey (though new) storage server continues to be flakey and will not stay up long enough for us to get the file updates to the fail-over storage. We have disabled logins and email for the next hour or so.
Web sites remain up using a different file server - though changes made this morning are not reflected as we are using last night's backups.
While our new server is stable, we are having repeated problems with a borrowed storage server: it crashed yesterday afternoon and again this morning, taking email, web sites and the workstations down with it.
As we speak, we are getting a fail-over system ready ... two, actually. Workstation and mail performance will suffer while we are copying data from the current system.
There will be brief periods of downtime without advance warning so that we can take the unreliable storage system out of play as soon as possible.
Note that you can subscribe to Computing News blog entries to keep abreast of service announcements - see the SUBSCRIBE VIA EMAIL in the right-hand column.
October 15, 2010
We introduced some changes to the main file server last night and we're still in a shake-down period this morning. Your ms linux workstation may need to be rebooted and the systems have been sluggish. We're working on things - they should, in fact, be much better as of about 11:00 am.
October 14, 2010
Authentication, printing and Windows-file-sharing (smb) services on mathserv were turned off this morning. For information on using the new server, ms, see the blog entry "Server Upgrades: Things You Need to Change".
September 23, 2010
Mathserv is not gone yet, but the doors are closing one by one. SSH/SFTP to mathserv are now blocked; instead, please use ms.mcmaster.ca.
September 18, 2010
The new server is now handling most of the services formerly handled by mathserv.
Most people have either switched to using ms.mcmaster.ca or are using aliases which now point to the new server. But a few people are connecting directly to mathserv.mcmaster.ca for mail, printing or file access. If you are one of those people, I'll be emailing you directly, asking you to move over and change your configurations (or habits) as described in the earlier blog entry, "Server Upgrades: Things You Need to Change".
September 16, 2010
We might have missed installing your favourite application during the workstations upgrade. If you can't find something you need or if something appears to be not working right, please email firstname.lastname@example.org.
We are upgrading the workstation operating systems to Mandriva 2010.1 over the next few days.
During the upgrade process - which takes about 30 minutes - your computer will reboot and spend most of its time sitting on a black login screen. Don't log in at this point.
Once the upgrade is complete, your computer will reboot a second time and come up with a plain, blue login screen (i.e. without the DNA graphic which was there before). At this point you can log in.
Following the upgrade, you should find that your workstation is more responsive and slightly cuter.
At this point, we are only upgrading systems which don't have anyone logged into them. We'll announce a plan to deal with stragglers next week.
September 15, 2010
Mathserv will be going down for extensive upgrades on the morning of Tuesday, September 21st ... after which it will no longer be mathserv. Please make sure that you are using ms.mcmaster.ca - the new server - in place of mathserv.mcmaster.ca for ssh/sftp, pine, mail clients, etc. before then.
The wikis at wiki.math.mcmaster.ca will be moving to the new server today (Thursday). There will be several interruptions of a few seconds to a few minutes. I recommend that you avoid making updates today until I announce (on this blog and at wiki.math.mcmaster.ca) that the move is complete.
September 13, 2010
We are in the process of putting our new admin server into production: web, email, wiki, file sharing, etc. will be moving from the current server, mathserv.mcmaster.ca, to the new server, ms.mcmaster.ca. The new server - together with some configuration and file-server changes - will speed some things up immediately and allow us to expand and improve other things in the coming months (things = web, wiki, mail, workstations, etc.).
Over the next week, there will be a number of brief interruptions to individual services (mail, web, wiki, file-server access) as well as a one- to two-hour shutdown of email and workstation access. There will be a few more brief and short-term interruptions over the next two months as we increase the size and speed of our file servers.
The brief interruptions - that is, between a few seconds and a few minutes - will not, in general, be announced; I will post/email announcements about extended downtime.
Mathserv runs dozens of web sites and other services. We tested the major components on the new server ahead of time, but we're certain to have missed something. Please email email@example.com if you come across anything weird or wonky.
September 10, 2010
Our main server blew a disk this morning and is struggling while a spare is built into the main storage array. In order to allow the array to rebuild more quickly, I will be turning off mail services for up to half an hour at a time. Other services (workstations, Windows file sharing) may also be interrupted.
I will probably leave the interface at mail.math.mcmaster.ca up all the while, though.
September 9, 2010
Some people are having problems logging into their linux workstations as of yesterday: after logging in, the desktop is blank and there are no menus or icons. Not everyone is affected and I don't know the source of the problem yet.
You can work around the problem in the meantime by choosing the KDE desktop from the Session menu on the login screen.
August 3, 2010
Bayes has crashed three times since late Friday night. We are investigating but have not yet isolated the problem. It's up now and you can use it, but I wouldn't bet serious money on it not crashing again.
July 19, 2010
The Math & Stats servers will be down from 9 AM - 11 AM on Thursday, July 22nd while we install new equipment in the server room. All of the ms workstations will be down, the computation servers will be turned off, and email will be unavailable (mail sent to our server should simply be delayed). Note that a read-only version of the www.math.mcmaster site will be up on a backup system during the downtime.
May 20, 2010
I've updated the security certificate on the primary Math & Stats web server (www.math, mail.math, wiki.math, etc.). Some people will stop seeing warning messages; most people should see no effect. But if you are asked about a new certificate, simply accept it.
May 10, 2010
One of our file servers is acting up and workstations are intermittently slowing down or temporarily freezing; accounts starting with a to l are most severely affected. Web and email access are affected to a lesser degree. I am working on the problem. I may have to reboot the problematic server later on this morning.
I believe that most workstations will start working again without having to reboot.
February 8, 2010
We are running with two file servers again and last week's performance strain should be over. Anyone whose username begins with m-z who was logged into one of the ms workstations before 8 am today should log out and back in (or press Alt-Ctrl-Bksp) to avoid session instability.
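If you're not sure whether that applies to you, the rule is just a check on the first letter of your username. A tiny sketch in Python (the function name is made up for illustration, and the m-z range is simply the one quoted above):

```python
def needs_relogin(username):
    """True if the username starts with a letter from m to z (case-insensitive)."""
    first = username[:1].lower()
    # An empty username gives first == "", which falls outside the range.
    return "m" <= first <= "z"


print(needs_relogin("moylek"))
print(needs_relogin("alice"))
```

The first call reports True (m falls in the m-z range) and the second reports False.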
February 5, 2010
As the second file server was already down and we are failed over to a single server I'm taking this opportunity to upgrade the size and speed of the server's main file system (originally planned for next month). This means that workstation and web-site performance will be sluggish until Monday morning.
Workstations and email (but not most web sites) will be down from 7 am to 8 am next Monday while I bring the second file server back into production mode.
February 3, 2010
- instead of firefox, use epiphany (see the Internet menu) or opera (opera at command line or Alt-F2)
- instead of thunderbird, open a terminal window and run pine to read email
February 2, 2010
Since we failed over to the single server some workstations are losing access to the home directories now and again - most applications will give a semi-sensible warning to the effect that your home directory can't be found. If you wait for no more than one minute, you should find that your home directory is accessible again.
I will look into this further when both servers are fully on line again.
February 1, 2010
We are still running on one server instead of two and workstation and website access is still slow. I hope to return half of the load to the second server on Tuesday morning. Note that there may be brief interruptions to mail client access between now and tomorrow morning.
As we suffer the sluggishness of running on only one (six-year-old) server, I might mention that we have a new primary file/mail/web server on order and should have it installed and running in the next month or so. The new server will not only be faster but will allow us to arrange for much faster failover in the case of problems. We will be scheduling several hours of downtime in order to move to the new server and will give you plenty of notice.
One of the two main file servers was found to be having trouble at 7 am today. We are working on the problem. Mail has been turned off for now; web and workstation access will be interrupted for half an hour or so.
October 1, 2009
Some people were having trouble with access to their home directories (or logging in) from their workstations this afternoon; I believe that the problem is resolved.
August 27, 2009
Workstation and mail were back on line as of 5:20 pm following the performance updates. Workstations may be sluggish for another hour as some system work continues in the background. You should reboot your workstation if you see anything strange, though it may not be necessary.
July 16, 2009
Contrary to my note yesterday, I will be shutting down the compute servers before the power outage (A/C will be off in the server room and we need to reduce the chance of overheating the room).
July 15, 2009
June 25, 2009
There was a ten-second power outage this morning. The servers stayed up but all workstations (except the few on battery backup) went down.
June 3, 2009
We are still running on one server instead of two while the recovery of the large main disk array continues. Workstations will be a bit sluggish at times. We should be back on two servers some time Thursday.
June 2, 2009
Workstation access is still being restored; most will be ready by 10:00 am.
June 1, 2009
I have declared the second failed disk in the main data array officially dead after following a few false leads. Any mail received and any file changes between 4:30 am and 10:15 am are irrecoverably lost.
We are now running with the backup of the home folders on the fail-over file server (which is actually mathserv, the mail/web server).
Mail is flowing again as of 5:20 pm. Access to mail clients was opened at 5:30 pm.
Workstation access will be down until Tuesday morning.
Mail, web and workstation may be slow Tuesday while I get the main file server into full service.
Note that all web sites are back up after a brief interruption. Web sites under home directories (e.g. www.math.mcmaster.ca/~moylek) are available read-only from the backup server and so cannot be modified.
A second disk failed in the main data array at ca. 10:45 this morning. I am going to be taking the file server down to investigate. Workstation and mail will be down; most web sites will stay up. I will post an update before noon.
A disk failure on Sunday evening has left the main data array running slowly and slowing down workstation access while the array is rebuilt with a spare disk. I may be deactivating imap access to mail periodically to relieve load.
February 10, 2009
It seems I only post bad news. Here's some good news: I upgraded the capacity of our backup server on the weekend and we have plenty of room for our growing data. Presumably no one (except my wife) noticed.
February 3, 2009
Bluespruce has crashed and won't boot back up. We are investigating.
February 2, 2009
The ms workstations went wonky/hangy for five minutes late this afternoon; one of the servers didn't take well to a performance tweak (the tweak is now untwuck).
December 14, 2008
The workstations are able to connect to the file server as of 4:30 pm. All major services are now fully operational as far as my testing shows. Things will be slow this evening while the file systems are being rebuilt, though. Send us email if you see any problems.
Mail services are back on line and ssh logins are no longer read-only. There will be a delay with workstation access while a file-system problem is corrected.
While the primary file server is being upgraded, the following are up:
Mail delivery, webmail, and imap/pop mail are down until the file server comes back up.
The announced system downtime has been pushed forward a bit and the systems will go down at 1 pm. Web service will come back shortly thereafter and other systems about an hour later.
December 12, 2008
While all systems are down due to the network upgrade this Sunday I will be upgrading hardware and software on our primary file server. The file server, email access and workstations will remain down for about an hour after the network comes back up; most web sites will be accessible immediately.
December 8, 2008
The ms-workstations went pretty much unresponsive for about two minutes mid-morning and for about ten minutes late this afternoon. These hiccoughs are related to the recent weekend crashes and my attempts to ameliorate things. You may see similar, brief problems again this week, though I am, of course, trying to keep interruptions to a minimum. Your patience as we try to sort out this server problem is appreciated.
If your workstation stops responding or gives strange errors this week, please wait five minutes before rebooting - it will very likely come back to life with all applications and windows still open.
December 7, 2008
The workstations and mail are functional again as of 10:30 am (other services were up earlier or didn't go down at all).
Our primary file server face-planted early Sunday morning. Email is down but web service is restored as of 9:20 am. All services should be up by 10:00 am.
Efforts to determine the elusive cause will be intensified this week.
December 3, 2008
Workstations went wonky for a few minutes late this afternoon. Mea culpa - I introduced a network error which affected most workstations while fixing another problem.
November 28, 2008
Mathserv was rebooted just after 3pm today (later than announced) but was only down for two minutes. All services are back to normal now.
I will be rebooting mathserv at 2:30 pm today to sort out some lingering problems. Web, email and workstations will be down for about ten minutes.
Please don't reboot your workstations; they will freeze when the server goes down and should return to life when the server comes back up.
Systems were down for half an hour late this morning because the servers seized up. Everything is back on line as of 11:52 and we are investigating.
November 23, 2008
The primary server was effectively unresponsive due to a network problem with the file server. The servers and systems are responding normally as of 3:15 pm.
November 9, 2008
The file server is now fully operational and workstation access has been restored.
The primary file server crashed early Sunday morning. As of noon Sunday, it is running again and the main server is now serving mail, web, etc. Workstations are still down while one of the file systems is rebuilt; full access should return early this afternoon.
November 2, 2008
In addition to the expected network downtime this morning, we had two outages on departmental servers: one of the two file servers crashed early Sunday morning and restarted a little after noon; the web server was down for twenty minutes on Sunday afternoon.
October 31, 2008
Most of the linux workstations came up fine on their own after the power outage. The main servers and internet connections stayed up on backup power, so there was no disruption to email or web services.
October 29, 2008
We took advantage of the power outage to do some extensive server maintenance, much of which would be difficult to do when the systems are live. Workstation access was available this evening at ca. 6 pm and email and other systems at about 8:30 pm. Workstation, web and even email users should all see significant improvements in response times.
The power is back on in Hamilton Hall but the server and systems are still off-line while I complete some opportunistic maintenance and upgrade work.
The power will be out in Hamilton Hall until 5 pm today. I am going to be taking everything but the web server off line shortly.
September 9, 2008
Mathserv was partially unresponsive from ca. 4:30 to 6:45 this evening: some things (printing, parts of the web site, existing shells, some workstations) were up, other things were effectively dead. I restarted the server at 6:45 and as of 6:55 all systems are functional again.
July 23, 2008
Mathserv is back up as of 5:12pm after the announced downtime for a memory upgrade.
July 18, 2008
Bayes was rebooted at ca. 3:30pm today in order to clear up a memory problem.
May 6, 2008
The systems were back up on Monday morning just after 9 o'clock following the scheduled power outage by Facility Services on Sunday evening. Sunday's backups were caught up on Monday night.
May 1, 2008
Mathserv came alive again at 9 am. The other servers and workstations are able to boot as of 9:10 am. Don't forget that we do this again Sunday evening through Monday morning.
April 30, 2008
As described in my earlier posting, I will be shutting down all servers before the power outages tomorrow morning and Sunday evening. I will also be scheduling shutdowns of the stand-alone desktop linux workstations which we manage, including the grad-student/post-doc Dell GX 270s. I strongly recommend that you shut down your Windows or OS X workstation before the power outages.
I am going to disable all backups but mathserv tonight since we won't have time to back up all systems.
Workstations and servers will be down or unavailable to some degree four times in the next week:
- Thursday from 5:00 am to 8:30 am due to a power shutdown by Facility Services
- Friday from 3:00 pm to 5:00 pm while I do fail-over testing
- Sunday morning from 7:00 am to 8:00 am due to network work by UTS
- Sunday evening / Monday morning from 6:00 pm to 8:30 am due to a power shutdown by Facility Services
UTS and FS, like RHPCS, have scheduled disruptive work for the end of the exam period, it appears.
April 24, 2008
I'm going to be renaming each of the standard linux workstations used by grad students and post docs in the next week or so. The current names are msx, where x is a prime number: ms002, ms003 ... ms587. These names are short and slightly cute, but most people don't know what their systems are called when asked by the sysadmins; everybody seems to know the location of their desks, however. The new names will be based on building, room and desk number, for example ms-hh-303-04.
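To make the two naming schemes concrete, here is a small Python sketch. The helper names are made up for illustration, and the zero-padding widths are inferred from the examples above (ms002, ms-hh-303-04) rather than taken from any actual renaming script.

```python
def is_prime(n):
    """Trial-division primality test; fine for three-digit numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def old_names(limit):
    """Old-style names: 'ms' plus a prime, zero-padded to three digits."""
    return ["ms%03d" % n for n in range(2, limit + 1) if is_prime(n)]


def new_name(building, room, desk):
    """New-style name built from building code, room and desk number."""
    return "ms-%s-%s-%02d" % (building.lower(), room, desk)


print(old_names(12))
print(new_name("HH", 303, 4))
```

The first call lists ms002, ms003, ms005, ms007 and ms011; the second produces ms-hh-303-04, the example given above.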
I am going to be upgrading the standard linux workstations from Mandriva 2006.0 to Mandriva 2008.0 in early May; you'll be able to tell that your system has been upgraded by the change to the login screen. The new version is very much like the current one, only with updated applications, some interface simplifications, and a general increased shininess. Email firstname.lastname@example.org if you discover anything to be missing.
I will be upgrading a handful of workstations in late April in preparation for the roll out; I will let you know ahead of time if yours is to be upgraded early.
I will be taking mathserv down on Friday afternoon at 3:00 PM for about one hour in order to test our new emergency fail-over procedures. Web, printing, file-server and workstation access will be up and down during this period; incoming email will be on hold for the whole period.
April 4, 2008
The cooling systems came back on at ca. 6 pm last night and the server room was cool again this morning. The compute servers are all running again as of 9 am.
February 12, 2008
Mathserv, bayes, bluespruce and gosset will be rebooted on Wednesday afternoon between 4pm and 5pm in order to complete an important security patch. The linux workstations will freeze when mathserv goes down and then come back to life when it comes back up - about a five minute span. Of course, if there are problems, the systems will be down longer - perhaps 30 minutes.
January 28, 2008
Some bayes & bluespruce users will have seen messages like the following when ssh'ing in after the recent upgrades: "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!", "IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!".
These messages were due to new ssh host-identification keys being installed during the upgrades; now that I have reverted to the previous keys, the messages should go away - except for people who accepted the new keys, who will now want to clear them again with the command ssh-prune bayes or ssh-prune bluespruce.
January 24, 2008
Bayes and bluespruce are now upgraded and back on line. Note that R isn't yet working on either; the latest version will be installed on Friday.
January 21, 2008
Bayes and bluespruce will be down most of the morning of Thursday, January 24th so that I can upgrade the operating systems. Please let me know if this presents a problem for any long-running jobs.
NB: these upgrades were postponed from January 10th.
November 29, 2007
Mathserv came back up at 4:36 after the scheduled reboot and is now running an updated kernel; total down time was about two minutes. Thanks for your patience - we're now running much more stably and quickly.
Mathserv will be rebooted at 4:30 pm today in order to implement a kernel upgrade (intended to address the cause of the crash this morning). The reboot should take about ten minutes; workstations will freeze during the reboot and then start working again with all programs running after the server comes back up.
Mathserv crashed and was down for about five minutes at ca. 11:40 am today. This crash appears to involve memory and is not at all related to the problems of last week (we are running a different server now than we were then). We are investigating.
November 27, 2007
More updates to the original post-recovery status posting:
- Webmail has been configured and tested;
- the linux workstations were unable to boot from Monday evening through late Tuesday morning (already booted workstations were OK - sort of);
- a rash of spam got through early Tuesday morning; the mail was being processed and scored, but it was scoring just below the threshold for spam (i.e. this was not related to any server problems).
Links to seminars, courses and personal info on http://www.math.mcmaster.ca are working again as of Monday afternoon.
The linux workstations became very slow on Monday afternoon and again this morning - in both cases as the number of workstations in use increased past a threshold. The problem is fixed and the machines are responsive again (following some performance adjustments to the server configuration).
November 26, 2007
Mathserv has been replaced with the fail-over server and most services are running again as of Monday morning.
Services not working or still to be tested
In summary, mathserv and dependent systems were slow on Thursday and Friday due to a double disk failure. Mathserv (and so the web sites and email and workstations) was down Friday evening and up on and off on Saturday until I replaced the server on Saturday evening. Spam filtering wasn't working until Sunday at noon. The network workstations started working at 9:30 on Monday morning.
Normally, our fail-over server would come to the rescue within hours, but the whole process was complicated by a problem with the fail-over server and the sheer volume of data stored on mathserv. Some of these problems are easily fixed. We still have the problem of the volume of data now nearly overwhelming our transfer capacity; it will take some thought, time and probably some money to overcome this limitation.
Though we lost productive time and perhaps some in-bound email routed through univmail, no mail or data was lost from mathserv or the linux workstations.
And for those who are interested, a more detailed description of the server drama follows.
November 25, 2007
All of the mail and user data from the former mathserv has finally been copied to the new mathserv, though the former crashed five times in the process. I have enabled logins and access to mail.
I've not yet reviewed all systems: the linux workstations will probably not work yet, and I've not yet updated or tested web mail (http://mail.mcmaster.ca).
I will check on mail, web and ssh access on Sunday. I will look at the linux workstations and the rest of the mathserv services on Monday.
November 24, 2007
Mathserv crashed about an hour after it was brought up on Saturday morning. After two more crashes this afternoon, I've given up on it and have swapped in the fail-over server.
I am now in the process of recovering the rest of the data (mail, changes to user files) from Friday and Saturday morning from mathserv's disk array; until I have finished this process, you will not be able to log in. I have already recovered all of the inboxes, so the new mathserv is accepting new mail and I will make those inboxes available (at least read only) as soon as possible.
Mathserv is back on its feet as of 9:15 am today. The disk array has been repaired and the faulty hardware replaced; no data was lost.
The web sites are up already. Mail delivery will be brought up shortly. The msprime linux workstations will be brought on line later today, after backups have finished running.
November 23, 2007
Mathserv was to go down for hard disk replacements this morning at 7:30. I have deferred this work to Monday morning at 7:30 because Thursday's backups were not finished in time.
Mathserv is still hobbled by the bad disks and so susceptible to the same strain and slowness that we felt yesterday. I am going to be moving some load to the failover server in order to mitigate the effects on mail and the linux workstations.
November 22, 2007
Mathserv is still very slow. The problem will persist until I shut the system down to replace two bad disks on Friday morning at 7:30.
All of today's problems and the general slowness of the past week or so were due to first one disk in the main array failing last week and then another one failing yesterday*. It ends up that a degraded RAID array is far more of a drag on system performance than I had expected**.
More details in the full article.
Mathserv is very, very slow this morning - partly a consequence of the hardware problem which I plan to fix on Friday morning. Most services - web, email, file access - are slow; a few - most importantly workstation booting - are down. Email may be down on and off until further notice. I am going to try to get things back up with minimal interruption, but I may have to take mathserv off line today.
November 21, 2007
Mathserv will be down between 7:30 and 8:30 on Friday morning in order to replace a bad disk in the main array.
Yuck. Mathserv was down for longer than half an hour - it went down at 4:45pm and was back at 7:20pm. Everything is up, mail is flowing again; some workstations may need rebooting. I've worked around the hardware problem but will have to schedule some downtime in order to fix it properly - possibly next week.
I will be shutting mathserv down at 4:30pm today in order to address a hardware problem. It should be back up by 5:00pm.
Mathserv is incredibly strained today - and has been to a lesser degree on and off since last week - and consequently the workstations have been painfully slow at times. The problem appears to be due to imap mail access and so imap mail will be unavailable at times this afternoon and possibly tomorrow. You can still read mail via pine or http://mail.math.mcmaster.ca.
August 9, 2007
Mathserv went down as scheduled at 4:34. The server, web sites, email and workstations were all operational again by 4:37.
I will be restarting mathserv today (Wednesday) at 4:30pm in order to sort out a performance problem. Web, email and msprime workstation access will be interrupted for ca. five minutes.
July 9, 2007
Bayes is back on a UPS power source (it will stay up through short-term power outages); freesurface and bluespruce are still on an unprotected power source.
July 4, 2007
There is a problem with the UPS which provides power to bayes, bluespruce and freesurface. I'm afraid that you must consider these systems unreliable until further notice.
June 11, 2007
June 8, 2007
UTS has announced network interruptions from 7:00 am to 8:30 am this Sunday. Web and email will be interrupted and the networked workstations may freeze.
May 31, 2007
May 30, 2007
Mathserv was rebooted at 3:15 as the network problem got rapidly worse. The 4:30 reboot should not be necessary now as the network problem and resulting software errors are resolved.
I will be rebooting mathserv at 4:30pm today to sort out a network problem. Email, the web sites and the linux workstations will be unavailable for ca. ten minutes.
May 29, 2007
UTS has tentative plans to upgrade the Hamilton Hall network on Sunday, June 10th between 7:00am and 8:30am. Email and web will be inaccessible during that time.
I will make another announcement once UTS confirms the date and time.
I will be taking mathserv down on Friday afternoon at 4:00 pm in order to replace faulty hardware. I hope to have it back up within fifteen minutes, but the work may take longer.
Web, email and the msprime workstations will be inaccessible while mathserv is down.
May 24, 2007
Mathserv was unresponsive for about five minutes at 3:45 pm today - a network tweak gone wrong.
May 22, 2007
Physical Plant has scheduled a building-wide power outage for ABB from Saturday May 26 6pm until Sunday May 27 2am. We will be shutting down computer systems in ABB for this period. This means that the Math & Stats backup server will be down and there will be no backups of the servers and workstations on Saturday night.
Just be extra careful about deleting and changing files on the weekend. We expect that backups will begin again on Sunday night.
April 13, 2007
Some parts of the departmental web site will be unavailable for up to a minute at a time now and again on Friday and Monday while I sort out a database problem on the server. The parts affected will be those which draw from the departmental database: the directory of department members, course listings and seminar notices, primarily.
April 9, 2007
I rebooted mathserv at 4:10 - it was back up three minutes later. Sorry for the short notice - a (mild) emergency related to last week's upgrades.
April 6, 2007
There were some expected and some unexpected (as one might expect) problems in the wake of the Wednesday-evening upgrade of the server. Some things have not been brought up or are broken on the new server:
The web interface at http://mail.math.mcmaster.ca is not working. I plan to upgrade from SquirrelMail to either a slicker application or at least a new version of the same software. Use a mail client or pine for the nonce.
SSL (https) connections are giving a warning as I haven't updated the security certificates yet.
April 5, 2007
The new mathserv is now in place. We're working out some kinks, as many have noticed. Here's the good news:
The updated server came on-line last night at 9pm. Some things that worked well during testing didn't work so well in production - printing is the biggest outstanding problem. Details follow; please email email@example.com if you encounter any other problems.
Web-Server and Database
There were problems with the conversion to the new version of the database server - the people and course pages didn't display details in the pop-ups correctly. This is fixed as of 11am today.
Spamassassin was not working for ca. one hour; you may see a rash of spam from late last night in your mailbox.
Printing from Linux Workstations
Printing works fine from mathserv and from Windows and OS X desktops and laptops (with a few exceptions which I cannot yet characterize). Linux workstations are only printing banner pages. Until we sort this out, you can print as follows:
April 4, 2007
As mentioned last week*, mathserv will be taken down at 4pm this afternoon. Expect mail, web, printer and workstation access to be unavailable for some three or four hours. That said, I will be bringing individual services back on line as soon as possible, so some things may be up before others.
There should be very little apparent down-time for the web server since I will be bringing up www.math.mcmaster.ca on the backup server at 2pm. Changes made to any sites after ca. 1:30 pm will not be reflected until later on this evening.
March 30, 2007
The server and workstations will go down for several hours on Wednesday at ca. 3:00. The web site will be up most of that time on a backup server.
March 28, 2007
Mathserv will be down for several hours starting at 4 pm on Wednesday, April 4th in order to perform some important system upgrades. Web sites*, email, network printing and network workstations (i.e. the msprime systems) will be unavailable while the server is down. Most services should be back on line by 8 pm.
March 12, 2007
Mathserv will be rebooted at 9:15 on Monday evening instead of at 7:00 on Tuesday morning.
March 9, 2007
[The following was emailed to all Math & Stats faculty, post-docs, graduate students, admin staff and visitors on March 9th, 2007. KM]
Four recent updates are posted on the departmental Computing News blog:
Reminder: Printing Multiple Copies
March 09, 2007
The problem with multiple copies no longer affects OS X systems using the new SMB print queues. The problem still exists on the linux workstations and servers.
Reminder: New Print Queues
March 09, 2007
A reminder that as of March 1st, Windows and OS X systems need to use the new SMB (Windows-file-sharing) queues to access the shared printers.
March 09, 2007
You should check the time on your Windows or OS X system on Monday to make sure that the DST changes took effect.
Check Backup Status
March 09, 2007
The Math & Stats servers, linux workstations and many office Macintosh systems are backed up nightly. You can check the backup status of your workstation on this page. Note that most of the linux workstations and servers use the central...
February 17, 2007
Update: mathserv is back up after ca. four minutes' downtime for the planned reboot on Saturday the 17th.
Mathserv will be rebooted at ca. 6:30 pm Saturday the 17th in order to sort out a network problem; it should be down for no more than ten minutes.
msprime systems will freeze up while mathserv is down but will pick up just where they left off when it's back on. If you are using your workstation at 6:30 pm, just wait until control comes back - there's no point to rebooting.
January 29, 2007
Mathserv was rebooted at 3:25pm today. It was down for ca. four minutes during the reboot following a ca. ten-minute period during which response was very slow.
The slowdown and reboot were due to the same network problem which forced a reboot in December and which will be fixed once the updated server is fully configured - I will have an announcement about some scheduled downtime for that switchover shortly.
December 20, 2006
I found out this morning that Physical Plant will be shutting off the air conditioning in the Hamilton Hall server room from 7:30 am to 11:00 am on Thursday. It may be necessary to shut down bayes, freesurface, bluespruce and space in order to prevent overheating.
December 15, 2006
Mathserv crashed and was down for six minutes at ca. 9:30 this morning. This does not appear to be related to the problem on Wednesday evening, which was a network and I/O strain which left mathserv up but almost unresponsive.
November 27, 2006
Mathserv was rebooted at 11:45 am today after it became unresponsive following some fifteen minutes of very high sustained network load. The server and services (email, Web, home directories, printing) were unavailable for ca. six minutes. Email will have been queued for delivery and most msprime workstations should have become responsive again when mathserv came back up.
I will be reviewing the logs to determine the source of the problem.
November 15, 2006
A few people have had problems with the settings of their desktop manager (i.e. KDE, Gnome, WindowMaker) after upgrading their systems to Mandriva 2006.0. The quickest solution is to reset your desktop-manager settings to the default.
The msprime workstations - i.e. the Dell GX270 linux workstations used by graduate students, post-doctoral fellows and some faculty members - have been running Mandrake Linux 10.1 for two years now. Mandriva Linux 2006.0 has been deployed and tested on a dozen or so systems for several weeks and is now ready for general deployment.
When you next reboot your workstation, it should come up running Mandriva 2006.0. If you find that any applications are missing, please send email to firstname.lastname@example.org.
November 3, 2006
Mathserv was inaccessible this morning from 8:20 am to 9:00 am due to a network problem; web, email and workstation access were all down during this period. All mail was queued for later delivery and workstations will have started working again without reboots once mathserv was back on the network.
Note that we know the source of the problem (which caused a failure in the Summer, too) and have been waiting on hardware replacement; the new hardware arrived this week and will be deployed soon.
October 10, 2006
As announced, mathserv was rebooted just after noon; total downtime was just under four minutes. I hope the experience wasn't too traumatic for anyone; my apologies for the short notice.
I will be performing an emergency reboot on mathserv at 12:05 today. Web, email and workstation access will be interrupted for about ten minutes.
The msprime linux workstations will lock up when mathserv goes down but will respond again as soon as it is up; you do not need to reboot.
August 25, 2006
Mathserv will be down part of the evening of Wednesday, August 30th in order to complete upgrade work which was postponed this past Wednesday.
Email and the workstations in HH and T13 will be unavailable while mathserv is down; a read-only copy of the web sites will be up on a backup server.
August 19, 2006
Mathserv will be down part of the evening of Wednesday, August 23rd. Email and the workstations in HH and T13 will be unavailable while mathserv is down; a read-only copy of the web sites will be up on a backup server.
Mathserv will go down at 6:30 pm. If all goes according to plan, it will be up again within the hour; it's quite possible, however, that the server and services will be down for two or three hours.
Mathserv was bogged down for about an hour on Saturday afternoon (between 4:30 and 5:30, roughly). Web access, logins and workstation responses were erratic or nil during that time.
August 4, 2006
Freesurface was rebooted this afternoon after it ground to a halt with I/O problems. This was a first for this server, which has chugged along quite reliably, often under heavy CPU load. I will be keeping an eye on it.
Total down time was ca. 20 minutes. Only one job appears to have been interrupted.
June 20, 2006
Spam filtering is working again.
Recall that spam filtering is not automatic; you need to activate it for your account. In brief: just put these lines into the file .procmailrc in your home directory:
### spam assassin
SPAMTO=Spambox # keep in Spambox
#SPAMTO=/dev/null # remove leading # to discard
### end spam assassin
More information here.
June 19, 2006
June 18, 2006
June 15, 2006
Bayes is now running Mandriva 2006.0. All software installed under the previous OS has been carried forward and should work as before. Please email email@example.com if you encounter any problems.
June 13, 2006
The bayes upgrade scheduled for Monday June 12th has been rescheduled for the afternoon of Thursday June 15th.
June 1, 2006
The MATLAB license servers are not allowing new MATLAB sessions due to an invalid license file supplied to us for a new toolbox today. We are working on it.
Update: working again as of 16:40.
I plan to take bayes off line for the afternoon of Monday, June 12th in order to upgrade the operating system. Please email me at firstname.lastname@example.org if this will be a particularly bad time for you.
May 18, 2006
Mathserv suffered a crash related to a networking overflow on Wednesday at 9:32 pm. It was brought back up by another analyst at 9:30 am this morning (I am away this week). This is the second such failure on campus in a week; foul play is a technical possibility and we are investigating.
All mail was queued for delivery once mathserv recovered. msprime workstations should not need to be rebooted. There is no evidence of data loss.
May 3, 2006
UTS has arranged to keep the network in Hamilton Hall up this Sunday. We will not lose Internet access to and from HH; I won't be moving the web sites to the backup system in ABB; email will not be interrupted.
April 26, 2006
We've been informed that Hamilton Hall will be disconnected from the network this Sunday morning:
Technology Services, Enterprise Networks, has scheduled a network service interruption for Sunday May 7th, to carry out maintenance on the fibre plant. This work is necessary in order to upgrade the networking in Residence (MacOnline) later this Spring.
April 24, 2006
March 10, 2006
I have reconfigured the msprime workstations in order to improve the server performance. Every system will have to be rebooted in order for the changes to take effect.
Please log off before you leave today; your workstation will reboot automatically within half an hour. If your computer has not rebooted by Tuesday, I will arrange a time to reboot it manually.
Please let me know if a reboot will interfere with any long-running calculations (be sure to mention the name of your workstation).
March 2, 2006
I will be shutting mathserv down on Monday, March 6th at ca. 7:30 am; it should be back up by 8:15 am.
January 31, 2006
I was able to keep all machines but mathserv2 (the fail-over server) running through this morning's ventilation shutdown since the AC was off only for part of the announced period.
Mathserv was unavailable via ssh between 8:30 and 9:00; other services (web, mail, workstation) were unaffected and this was not related to the ventilation shutdown.
January 27, 2006
Physical Plant has announced that they will be shutting off the ventilation systems in Hamilton from 6:00 am to 8:00 am on Tuesday (January 31st). I will be shutting down the following systems in the server room before 6:00 am in order to avoid overheating:
I will shut down the following systems only if the room becomes too hot:
January 25, 2006
The changes to the 'math' wireless network announced last week have been implemented. The signal is now concentrated at the Western end of the building and the SSID (network name) is no longer being broadcast. If your laptop doesn't connect automatically (OS X laptops should continue to do so), enter the SSID 'math' manually to connect.
January 19, 2006
The 'math' network has been rendered more or less redundant by the MacConnect network and UTS has asked that we remove the temporary network in order to reduce interference.
January 9, 2006
The machines were turned back on at ca. 10:30 am when the room had cooled enough to take the extra systems.
Now I'm told that the ventilation will go down tomorrow am from ca. 5 - 7. The machines can easily overheat - which could result in file-system damage, even hardware damage - within two hours without ventilation. I'm going to shut down everything but mathserv, bayes and freesurface this time. Redpine, spruce, modelmath and mathserv2 will shut down at ca. 4:00 am.
All of the systems shut down on Saturday are still turned off as the air conditioning is not yet back on. I have contacted Physical Plant.
January 5, 2006
Physical Plant will be shutting down the ventilation system in Hamilton Hall on Saturday, January 7th, which means that there will be no cooling in the server room. In order to prevent overheating, I will be shutting down all systems except for mathserv on Friday evening; the systems will be restarted on Monday morning. There will be no interruption to web, mail or workstation service.
January 4, 2006
January 2, 2006
I have updated the system and application software on the msprime systems in Hamilton Hall to the latest releases for Mandrake 10.1; I recommend restarting your computer when convenient. These updates are mostly bug fixes and security patches.
Updated systems and applications include mozilla, X11, vi and xine.
December 21, 2005
The five minutes of lost network access to mathserv scheduled for 8:00 am today ended up requiring a server reboot and about eight minutes of total downtime. Mathserv was back up at 8:30 and is now operating with a faster network connection.
December 17, 2005
The departmental servers and workstations were back up at 6:45 pm. The department web site was down between 7:00 pm and 8:45 pm due to a DNS delay.
All the servers are now in the new rack and power can now be divided between the current and soon-to-arrive UPS without general service interruptions. The msprime workstations will have picked up where they left off before the server went down at 10:00 am.
My thanks to everyone for their patience during this extended - but necessary - downtime today.
December 16, 2005
As announced in November, mathserv and the other servers in the Hamilton Hall server room will be unavailable much of tomorrow.
The computers will be powered off shortly after 10:00 am; they should be back up for good in the early afternoon.
The msprime workstations in Hamilton Hall don't need to be shut down; when mathserv comes back up, they should pick up exactly where they left off.
For more information, please see the original announcement.
In addition to the servers originally listed, these will also be down tomorrow: modelmath, space.
December 14, 2005
Note that this is not at all related to the over-heating problem in the server room, nor to the network problems that a very few people in HH experienced last week.
As of 2:00 pm, all msprime stations appear to have lost access to the external Internet. The problem appears to be progressive, though many people do still have off-campus access.
At 5:50 pm, I heard from UTS that the problem was fixed:
Problems affecting access to and from off-campus networks yesterday afternoon were rectified and were traced to a denial of service attack originating in one of the student residences. The unusual nature of this particular attack consumed resources on the campus firewall, preventing normal traffic flows from being established. Measures have been taken to prevent this specific type of attack in future.
An unannounced consequence of the (announced) mould-removal work on the first floor of Hamilton Hall is that the building fans have been shut down, which means that there is no air conditioning in the server room. The temperature in the server room was high enough to cause equipment failure by late morning; in fact, one compute server had already crashed.
In order to reduce heat output, I have shut down all non-production systems (freesurface, bluespruce, mathserv2). Spruce and redpine were idle and have been shut down as well. I may have to shut down non-essential servers (bayes, space, modelmath) at very short notice in order to keep the temperature low enough for mathserv to stay up.
I've been told that the air conditioning will be restarted when work on the mould removal stops at ca. 2:30 today.
November 11, 2005
November 8, 2005
November 7, 2005
November 2, 2005
October 24, 2005
The HH-303 printer had been inaccessible since Friday evening; it's working as of 9:30 am Monday.
The printer and workstation queues were actually fine all along; the problem was to do with the printer's network connection.
October 13, 2005
Physical Plant has announced that the power will be shut off for ca. ten seconds in Hamilton Hall at 6:00 am on Wednesday, October 19th.
I will schedule an automatic shutdown of the msprime workstations as well as other RHPCS-administered unix (incl. OS X) machines. Please log out before you leave on Tuesday.
If you administer your own system (this means all Windows users as well as some Macintosh and unix users), I highly recommend turning your computer off before you leave on Tuesday.
The servers will not be affected directly since they are on a UPS, but they will not be accessible while the power is out because the network will not work without electricity.
September 22, 2005
Macintosh OS X users can now mount home directories and web sites from mathserv.
September 4, 2005
July 22, 2005
June 19, 2005
June 16, 2005
June 13, 2005
June 8, 2005
Due to construction and renovations, power will be shut off in Hamilton Hall and T13 on Saturday, June 18th. HH will be powerless from 8:00 am to noon; T13 will have no power from 8:00 am to 8:00 pm.
June 3, 2005
May 27, 2005
I've been able to mount the crashed file system and I've found no evidence of any damage to the mail, home-directory and web-site files created after the Friday am backups and before the Saturday am crash. These files are now available to you.
May 25, 2005
People have encountered a number of problems in the wake of the recovery from the weekend server crash; they are listed here in order of descending priority.
May 24, 2005
Mathserv crashed at 4:07 am on Saturday May 21st. The new fail-over server is now running as mathserv and all services are up as of Monday afternoon. Some email and data files are missing temporarily; I expect to have any missing data restored before Thursday. We may have to schedule some off-hours downtime in the coming weeks to complete the installation of the new fail-over configuration.
May 9, 2005
There will be a brief power interruption in Hamilton Hall on Thursday, May 12th at 6:30 am.
The msprime stations and other managed unix and Macintosh systems will be shut down automatically at 6:00 am. If you manage your own system - which is the case for all Windows PCs - I suggest that you power it off before you leave for the weekend.
The servers will not be affected by the power outage directly, though they will be inaccessible while the network has no power.
April 22, 2005
UTS (formerly CIS) has installed MacConnect wireless in Hamilton Hall. Our temporary AirPort network (called 'math') will be removed on July 1st, 2005. Unlike the 'math' network, MacConnect is available to anyone with a MacId account (i.e. faculty, staff, graduate students and undergraduates); this is the same network which is available in the libraries, the student centre, and a growing list of other buildings.
I emphasize that you will need a MacId and VPN software to use the MacConnect wireless; see the MacConnect web site for more information.
The laptop/wireless section of the computing-resources web site will be updated in the next week or so.
April 13, 2005
There were problems sending mail via mathserv late Tuesday night after the system logs filled the main disk with warning messages. The source of the problem is being investigated and a work-around has been in place since about 8:00 am Wednesday.
This is most likely fallout from the disk crash on Monday.
April 12, 2005
OpenOffice isn't starting up on the msprime stations; I don't yet know why. A simple reinstallation is failing, too.
For now, please use gnumeric to open Excel or OpenOffice Calc spreadsheets.
Abiword will be available for Word and OpenOffice Write documents later on this evening.
Abiword now available; type abiword from a command prompt or Alt-F2.
Mathserv crashed at 1:02 pm on Monday, April 11th. It was back up again at 1:15 am with no data loss.
Most msprime workstations are picking up where they left off, though some may require a reboot.
The cause of the crash was a disk failure on one of the two RAID 5 arrays. "Aren't RAID 5 arrays supposed to be able to handle a single disk failure?" you ask. Yes; yes they are. But this is the second time since January that this server has failed to do so. I think that I have identified and solved a hardware problem that could have precipitated both this crash and the one in March.
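For the curious, here is a minimal sketch of why a single disk failure is supposed to be survivable, assuming simple XOR parity over one stripe. The four-disk layout, the block contents and the helper name xor_blocks are made up for illustration; this says nothing about how the controller in mathserv is actually configured.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical four-disk array: three data blocks plus parity.
data = [b"mail", b"home", b"www!"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]: XOR the surviving data blocks
# with the parity block to rebuild the missing one.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same trick cannot recover two missing blocks from one parity block, which is why a second failure during a rebuild is fatal for the stripe.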
Note that the department web site (including course pages) was only down from 1:00 pm to ca. 2:30 pm. I used the 2:00 am backup copy to bring the sites up on our backup server.
Plans were underway to reduce the likelihood of this sort of failure as well as the downtime should it happen anyhow; I am accelerating the project.