Computing History
The use of computers in science has a very long history in Cambridge. From the early 1950s the central University computer was heavily used in all sorts of scientific fields. This page serves to complement the more general page on TCM's history by recording computing in TCM from the time when TCM itself had hardware for running significant calculations, rather than merely hardware for accessing remote facilities. It does not attempt to record the external national and international facilities individuals in TCM used (e.g. various Crays at Rutherford then Edinburgh).
Early University Computing
The first computer at the University of Cambridge was EDSAC 1, a computer designed and built by the University's Mathematical Laboratory. It ran its first program in May 1949, and was one of the first computers in the world to store its program in memory, rather than with hard wiring. It was also unusual in being designed to be used by scientists. Scientists from many disciplines used it, but for this history particular record should be made of calculations on the band structure of aluminium contained in Prof Volker Heine's PhD thesis (submitted June 1956) for which he used EDSAC 1.
EDSAC 1 was replaced by EDSAC 2 in 1958, which was in turn replaced by TITAN in 1964/1965. TITAN was the first computer in Cambridge to support high-level languages, such as Fortran. In 1970 the Mathematical Laboratory divided to become the Computer Laboratory and the Computing Service, the latter now being called the UIS. In 1971 an IBM 370/165 ("Phoenix") was installed. This was upgraded twice, to an IBM 3081D in 1982, and then finally to a 3084Q in 1989, before being decommissioned in 1995.
Most of these systems considerably pre-dated the first national facility available to TCM, that at Daresbury, which started with a Cray 1 around 1980. Prior to that its computers had been available to support only the experimental programme associated with its particle accelerator.
Floating Point Systems
In 1985 two FPS-264 machines arrived in TCM, necessitating the installation of air conditioning and a halon fire extinguishing system in room 522. The FPS-264 required a host computer, and TCM's were attached to a small VAX.
The FPS-264 had a clock speed of 18.5MHz, and a peak performance of 37MFLOPS. FPS itself was a company based in Oregon which had started selling computers (or co-processors) in 1976, and was bought by Cray in 1991. The 264 which TCM acquired was the company's first product to use emitter-coupled logic rather than TTL, which enabled the clock speed to be raised significantly from the previous 5.5MHz. It remained binary compatible with the earlier, TTL-based FPS-164. I assume it also weighed much less than the 800kg of the FPS-164, else I do not see how two fitted in room 522.
This marked the start of TCM having a significant local compute resource, rather than being wholly reliant on University and National facilities.
The RS/6000 era
By the early 1990s, VMS had made way for UNIX, and there were ten UNIX machines in TCM. Five were DECstations. These had graphical displays, supporting a resolution of 1024x768 pixels and 256 colours, used MIPS processors running at around 25MHz, and would have had a performance of under 10MFLOPS. Four had 16MB of memory, but one, tcm9, was expanded to 40MB. They ran ULTRIX, a rather odd variant of UNIX which did not support shared libraries. Two of these machines were in room 517, which was a "public" computer room. The others were in the offices of very lucky students and postdocs.
Most offices had a single text-only terminal, connected to the rest of the world via a serial line. Some of these terminals were from the traditional VT100 series, and others were BBC Model B microcomputers with suitable ROMs in their expansion slots. Yet others were unwanted Macintoshes, mostly with integrated nine inch black-and-white (not greyscale!) displays with a 512x342 resolution. Even the Group's laser printers were connected via serial lines, and printing a 1MB file over a 9,600 baud (roughly 1KB/s) line took the best part of twenty minutes.
The bigger jobs were run on five IBM RS/6000s. These had no consoles at all, lived in room 522 where the FPS-264s had been, and were accessible only via a queueing system (NQS). They had up to 64MB of memory, and ran AIX, IBM's proprietary version of UNIX. The POWER1 processor they contained was one of the first able to execute two independent floating-point instructions per clock cycle, although its clock speed was modest. My memory says that TCM had the 41.6MHz version.
TCM also used national facilities, such as the Cray Y-MP/8 at Rutherford.
The Alpha era
DEC launched the Alpha processor in 1992, and, in the Summer of 1993, TCM installed its first DEC Alphas. This new processor offered 64-bit operation, and clock speeds of over 100MHz. Within a year TCM had purchased five of these, which were now the most powerful machines in the Group. Unlike the IBMs, they were configured with graphical consoles and without a queueing system. The Alpha could issue only a single floating-point instruction per clock-cycle, so the performance increase was not huge. However, TCM's versions ran at 133MHz or more, and had 96MB of memory, so were certainly superior to our RS/6000s.
Memory was still expensive, and buying 96MB models cost a lot more than the standard 32MB. One (tcm14) was purchased with 288MB, which cost around £20,000 for the memory alone. The machines may have been hot and noisy, but many still welcomed them into their offices for their graphical consoles, with huge 1280x1024 pixel CRT displays, in some cases 21" monitors weighing over 30kg.
In the end TCM purchased ten of the original DEC 3000 AXP range, the most powerful being a 175MHz 320MB version which could get close to 100MFLOPS on the Linpack benchmark. They ran Tru64, another DEC version of UNIX, but rather more standard than ULTRIX had been.
TCM continued to purchase the later generations of Alpha too. The next generation offered higher clock speeds, of around 200 to 300MHz, and moved to the standard PCI bus for peripherals from DEC's proprietary TurboChannel used on the DECstations and the 3000 AXP range. By this point memory was sufficiently cheap that we were able to buy machines with 256MB fairly routinely.
The EV5 generation of Alpha saw a significant performance improvement: it could now issue a floating-point multiply and an addition in the same clock cycle, just as the IBM POWER1 could. It also had an improved cache design compared to the previous Alphas. Microsoft produced a version of Windows NT for the Alpha (although TCM never used it), and there was evidence that an EV5-based machine could outperform any Intel PC on many Windows benchmarks. For floating-point work the difference was significant. By mid 1996 the EV5 had reached 433MHz, whereas the fastest Intel CPUs were still under 250MHz and could issue just a single floating-point operation per cycle, not the Alpha's two.
Whilst a 266MHz Alpha was certainly not as fast as the national Cray J90, many in the Group felt that the amount of computing we could do on our multiple Alphas easily exceeded what we could do on the small fraction of the J90 available to us.
Cray too was impressed by the Alpha processor, and its parallel machines the T3D and T3E both used it. The national facility to which some in TCM had access had a T3D.
Central UNIX systems
The central University computers listed at the top of this page did not run UNIX. The first widely-used central system in Cambridge to do so was CUS, the Central Unix System, which consisted of three, later four, Sun computers.
The arrival of CUS, and the subsequent decommissioning of Phoenix, left an empty machine room. This was filled by the High Performance Computing Facility (HPCF, later CC-HPCF, then HPCS, then CSD3). The HPCF was not part of the central Computing Service, but was formed by a consortium of scientific and engineering Departments, with considerable input from TCM. Its first machine was a Hitachi S3600 vector computer, the most powerful air-cooled vector computer available from Hitachi. Although Phoenix had been water-cooled, it had disgraced itself on a couple of occasions, and it was made clear to the HPCF that further water-cooled machines would not be welcome.
The S3600 was joined almost immediately, also in 1996, by a Hitachi SR-2201, a parallel machine with 64 CPUs initially, but quickly upgraded to 96. TCM funded a sufficient portion of this machine to be given a dedicated partition of sixteen CPUs, a unique privilege. The SR-2201 was upgraded to 224 CPUs in 1997, then to 256 CPUs in 1998 before being decommissioned. It was listed in the first 100 of the global Top500 supercomputers list from November 1997 to November 1998 inclusive.
Subsequent machines run by the HPCF were an Origin 2000, an IBM SP (Power3, later upgraded to Power4), and finally a cluster of SunFire 15Ks, run in conjunction with Cranfield University, before it moved to using clusters of commodity PCs. All were much used by TCM.
The HPCF, now called CSD3, is today part of the UIS. There is an archive of details of early HPCF machines.
The arrival of Linux
The Alphas were very nice, but also very expensive, costing around £20k in the configurations TCM was purchasing. The PhD students increasingly wanted access to terminals capable of supporting graphics, and it was clear that Alphas could not be afforded in sufficient numbers to provide this. So in 1995 a group of PhD students set themselves up to purchase cheapish PCs (with the Group's money!) to run Linux and to act as little more than dumb terminals, but also to run editors, LaTeX, gnuplot, etc. locally. The first such machines were Pentiums running Slackware.
This venture was quite a success, and soon a significant number of Linux PCs were being run in TCM by the PhD students. Although the Group had a Computer Officer, he refused to touch Linux, or, indeed, to install on the Alphas public domain software not distributed by DEC. Again the PhD students successfully lobbied to be allowed to maintain and curate a central directory of public domain software, containing such essentials (then) as LaTeX, gnuplot and xmgr.
It soon led to the Group's first web server appearing. The precise date is lost in the mists of time, but a snapshot on the Wayback Machine shows usage data starting in December 1995, with 313 accesses in the week beginning 3rd December, when, according to Netcraft's data, there were fewer than 80,000 sites on the web. It was set up and run by a PhD student, and was not adopted by a staff member until mid 1998, by which point it was serving over 6,000 pages per week. It offered all users of TCM's computers personal webpages from the outset.
On 4th March 1999 the internet news website Slashdot published a brief comment pointing to work done by Thomas Fink and Yong Mao on tie knots, linking to their page on TCM's website. Before this point the server's busiest week had seen 12,500 pages served. In the week beginning 28th February 1999 it served over 88,000 pages, despite spending part of that week down after crashing under the load, which led to the server being reconfigured to be more resilient to load spikes.
It was also realised that, whilst the Alphas were excellent at heavy numeric work, PCs could be competitive for programs involving mostly integer operations, especially when a PC software licence was cheaper than an Alpha one (as was the case for Mathematica).
More Alphas
Whilst the Alpha hardware was quite expensive, the software support was cheap. A contract costing £3 per machine per month entitled one to run the latest version of the OS, its compilers (C/C++/Fortran), and the vendor-optimised maths library. The compilers were quite good too, so much so that Intel acquired most of the DEC/Compaq compiler team in late 2001 and merged technology from the DEC compiler into what is now ifort. Even in 2019 the ifort man page states that "decfort_dump_flag is an alternate spelling for FOR_DUMP_CORE_FILE."
TCM continued to purchase Alphas, through the EV5.6 generation of PW500au machines, and then into the EV6 generation of XP1000s. The EV6 had the same theoretical peak performance per clock cycle as the EV5, but much improved caches and other architectural improvements, leading to much better performance in practice. A 667MHz XP1000 was the first machine in TCM to achieve 1GFLOPS on Linpack 5000x5000, and the first to achieve 1GB/s on Streams. A 500MHz XP1000 was the first machine in TCM to have 1GB of memory, and later TCM had some with the maximum of 2GB. Unfortunately development of the Alpha processor then slowed and eventually ceased, so TCM migrated from Alphas to PCs. That migration was already well under way before the end, and TCM never bought any of the last range of Alphas, which used the EV6 running at 1GHz with faster memory.
The last Alpha to serve as someone's desktop in TCM was retired in 2009, and our last Alpha decommissioned in 2013. (A small number of people had continued to value them for their compilers and tuning utilities long after their absolute performance had become uncompetitive.)
The irony for this trend-setting 64 bit architecture was that none of the three dozen uniprocessor Alphas passing through TCM's hands was capable of using more than 2GB of memory.
Other RISCs
TCM used other RISC machines besides Alphas. When the Cambridge HPCF had a large SGI Origin 2000, TCM acquired a dual processor SGI Octane for local code development. Prior to this it had also had an SGI Indigo2. And when the Cambridge HPCF moved to using SunFire 15Ks, TCM also acquired a couple of SPARC-based machines, a 280R and a V480.
And, when Apple launched the first PowerPC-based Macintoshes in 1994, TCM acquired a pair. It was soon realised that, for TCM's purposes, they were quite unsuitable, being expensive and having poor software support.
TCM also had the pleasure of hosting a donated Intel Itanium 2 based computer for many months. Not RISC, not CISC, but VLIW.
The PC era
Two things caused the Group to move decisively to using PCs as its main compute platform. The Pentium 4, introduced in 2000, at last had decent memory bandwidth and floating-point performance when compared to the RISC alternatives. However, it was still 32-bit, and a 4GB limit on memory use was going to be a problem. But then AMD produced the Opteron in 2003. This not only moved to 64 bits, but also represented a large increase in memory bandwidth for multi-CPU designs, as it had one memory controller per CPU in a NUMA architecture. Perhaps one should also mention the improving compilers for Linux-based PCs, both GCC and Intel's compiler suite, without which PCs would have been useless for running code developed in TCM.
In 2004 we purchased a four-CPU machine with 8GB of RAM running on 64-bit "PC" hardware, i.e. Opteron CPUs. I do not believe that we have since purchased any non-x86_64 CPU for compute work, with single-CPU machines on the desktop and mostly dual-CPU machines as compute servers. It took until 2014 before the last 32-bit PC was eliminated, after which we could use precisely the same binaries on all of our PCs, and no longer had to install two versions of most software.
TCM had quickly abandoned the Slackware Linux distribution in favour of RedHat. The abrupt discontinuation of the free version of RedHat caused TCM to move first to SuSE Linux Enterprise Server, for which the University had a site licence. When that licence disappeared, it moved to OpenSuSE. When OpenSuSE's support policy became incompatible with TCM's practice of a major OS upgrade on its PCs every other Summer, it moved to Ubuntu LTS.
PC Clusters
When TCM moved its server room from 522 to 527, the intention was to revert 522 to being an office. However, Richard Needs soon expressed a desire to purchase a PC cluster. So from 2003 room 522 housed a collection of sixteen dual socket 32-bit Pentium4 machines with 2.4GHz single-core CPUs and 1GB of memory each.
Late in 2009 this cluster was replaced by one based on dual-socket nodes, this time from Sun. Although the clock speed of the new nodes was little changed, at 2.66GHz, the quad core Nehalem CPUs were much faster than the single-core Pentium4s. Though the new cluster had just half as many nodes as the old, its peak performance was about five times higher.
This new cluster completely replaced the original. Since then it has been upgraded incrementally. First, four new nodes, each containing a pair of ten-core 2.6GHz Haswell processors, were added in 2015. Then in 2017 four of the Nehalem nodes were replaced by four nodes each containing a pair of Xeon Gold 6126 CPUs (twelve cores per CPU), and the head node was replaced at the same time. Finally, in 2019 the remaining four Nehalem nodes were replaced by a single Xeon Gold 6226 node (which offered more performance for less electricity, weight and space).
So 522 is now known as the "cluster room", although it does not only house the cluster. It also contains various other headless compute servers, in a tradition going back to the days of the FPS 264s and including a couple of SPARC-based Suns before the move to x86_64. Limits on power and weight mean that really big computers cannot be housed here.
Fileservers and backups
TCM's computers have always used network-mounted home directories, so that one sees the same home directory no matter which machine one is logged in to. However, in the 1990s it did not purchase any dedicated servers, but relied on various of its compute machines doubling up as servers with external disks, generally of 2GB or 4GB. A lot of centralised applications were also provided in this fashion. Default home directory quotas have always been small, but in the early nineties they were merely 10MB -- less than a box of floppy disks.
The result was quite difficult to back up, as important data were scattered, and it was difficult to predict what would break if a given machine was rebooted, overloaded with a job, or failed. In late 1999 TCM at last centralised, acquiring a single server, with attached DLT7000 tape library, for home directories, central applications, and the password file. This machine was an AlphaServer DS20, and cost just under £40k including VAT and tape library. It had 1GB of memory, two 500MHz CPUs, and five 36GB disks for data.
The migration to the new machine was more rapid than originally planned. The Physics Department had been providing a backup service to TCM. On 15th February 2000, it announced that, due to a Y2K issue with the backup software, backups "should be considered to have stopped as of 31 December 1999. A rapid improvement of this situation is unlikely."
The new fileserver coincided with another change in policy. Since its arrival, TCM has ensured that the sum of the quotas given to users on any disk is less than the capacity of the disk. Before this point, disks were over-committed with quotas, and it was common for them to fill, so, even if one was under one's personal quota, one might still be unable to save any work.
The DS20 had a fairly short life, being replaced in 2003 with a SunFire V240, or, more accurately, a pair of SunFire V240s, an external disk box containing twelve 72GB disks, and an SDLT320 tape library. Each V240 had 2GB of RAM and two 1GHz UltraSPARC IIIi CPUs. This cost just over £30k. It moved us away from the Alpha architecture, which was clearly dying by this point, and offered much improved backup and network speeds, and greater capacity. Sun's Solaris operating system was widely used in Cambridge at the time, and it looked like the direction TCM would move in, just as the HPCS had with its purchase of several million pounds worth of Sun machines (SunFire 15Ks).
Many factors caused this trajectory to change, and, by the time this machine was finally replaced in 2015, it was very hard to find anyone in Cambridge with current knowledge of Solaris; Sun itself had been purchased by Oracle in a deal announced in 2009. It was replaced by a pair of Intel servers, using disks, not tapes, for backup. Disks make backups quicker, and user recovery of files much quicker and simpler. They also work well with simple software such as rsync, rather than commercial software which can fail due to licence key issues (amongst other things).
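For the curious, the sketch below shows the kind of hard-linked rsync snapshot scheme that makes disk-to-disk backups both simple and quick to restore from. It is a minimal illustration only: the hostname, paths and wrapper script are assumptions for the example, not a description of TCM's actual backup configuration.

#!/usr/bin/env python3
"""Minimal sketch of a disk-to-disk backup using rsync with hard-linked
snapshots. Illustrative only: the host, paths and lack of error handling
are assumptions, not TCM's real backup system."""

import datetime
import pathlib
import subprocess

SOURCE = "fileserver:/export/home/"          # hypothetical source, rsync over ssh
BACKUP_ROOT = pathlib.Path("/backup/home")   # hypothetical local backup disk

def snapshot() -> None:
    BACKUP_ROOT.mkdir(parents=True, exist_ok=True)
    dest = BACKUP_ROOT / datetime.date.today().isoformat()
    previous = sorted(p for p in BACKUP_ROOT.iterdir() if p.is_dir() and p != dest)
    cmd = ["rsync", "-a", "--delete"]
    if previous:
        # Files unchanged since the last snapshot are hard-linked rather than
        # copied, so every snapshot looks complete but costs little extra space.
        cmd.append(f"--link-dest={previous[-1]}")
    cmd += [SOURCE, str(dest)]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    snapshot()

Restoring a user's file is then just a copy from the relevant dated directory, which is part of why user recovery became so much quicker than it had been with tapes.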
An SDLT320 tape cartridge contained 1,800ft (550m) of half inch wide (13mm) tape. The tape moved at just over 10ft/s (3m/s, 6.5mph), and the head wrote eight tracks in parallel for the full length of the tape, moved slightly vertically, and then wrote another eight tracks in the opposite direction, until 448 tracks were built up across the width of the tape, giving a density of just over a thousand tracks per inch, and 190kbits/inch along each track. The reliability was remarkably good, but also good was the potential for chaos with a third of a mile of tape ready to be thrown out at many feet per second. Each cartridge contained a single reel, with the other reel in the drive, and a mechanism for catching the free end of a tape and winding it on to the internal spool. The drive needed to be very co-operative for an unload operation to work, for typically several hundred feet of tape were wrapped around the spool internal to the drive.
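As a rough consistency check (my own arithmetic from the figures above, not a quoted specification), the raw capacity implied is

\[
190\,\text{kbit/inch} \times 1800\,\text{ft} \times 12\,\text{inch/ft} \times 448\,\text{tracks} \approx 1.8 \times 10^{12}\,\text{bits} \approx 230\,\text{GB},
\]

comfortably above the cartridge's 160GB native formatted capacity, the difference presumably going on formatting, servo and error-correction overhead.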
Disk-based backups were faster, but still far from instantaneous. In 2016 moving the local backup from one RAID array to another took over eleven hours to create the new array, and over eleven hours to copy 19 million files. Once this was done, the default home directory quota was increased from 300MB to 768MB.
Printers
In the early 1990s the Group had two black and white PostScript laserprinters, connected to the network via serial lines running at about 1KB/s. A large document, or complicated graph, could take ten minutes simply to transmit to the printers!
These were then supplemented by a colour inkjet printer connected via ethernet. Text from the inkjet printer was not as sharp as from any of the laser printers, so theses tended to be printed in black and white, and then any pages containing colour images printed separately. At about this time the black and white printers moved to being connected by ethernet, which, in those days, was 10MBit/s (roughly 1MB/s), an enormous improvement on 1KB/s.
Automatic duplex printing was not possible, and University regulations at the time prohibited PhD theses from being printed duplexed anyway!
Then the Group purchased its first colour laser printer, an Apple. It was later replaced by an HP model, which had its colour cartridges in a carousel: printing a colour page was a slow, four-pass operation, with the carousel rotating noisily between each pass. Its maximum speed in colour was four pages per minute, about the same as the Apple.
In 2003 TCM expanded its printing capabilities with the purchase of an A3 colour laserprinter, and an A0 inkjet poster printer, an HP DesignJet 1055. This printer somehow kept running until 2018 when it was replaced by a rather superior model offering much better print quality, and the ability to print on fabric. (Fabric can be gently folded into a suitcase in a way that paper cannot.) The original printer was becoming increasingly unreliable, and the shock of being moved caused it to fail completely.
So since 2003 we have been able to offer members of the Group a self-service poster printing service.
A very ephemeral sort of "printing" is the image produced by a seminar room projector. TCM acquired its first in the Summer of 2000, and it cost an impressive £3,900 despite offering a resolution of merely 1024x768. It was replaced by one of similar specification which lasted until 2013, when we upgraded to 1920x1080.
Before this era the alternative to the blackboard was the overhead projector, with either handwritten slides (possibly written in real time) or laser-printed ones. The latter needed care, for the cheap A4 acetates used for handwritten slides would soften in a laserprinter, stick to the fuser unit, and cause a few hundred pounds' worth of damage. Acetates which could survive printers (and photocopiers) cost almost ten times as much as the standard ones. There was also a 35mm slide projector, which almost no-one ever used, partly because there was no obvious way of producing 35mm slides.
Networking
In the early nineties a mixture of networks served TCM's offices. There was some ethernet over UTP, just as there is today, only then it ran at 10MBit/s. There was a small amount of 10Base2 ethernet, which was 10MBit/s over co-axial cable. It resulted in a single cable snaking around various offices, and any breaks stopped the whole network and could be difficult to trace. However, the text terminals, printers, and Apple desktops (mostly used by the staff) were connected by serial links running mostly at 9,600 baud, i.e. 1KB/s.
The first improvement was the conversion of serial links to ethernet. However, the ethernet was mostly repeated, rather than switched, so many machines shared the 10MBit/s bandwidth. It is memorable that the Department, having misunderstood a licence, had a single copy of the NAG libraries on a central fileserver. If one tried to link against this library at a time when the Departmental network was busy, it could take five to ten minutes.
In 1999 TCM decided to upgrade its network to 100MBit/s ethernet, and fully switched. The Department was not interested in doing this more centrally, so TCM started to become more responsible for the financial and staffing costs of its local network, which previously the Department had provided.
It soon discovered that much of the Department's twisted-pair wiring was incorrect. The point of twisted pairs is that each pair carries a signal, with the two wires carrying equal and opposite currents, and, provided the pitch of the twist is a lot shorter than the wavelength of the signal, there is almost no dipole radiation. For ethernet this requires that pins 1 and 2 get one pair, 3 and 6 another, 4 and 5 another, and 7 and 8 the last. The Department had failed to train its electricians on this point, and both patch cables and the building's wiring tended to use the more logical pairing of 1 and 2, 3 and 4, 5 and 6, and 7 and 8. Attempts to run cables like this of more than a few feet at 100MBit/s produced a complete loss of connectivity, for ethernet does not reduce its transmission speed if the physical connection is poor.
So in 2002 TCM ended up having to rewire the whole of the top floor of the Mott Building so that it could use its new 100MBit/s hardware. This led to 527 becoming the new server room, so as not to disturb room 522 during the work. Again, the Department was not interested in doing this more centrally.
With the Group's network working well, the main source of annoyance was the unreliability of the Departmental network to which it still connected. After some time using various forms of "storm control," so that issues on the Departmental network left the Group's network functional but isolated, rather than completely non-functional, the Group decided to move to having its own direct connection to the CUDN. This, achieved in 2003, also greatly improved security, and was recommended for that reason by the UCS. It also meant that, de facto, IT in TCM was independent of the rest of Physics.
Of course, once the Department saw how happy TCM was with its new network, it did arrange for the rest of the laboratory to be rewired and upgraded using central funds and staff, and did teach its electricians how to wire UTP correctly. And in 2009 Physics decided to split its network into smaller pieces too, partly in order to simplify its management.
TCM embraced the University's wireless service in February 2007. Concerns that it might interfere with experimental work were thought to be unfounded, given the prevalence of mobile phones, microwaves, and ad hoc WiFi devices in the building, and given the discovery a few years earlier of how much incorrectly wired UTP cable was in use. TCM of course offered to remove its WiFi points if they did cause an issue, but it saw no reason to delay the deployment. Its independence from Physics enabled it to make this trial alone. No issue was reported, and Physics started to deploy Lapwing about eighteen months later.
Finally, in 2016 the old Cisco network hardware purchased in 1999 was replaced, and TCM moved from 100MBit/s to the desktop, with 1GBit/s links between switches and to important servers, to 1GBit/s to the desktop, with 10GBit/s for high-capacity links. In 1999 it had been necessary to use optical fibre for the 1GBit/s links, but in 2016 it was possible to run 10GBit/s over UTP. The new network also offered improved management features, whilst saving several hundred watts of power.