When technology fails on a large scale, we are reminded of the immediacy and seriousness of the potential damage. As most people know, the fallout from technology failures has taken many forms, including misguided spacecraft, network outages, and flawed early warning systems. Each IT “mega-disaster” should give IT professionals pause to reconsider the causes and effects of potential failure scenarios, disaster recovery planning, system backups, and testing. What follows is a reminder of just how catastrophic IT failures can be.
Soviet Early Warning System Failure (1983)
On September 26, 1983, Stanislav Petrov was on duty at the Oko nuclear warning command center in the Soviet Union when the early detection system malfunctioned, indicating that the United States had launched up to five nuclear missiles. Fortunately, Petrov judged the warning to be a false alarm. His decision to treat it as such averted a retaliatory Soviet nuclear attack on the U.S. and its NATO allies, and a corresponding response from the West.
This incident is especially compelling because it occurred a mere three weeks after the Soviet military shot down Korean Airlines flight 007, which significantly increased tensions between the U.S. and Soviet Union. Flight 007 was on the last leg of its journey from New York City to Seoul with a stopover in Anchorage. Shortly after taking off from Anchorage, the plane veered off course and entered Soviet airspace over the Kamchatka peninsula where top Soviet military installations were located. A Soviet fighter jet launched a heat-seeking missile after alleged attempts to contact the Korean airliner went unanswered. The missile caused the plane to crash into the Sea of Japan, killing all aboard.
More than two decades after the incident, the Russian Federation indicated that nuclear retaliation would have required confirmation of an attack from multiple sources. Even so, the incident revealed a serious flaw in the early warning system, which the Russian Federation eventually acknowledged. The false alarm in 1983 resulted from a rare alignment of sunlight on high-altitude clouds and the satellites’ orbits, an error later corrected by cross-referencing data from a geostationary satellite.
AT&T Network Failure (1990)
On the afternoon of January 15, 1990, staff at AT&T’s Bedminster, NJ, operations center noticed various warning signals from the company’s worldwide network. The warnings continued until it was apparent that the malfunction was leapfrogging from one computer-operated switching center to another. Managers scrambled, implementing standard procedures to revive the network. Nine hours later, the network was stable. During that time, however, about 50% of calls did not go through, costing the company more than $60 million in revenue from unconnected calls.
At the time, AT&T had a reputation for reliability. It handled about 115 million calls on an average day through a computer-operated network of 114 linked switches throughout the U.S., each capable of handling up to 700,000 calls per hour. When a call came into the network from a local exchange, the switch would determine which of 14 possible connection routes could carry the call. At the same time, it would pass the telephone number to a parallel signaling network to determine whether a switch at the receiving end could deliver the call to the local phone company. If the destination switch was busy, the first switch would send a busy signal to the caller’s phone and release the line. If the line was available, a network computer would “make a reservation” at the destination switch and pass the call along, as sketched below. The whole process took 4–6 seconds.
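To make the call-setup logic above concrete, here is a minimal sketch in C. It is not AT&T’s actual software; the names, types, and capacities are hypothetical. The originating switch asks the destination switch for a reservation and either completes the call or returns a busy signal.

```c
/* Minimal sketch of the call-setup reservation described above.
   Not AT&T code; names, types, and capacities are hypothetical. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_TRUNKS 4            /* hypothetical per-switch capacity */

struct toll_switch {
    int trunks_in_use;
};

/* "Make a reservation" at the destination switch. */
static bool reserve_trunk(struct toll_switch *dest)
{
    if (dest->trunks_in_use >= MAX_TRUNKS)
        return false;           /* destination busy */
    dest->trunks_in_use++;
    return true;
}

static void place_call(struct toll_switch *dest, const char *number)
{
    if (reserve_trunk(dest))
        printf("call to %s connected\n", number);
    else
        printf("call to %s gets a busy signal; line released\n", number);
}

int main(void)
{
    struct toll_switch dest = { .trunks_in_use = MAX_TRUNKS - 1 };
    place_call(&dest, "555-0100");   /* last free trunk: connects */
    place_call(&dest, "555-0101");   /* destination full: busy signal */
    return 0;
}
```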
Technicians traced the source of the problem to a software upgrade installed in the switches several months before the network failure. Though the upgraded code had been tested, a one-line bug involving a misplaced “break” statement ended up in each of the 114 switches. When a specific “if” condition was met, the misplaced break caused a switch to shut itself down, and the shutdowns cascaded across the network.
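The pattern behind that kind of one-line bug is easy to illustrate. The sketch below is not AT&T’s 4ESS source code; it is a hypothetical C example of how a break placed inside an if within a switch statement exits the entire case early and skips the processing that keeps shared state consistent.

```c
/* Illustrative only: a misplaced `break` inside an `if` within a `switch`
   exits the whole case, silently skipping essential processing. */
#include <stdbool.h>
#include <stdio.h>

enum msg { MSG_SWITCH_RECOVERED, MSG_CALL_SETUP };

static bool write_buffer_busy = true;   /* hypothetical state */
static int  processed_updates = 0;

static void handle_message(enum msg m)
{
    switch (m) {
    case MSG_SWITCH_RECOVERED:
        if (write_buffer_busy) {
            /* intended: exit only this `if` clause */
            break;  /* BUG: `break` exits the enclosing `switch`, so the
                       essential processing below never runs */
        }
        /* essential processing that the early break skips */
        processed_updates++;
        break;
    case MSG_CALL_SETUP:
        /* normal call handling would go here */
        break;
    }
}

int main(void)
{
    handle_message(MSG_SWITCH_RECOVERED);
    printf("processed updates: %d (message silently dropped)\n",
           processed_updates);
    return 0;
}
```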
Air Traffic Control System Failure (2014)
On April 30, 2014, a bug in an air traffic control system in use at LAX caused hundreds of flights to be delayed or cancelled. The En Route Automation Modernization system, or ERAM, cycled off when it detected a spy plane flying near LAX. Because ERAM could not find altitude information in the spy plane’s flight plan, an air traffic controller estimated the altitude and entered it into the system. In response, ERAM calculated all possible flight paths to ensure the spy plane was not on a trajectory to crash into other planes. In the process of performing those calculations, the system ran out of memory and shut down every other flight-processing function. Fortunately, no one was hurt.
The flaw in the $2.4 billion Lockheed Martin ERAM system was that it limited the amount of data each plane could send. While most planes had simple flight plans that would not have exceeded the data limit, the spy plane’s flight plan was complex and brought the system to its data limit, as the sketch below illustrates.
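Here is a hedged C sketch of the kind of per-flight data limit described above, with a defensive bounds check added. The sizes, structure names, and error handling are hypothetical, not drawn from ERAM itself.

```c
/* Illustrative only: a fixed per-flight data budget, with a bounds check
   that rejects an oversized flight plan instead of exhausting memory. */
#include <stdio.h>
#include <string.h>

#define MAX_WAYPOINTS 64                 /* hypothetical per-flight limit */

struct flight_plan {
    int    waypoint_count;
    double waypoints[MAX_WAYPOINTS][2];  /* lat/lon pairs */
};

/* Returns 0 on success, -1 if the plan exceeds the fixed budget. */
static int load_flight_plan(struct flight_plan *fp,
                            double (*points)[2], int count)
{
    if (count > MAX_WAYPOINTS) {
        fprintf(stderr, "flight plan too complex: %d waypoints (limit %d)\n",
                count, MAX_WAYPOINTS);
        return -1;                       /* fail gracefully, keep running */
    }
    fp->waypoint_count = count;
    memcpy(fp->waypoints, points, (size_t)count * sizeof(points[0]));
    return 0;
}

int main(void)
{
    static double complex_route[128][2]; /* more waypoints than the budget */
    struct flight_plan fp;

    if (load_flight_plan(&fp, complex_route, 128) != 0)
        printf("plan rejected; other flight processing continues\n");
    return 0;
}
```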
The Ariane Rocket Launch (1996)
Part of a Western European initiative to carry satellites into orbit, the Ariane 5 rocket launched on June 4, 1996, and exploded roughly 40 seconds later. The Ariane program began in 1973 and was intended to give Europe a stronger position in the commercial space business. The Ariane 5 took 10 years and $7 billion to develop.
Essentially, the flaw lay in the two Inertial Reference Systems (IRS) operating in parallel within the launcher. Each IRS had its own internal computer that measured the attitude and movements of the launcher in space and sent that information to the onboard computer to execute the flight plan. Ariane 5’s guidance system shut down 36.7 seconds into liftoff, when the onboard software tried to convert IRS data from a 64-bit floating-point value into a 16-bit integer. The value was too large, causing an overflow error. Technicians had failed to allow for the higher horizontal velocity of the Ariane 5 versus its slower predecessor, the Ariane 4.
The ensuing explosion occurred when the guidance system shut down and passed control to the identical backup IRS, which had failed in the same way a few milliseconds earlier. The routine containing the fatal bug served no purpose once the rocket was airborne. In hindsight, it should have been switched off; instead, the engineers left it running for about 40 seconds after liftoff to allow an easy restart in case of a brief hold in the countdown.
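The Ariane flight software was written in Ada, so the following is only a C sketch of the failure mode described above: an unchecked narrowing conversion of a 64-bit floating-point value into a 16-bit integer, alongside the range check whose absence allowed the overflow. The variable name and value are hypothetical.

```c
/* Illustrative only: unchecked vs. checked narrowing conversion. */
#include <stdint.h>
#include <stdio.h>

/* Unchecked: if value is outside the 16-bit range, the behavior is
   undefined and the result meaningless -- the failure mode above. */
static int16_t convert_unchecked(double value)
{
    return (int16_t)value;
}

/* Checked: detect out-of-range values and report an error instead. */
static int convert_checked(double value, int16_t *out)
{
    if (value > INT16_MAX || value < INT16_MIN)
        return -1;                     /* let the caller handle it */
    *out = (int16_t)value;
    return 0;
}

int main(void)
{
    double horizontal_velocity = 65535.0;  /* hypothetical out-of-range value */
    int16_t result;

    if (convert_checked(horizontal_velocity, &result) != 0)
        printf("value out of 16-bit range; conversion rejected\n");
    (void)convert_unchecked;               /* shown only for contrast */
    return 0;
}
```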
The failure of this launch increased project costs by about $370 million. Those interested in the details can consult the Inquiry Board’s full report on the failure.
The Great Northeast Blackout (1965)
When electricity usage increased dramatically in the 1950s, many power companies wanted to ensure adequate supplies. One such effort joined New York, New England, and Ontario, Canada, in an expansive electrical grid. The underlying concept was that when demand spiked in one area of the grid, the other areas would supply the shortfall to prevent shortages and blackouts.
In the case of the Northeast blackout, engineers failed to consider what effects surging supply in one area might have on the other areas of the grid. The trigger for this epic blackout was a single relay on a power line running from Niagara to Ontario. The relay had been set to trip if the power flow exceeded a certain level, and that is exactly what happened on November 9, 1965. The load exceeded that level and tripped the relay, causing the power bound for Toronto to surge back into western New York. This surge, in turn, caused generators to shut down to avoid an overload. The cascade spread to New York City and eastward to the Maine border. Thirty million people in an 80,000-square-mile area lost power for up to 13 hours. Fortunately, a bright full moon that evening provided some relief for the millions suddenly cast into darkness.
Most of the television stations in the New York metro area, and about half of the FM radio stations, were forced off the air when the transmitter tower atop the Empire State Building lost power. Yet, as extensive as the blackout was, some neighborhoods never went dark, such as Bergen County, New Jersey. Areas spared the blackout were served by electric companies that were not connected to the grid.
Following the incident, measures to prevent another large-scale blackout included the formation of reliability councils to establish standards, share information, and improve coordination among providers. The task force investigating the incident found that inadequate voltage and current monitoring practices were contributing factors, so the Electric Power Research Institute partnered with the power industry to develop new metering and monitoring equipment and systems. Today’s SCADA (Supervisory Control and Data Acquisition) systems, which provide remote monitoring and control through coded signals over communication channels, evolved from those post-blackout measures.
As technology becomes more complex and its reach more pervasive, the risk of failure increases, as does the potential damage. The real lesson of past epic technology failures is to strengthen risk-abatement activities such as scenario planning, system backups, testing, and disaster recovery planning.