March 26, 2006

THE STORY OF A SOFTWARE FAILURE
Greatest Debacle in the History of Organized Work?

About a year ago, I wrote about the failure of the FBI’s Virtual Case File System…which was intended, among its other purposes, to permit the integration of multiple sources of data about terrorist activity. Failures of large software projects are, of course, nothing new. One of the most famous failures was the FAA’s project to develop the “Advanced Automation System” for air traffic control. Robert N Britcher, who was involved with the project and has studied and written about it extensively, remarks that it may have been the greatest debacle in the history of organized work.

Before going further, I’d like to emphasize that this post is not intended as FAA-bashing. I think the FAA’s air traffic control organization, in particular, does, for the most part, a pretty darned good job. It is very, very rare for airplanes under ATC control to run into each other, and that is, after all, what it’s all about. Much of the strident criticism directed at the FAA in the media seems to me to be both unfair and not very knowledgeable.

The Advanced Automation System, though, really does seem to have been a goat rodeo on a truly amazing scale.

The AAS, which was begun in 1981, was to be a revolutionary system that would drive sweeping change in all aspects of air traffic control–“as radical a departure from well-worn mores and customs as the overthrow of the czars,” as Britcher puts it. Computers had been used in air traffic control since the late 1960s, when an IBM-based system for enroute control was put in place, along with a UNIVAC-based system for terminal-area control. These systems operated successfully for many years, but by the early 1980s, they were growing a bit long in the tooth. Air traffic had dramatically expanded, and congestion was increasing. The federal government was facing budget pressures, and was looking for cost savings. And, in the aftermath of the controllers’ strike (1981), anything that would reduce staffing requirements was attractive to FAA management. Finally, radical automation projects were in the zeitgeist–it was also in the early 1980s that Roger Smith would kick off his gigantic (and ultimately not very successful) project for the comprehensive use of robotics in General Motors assembly plants.

The radical ambitiousness of the AAS was described metaphorically by an engineer who worked on the project:

You’re living in a modest house and you notice the refrigerator deteriorating. The ice sometimes melts, and the door isn’t flush, and the repairman comes out, it seems, once a month. Then you notice it’s bulky and doesn’t save energy, and you’ve seen those new ones at Sears. The first thing you do is look into some land a couple of states over, combined with several other houses of similar personality. Then you get I M Pei and some of the other great architects and hold a design run-off…

The design run-off, in this case, was between Hughes and IBM. IBM won. The $3.7 billion contract was celebrated with a great ball at Union Station in Washington, DC, featuring Chubby Checker and “The Twist.”

Almost immediately things started to go wrong.

The new system was to put great emphasis on improving the visual display of information, via large, crisp, full-color display screens, and facilitating the controller’s interaction with that information. “Human factors experts” were employed to assist in this process. “Thousands of labor-months were spent designing, discussing, and demonstrating the possibilities: colors, fonts, overlays, reversals, serpentine lists, toggling, zooming, opaque windows–the list is too long for this summary.” Since human factors are so subjective, endless argument was possible. The problems were made even more complex by the FAA’s absolute insistence that AAS was to be an entirely paperless system–the paper flight strips, previously printed out for each individual flight being tracked, were to disappear and be replaced by some virtual incarnation on the screen.

The new system was to be fully distributed. Processing would take place at each controller’s workstation: there would be no centralized server for an ATC facility. This would, of course, achieve a high degree of fault tolerance and availability. However, no one had fully thought out the problems of keeping the information in all these computers fully synchronized, and these problems turned out to be a lot harder than had been envisaged. “Over the years, I would observe tests and notice that the many instances of the altitude of one aircraft, spread across various workstations, would not match,” Britcher writes. Also, the system would need to be updated with new software without ever being shut down–“changing the fan belt while the engine is running,” in the hackneyed but vivid metaphor. The decision to operate without manual backup–necessary in the absence of the printed flight strips–made continuous operation even more critical. But no one really knew how to solve the update-while-running problem in a fully general way.

The system development was to be tightly managed. Everything was to be documented in detail and carefully reviewed (“Despite the tens of millions of dollars spent on new computers for the AAS, the most important piece of equipment on the project was the overhead projector”). Procedures were to be followed without deviation. Britcher reproduces a memo that captures the spirit of the suffocating bureaucracy:

Subject: In your mail (in reference to Harmon’s mistake)

In your mail you will be receiving a copy of a letter from Riebau to Dennis Trippel. The letter expresses a concern that IBM is modifying PU10 without FAA approval, namely by changing STNs (Software Technical Notes) that are fulfilling PU10 DID requirements…I will be working with my representatives to determine what ramifications this may have on the mechanism that we already have in place for getting FAA approval for STN changes. I hope that whatever we work out will be little to no impact on our internal process of STN change.

Have a nice day, Jenny.

What was this all about? Harmon, an IBM lead programmer, had decided to modify the procedure for code reviews. While the procedure had been for code to be read aloud by a third party, Harmon had decided this was too cumbersome, and for his team had directed that the programmer read his own code aloud at the reviews. It turned out that this was not in accordance with the system development plan, considered to be part of the contract. This was escalated to the highest levels on the project within both IBM and the FAA: the outcome, after expenditure of much executive time, was that Harmon was directed to return to the old way of doing things.

The project went on and on. Schedules slipped. Management changes were made. Schedules slipped some more. Despite all the “human factors” efforts, there were serious interface issues. One controller, after reviewing a prototype of the system, said (on the CBS Evening News) “It takes me twelve commands to do what I used to do with one.”

What was it like to work on this project? Rummaging through a closet one day, Britcher found an envelope left by someone who had left the company. The envelope contained a hand-printed document titled “A Brief History of the Advanced Automation System.”

A young man, recently hired, devotes years to a specification written to the bit level that will never be coded. Another, to a specification that will be replaced. Programmers marry one another, then divorce and marry someone in another subsystem. Program designs are written to severe formats, then forgotten. A man decides to become a woman and succeeds before system testing starts. As testing approaches, she begins a second career on local television, hosting a show on witchcraft…An ambitious training manager builds an encyclopedia of manuals no one will use. Decisions are scheduled weeks in advance…Human factors experts achieve Olympian status. The Berlin Wall collapses. The map of Europe is redrawn. Everything is counted…Dozens of men and women argue for thousands of hours: What is a requirement? A generation of workers retires. The very mission changes and only a few notice. Programming theories come and go. Managers cling to expectations, like a child to a blanket…The years rip by with no end in sight. A company president gets an idea: Make large small. Turn methods over to each programmer. Dress down. Count on the inscrutability of programming. Promote good news. Turn a leaf away from the sun. Maybe start over.

In 1994, David Hinson, the new FAA Administrator, terminated the project. The nation’s air traffic would continue to be directed by the code written in the late 1960s, as that code had been modified and extended over time.

In a sense, we were very fortunate in the aftermath of this project: no one was killed. As near as I can determine, there were no midair collisions that took place as a result of the unavailability of the enhanced capabilities which were to have been provided by the AAS. This speaks well for the work done by the original programmers back in the 1960s, whose work had proved robust enough to evolve far beyond what anyone could have reasonably expected when it was first developed, and for the skill and dedication of the nation’s controllers.

We may not be so lucky next time. Almost certainly, there are systems in the world for which the consequences of failure or unavailability could be far worse than even for the ATC system. Let us pray that one of these systems does not turn out, in retrospect, to have been the FBI Virtual Case File system.

It is important to think about why major software failures happen, and what can be done to minimize the occurrence of such failures in the future.

So, why did AAS fail? Congress, particularly Sen William Cohen, was quick to blame “poor management.” Britcher, though, has a problem with this facile explanation:

Poor management? For decades, the FAA has managed to keep the most complex system on earth running 24 hours a day, 7 days a week. I know of no more competent and committed managers.

Nor does Britcher blame the programmers who worked on the project:

I don’t know how they endured. They did good work. In the shank of the project, the suffocating bureaucracy and the beckoning of video games their friends in other places were creating in the new age of software did not affect them. They were a cadre of disciplined programmers, who would not cut corners on writing well-documented and thoroughly-read code, who tested their bottoms off, and believe in metrics, who wrote reliable programs, when they had every reason to give in to hurriedness. Or just give in.

So why is the failure rate so high among major software projects? In a future post, I’ll discuss some of Britcher’s thoughts, along with some of my own.

Britcher writes about his experiences with the AAS in Software Runaways, by Robert L Glass, and in his own book The Limits of Software.

Also, more on the Virtual Case File System here.