Richard Feynman, the Challenger Disaster, and Software Engineering

Challenger Crew

On January 28th, 1986, Space Shuttle Challenger was launched at 11:38am on the 6-day STS-51-L mission. During the first 3 seconds of liftoff the o-rings (o-shaped loops used to seal the joint between two cylinders) in the shuttle’s right-hand solid rocket booster (SRB) failed. As a result hot gases with temperatures above 5,000 °F leaked out of the booster, vaporized the o-rings, and damaged the SRB’s joints. The shuttle started its ascent, but seventy-two seconds later the compromised SRB pulled away from the Challenger, leading to sudden lateral acceleration. Pilot Michael J. Smith uttered "Uh oh" just before the shuttle broke up. Torn apart by excessive force, it disintegrated rapidly. Within seconds the severed but nearly intact crew cabin began to free fall and seven astronauts plunged to their deaths. I was a child then and remember watching in horror as Brazilian TV showed the footage.

Challenger Explosion

At the time I didn’t know that SRB engineers had previously warned about problems in the o-rings, but had been dismissed by NASA management. I also didn’t know who Richard Feynman or Ronald Reagan were. It turns out that President Reagan created the Rogers Commission to investigate the disaster. Physicist Feynman was invited as a member, but his independent intellect and direct methods were at odds with the commission’s formal approach. Chairman Rogers, a politician, remarked that Feynman was "becoming a real pain." In the end the commission produced a report, but Feynman’s rebellious opinions were kept out of it. When he threatened to take his name out of the report altogether, they agreed to include his thoughts as Appendix F - Personal Observations on the Reliability of the Shuttle.

It is a good thing it was included, because the 10-page document is a work of brilliance. It has deep insights into the nature of engineering and into how reliable systems are built. And you see, I didn’t put ‘software’ in the title just to trick you. Feynman’s conclusions are general and very much relevant for software development. After all, as Steve McConnell tirelessly points out, there is much in common between software and other engineering disciplines. But don’t take my word for it. Take Feynman’s:

The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes.

So software is not the only discipline where the longer a defect stays in the process, the more expensive it is to fix. It’s also not the only discipline where a "top down" design, made in ignorance of detailed bottom-up knowledge, leads to problems. There is however a difference here between design and requirements. The requirements for the engine were clear and well defined. You know, go to space and back, preferably without blowing up. Feynman is arguing not so much against Joel’s functional specs, but rather against top down design such as that advocated by the UML as blueprint crowd. On goes Feynman:

The Space Shuttle Main Engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of, or outside of, previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in the top-down manner, they are difficult to find and fix. The design aim of a lifetime of 55 missions equivalent firings (27,000 seconds of operation, either in a mission of 500 seconds, or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts, such as turbopumps, bearings, sheet metal housings, etc.

Richard Feynman

Unfortunate top down manner, difficult to find and fix, failure to meet design requirements, frequent maintenance. Sound familiar? Is software engineering really a world apart, removed from its sister disciplines? Feynman elaborates on the difficulty in achieving correctness due to the ‘top down’ approach:

Many of these solved problems are the early difficulties of a new design. Naturally, one can never be sure that all the bugs are out, and, for some, the fix may not have addressed the true cause.

Whether it’s the Linux kernel or shuttle engines, there are fundamental cross-discipline issues in design. One of them is the folly of a top-down approach, which ignores the reality that detailed knowledge about the bottom parts is a necessity, not something that can be abstracted away. He then talks about the avionics system, which was done by a different group at NASA:

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product.
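For readers who want that passage in present-day terms, here is a minimal Python sketch of the same progression: check the smallest piece, then the module built on it, then the assembled whole, and finally let an "independent" check attack the result the way a suspicious customer would. Everything in it is my own illustration and an assumption, not anything taken from the shuttle software: the names `clamp`, `throttle_command`, and `run_profile` are hypothetical, and the 60 to 100 percent limits are made-up numbers.

```python
import math


def clamp(value, low, high):
    """Smallest unit: restrict value to the closed range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))


def throttle_command(requested_pct, min_pct=60.0, max_pct=100.0):
    """Module built on clamp(). The 60-100 limits are illustrative only."""
    requested = float(requested_pct)
    if math.isnan(requested):
        # Without this guard, max()/min() would quietly turn NaN into min_pct.
        raise ValueError("throttle request is not a number")
    return clamp(requested, min_pct, max_pct)


def run_profile(requests):
    """Widest scope: the 'complete system' maps requests to commands."""
    return [throttle_command(r) for r in requests]


def check_units():
    # Step 1: check the smallest piece in isolation.
    assert clamp(5, 0, 10) == 5
    assert clamp(-1, 0, 10) == 0
    assert clamp(11, 0, 10) == 10


def check_module():
    # Step 2: widen the scope to the module that depends on it.
    assert throttle_command(50) == 60.0
    assert throttle_command(80) == 80.0
    assert throttle_command(150) == 100.0


def check_system():
    # Step 3: check the assembled whole.
    assert run_profile([0, 75, 200]) == [60.0, 75.0, 100.0]


def adversarial_checks():
    # Independent verification: assume the code is wrong and try to prove it.
    for hostile in (float("inf"), float("-inf"), 1e308, -1e308):
        assert 60.0 <= throttle_command(hostile) <= 100.0
    try:
        throttle_command(float("nan"))
    except ValueError:
        pass
    else:
        raise AssertionError("NaN request was silently accepted")


if __name__ == "__main__":
    # Scope grows one step at a time, exactly in this order.
    for check in (check_units, check_module, check_system, adversarial_checks):
        check()
    print("all checks passed")
```

The order of the checks is the point. If `check_system` fails while `check_units` and `check_module` pass, the defect can only be in the thin layer added last, which is what makes it cheap to find.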

Yes, go ahead and pinch yourself: this is unit testing described in 1986 by the Feynman we know and love. Not only unit testing, but ‘step by step increase’ in scope and ‘adversarial testing attitude’. It’s common to hear we suck at software because it’s a "young discipline", as if the knowledge to do right has not yet been attained. Bollocks! We suck because we constantly ignore well-established, well-known, empirically proven practices. In this regard management is also to blame, especially when it comes to dysfunctional schedules, wrong incentives, poor hiring, and demoralizing policies. Management/engineering tensions and the effects of bad management are keenly discussed by Feynman in his report. Here is one short example:

To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history.

This is one of many passages. I picked it because it touches on other points, such as the ‘attitude of highest quality’ and the ‘process of gradually fooling oneself’. I encourage you to read the whole report, unblemished by yours truly. With respect to software, I take out four main points:

There are other interesting themes in there, and Feynman’s insight can’t be captured in a few bullet points, much less by me. What do you get out of it?

Feynman's last board at Caltech

Comments

20 Responses to “Richard Feynman, the Challenger Disaster, and Software Engineering”

  1. John Ingle on February 20th, 2008 11:27 am

    You’ve written a fantastic article! I enjoyed reading it and am now excited about reading the full report. It’s amazing what we can learn from history. :) If you know of any similar documents doing post-mortems on the solutions to hard problems, I would be very interested in hearing about them. Especially if they’re written by anyone as engaging as Dr. Feynman!

  2. V-dawg on February 20th, 2008 12:57 pm

    Great Article Gustavo, I couldn’t agree with you more. Fundamental good design practices apply to any engineering discipline.

  3. mantrid on February 21st, 2008 4:36 pm

    Excellent article! I just had a thought that there is nothing bad in a top-down approach in some circumstances. We’d like to build systems in a top-down manner because it’s simpler than bottom-up and easier to manage. We’d like to build software as easily as we build houses. Perhaps we use the top-down approach too early? I mean, top-down is good when we have reliable building blocks and we know almost all the associations and implications between those bricks. But with today’s software engineering the problem is that we do not have such reliable blocks. As long as we’re discovering design problems and reliable solutions, we should stick to the bottom-up approach, which will eventually let us shape those bricks and lead us one day to common building blocks for software. Only then would it be safe to ease the development process with a top-down attitude. 10 years ago a developer had to consider hardware-specific and low-level OS issues. But in time we learned how to address those problems and came up with hardware abstraction, common protocols, interfaces, APIs, patterns, do-s and don’t-s. Today we don’t have to start bottom-up on such a low level as 10 years ago. We can start at a slightly higher level. Perhaps in time we will go higher again? How much higher will we be able to start from?

  4. Paul Wassmund on February 21st, 2008 7:04 pm

    I think it’s largely a matter of finding the correct balance. There IS a place for top-down design - it lets us consider the overall structure of the software, rather than just patching together a bunch of small pieces; it lets us address non-functional requirements such as reliability, extensibility, etc.; it allows us to make use of architectural patterns and design patterns that have been proven to be effective in other real-world applications. However, once we’ve established a design framework, we can build it from the bottom up, following the incremental approach (with testing at each step of the way) advocated here. Even during the top-down phase, we can do deep dives into areas in which there aren’t well-defined patterns to guide us, or where we don’t have experience with new hardware (this is a form of risk mitigation). It’s up to the engineers (and management) to determine what’s appropriate for a given situation, but to state that top-down is inherently bad is, I think, inaccurate.

  5. Austin on February 21st, 2008 7:07 pm

    Hogwash. The point of your article seems to be that top-down design leads to problems. I get this impression from "…’top down’ design, made in ignorance of detailed bottom-up knowledge, leads to problems." Yet, the stuff you later cite from Feynman doesn’t support this. The quote relating a desire for a "step-by-step increase in scope" isn’t about design, it’s about _implementation_. Design and implementation happen in SEPARATE PHASES OF PROJECTS. Top-down design, bottom-up coding. This is the way every software company in America writes software. You can’t design a system from the bottom up. Go ahead and try, I dare you… it doesn’t work out better.

  6. Gustavo Duarte on February 21st, 2008 7:20 pm

    Thank you all for the kind words.

    @John: Feynman’s appendix is by far the coolest thing I know in this regard. I’m glad this thing got Slashdotted so more people got to know it. I don’t know of anything like this off-hand.

    @Mantrid, Paul: I agree _completely_ with what both of you are saying. I’m going to write a follow up clarifying some of these things and what exactly I mean by ‘bottom up’. I’ve had the same discussion over email with someone at NASA. But I think you’re exactly right.

    @no no: Did you read Feynman’s appendix? What did you think of it? Do you think I interpreted it incorrectly?

  7. Austin on February 21st, 2008 9:12 pm

    I did read the appendix, and I don’t think the section you’ve cited is quite relevant. My point is you’re conflating the design and implementation phases in your article here. It’s one thing to criticize big up-front design/waterfall style project management, but another entirely to propose that "top-down" design is bad. Every engineering project is fundamentally top-down: you start with a goal (get a rocket into orbit, say) and figure out the details of how to make that work later. Design _has_ to be top down. Big up-front design, as mentioned by Feynman, is the idea that you can plan everything about a project and then put everything together in a minimal implementation phase. I think you meant to criticize this method of project management, since designs like this fail. But the fact remains, you must have a high-level design phase before you try to implement anything. You can’t design "bottom-up".

  8. cjones on February 21st, 2008 10:10 pm

    Fascinating article and I think it explains very clearly the common differences involved between those who manage a project and those who actually are getting the job done. This is a classic problem in most development shops where the people managing the developers/engineers have never actually coded anything themselves. Austin, stop being such a nerd and enjoy the damn article. You wanna show how much you know go to Slashdot and run your mouth know-it-all.

  9. Steve P. on February 21st, 2008 10:21 pm

    Good article, though you might have emphasized more that Feynman himself did not discover the temperature problem with the O rings. Here the true mark of his independence and basic honesty was that he was willing to publicly and very convincingly demonstrate it in front of cameras after being tipped off about the problem. A lesser person would have allowed the result to have been buried in the report to minimize the embarrassment to NASA and the US government. Feynman correctly realized the magnitude of the engineering failure of which the O ring problem was just one symptom, and was not willing to let anybody get off lightly.

  10. Merlin Silk on February 22nd, 2008 12:08 am

    Richard Feynman was a little bit too difficult for me when I was in college studying physics, or maybe he was too simple - - thinking about it, it must have been the latter. Because everything else in the lectures and classes was so darn difficult that what he had to say could not be so easy. I still have his Feynman Lectures on Physics here on my book shelf, maybe it’s time to try to read them now. And with his ability to see things in simple ways he also impressed me in his biography, "Surely You’re Joking, Mr. Feynman!" - so, yes, this guy can think, but what’s more important, he has the ability to live life to the fullest. In software development I have always followed the bottom up approach and caused some nervous breakdowns in managers because to them it seemed very slow. I was always vindicated because my stuff never needed any big fixes later, while the top-downers were later re-inventing bigger parts of their work. So, yes, great article! Merlin http://www.MerlinSilk.com

  11. Gustavo Duarte on February 22nd, 2008 1:18 am

    Thanks again for the compliments.

    @Austin: I do see your point. Here’s what I wrote to someone about this: "I completely agree. That is one thing I do wish I had made clearer. I don’t interpret Feynman’s "bottom up" at all as meaning the chaotic, directionless processes we sometimes see in the industry. As he talks about "preliminary study of materials and components" in relationship to the engine, it’s clear to me that such a study would take place within the context of a plan and a preliminary design. After all engineers can’t randomly test materials until a space shuttle engine crystallizes in front of them :)" I am going to have a follow up entry addressing some of these points. However, I do think that many thoughtful teams are top-heavy, and would benefit from de-emphasizing the schism between design and build. I think you and I do differ in that I do not see the phase rigidity you do. I’m ignoring the chaotic crap shooters in this discussion though, I wouldn’t call them ‘bottom up’, just ‘a mess’.

    @Steve: that’s also a good point. I didn’t want to get too much into the o-ring history though. If I had known this would be read more widely than by my mom and dog, I might have talked some more about the history. So it goes. I’m just really happy that a lot of people got exposed to Appendix F, so all in all I’m happy.

  12. ranjix on February 22nd, 2008 7:47 am

    Tried to read earlier but was slashdotted. Anyway. Tried to read the little parallel between the avionics engineering and software. Not very good. Tried to read the appendix from Mr. Feynman. I find it boring, irrelevant, unconvincing, hiding behind the words. Sorry. Literature, not engineering. Too polite. "Let us make recommendations to ensure that NASA officials deal in a world of reality in understanding technological weaknesses and imperfections well enough to be actively trying to eliminate them." How about "this bulleted list is what I recommend and why: …"? Leave the fluff for the school play. Further. Somehow the texts (above and the appendix) leave the impression that top-down=bad and bottom-up=good. Hardly believable. Reader Austin’s remarks above are right on the spot. Most of the projects start with a goal, followed by a top-down analysis and design. When you start writing code, the process automatically becomes bottom-up. Bottom-up design would mean "hey, I have this screw and this nut, let’s put them together, what do I see? Yeah! It looks like a part of a rocket!" Sorry for the stretched example, but that’s bottom-up for you. Also, there is a tendency to compare software with any other kind of engineering. Frankly I’m not sure it applies. In "normal" engineering, the materials are usually well tested, and the number of "moving pieces" is fairly limited. A bridge has fewer pieces than a car (probably) which has fewer pieces than a rocket. Also, the results of the "normal" engineering are visible, it’s easier to spot a screw with the wrong dimension than one of the 6 constructors with 2 parameters in the wrong order. To make it more like "software engineering", one has to add an unknown number of uncontrollable variables (os, dbs, messaging systems, firewalls, etc, none of them thoroughly tested), would have to make the engineers work in semi-darkness, and would have to make one feel like crap when the project is "done". I guess a question would be - at which point does any project become unmanageable? Be it a rocket or an os. It looks to me like there is a hard (human) limit. But enough for now.

  13. Mike Petry on February 22nd, 2008 6:01 pm

    Great post! I appreciate this work and I am definitely a fan of the late Dr. Feynman, but I am a proponent of software design. As an advocate for top-down software design, I typically find myself in the minority and have been meaning to blog on this topic myself. I found this posting to be a compelling counterexample to my own take on the subject and was motivated to post to my own blog. Although a fan of top-down design, I also believe in incremental development and find agile methods to be valid. I just happen to have had a lot of success developing very complex systems using UML. Of course UML is not a silver bullet and must be used in conjunction with good old exploratory hacking or the management-friendly term "risk reduction prototyping". Thanks for the compelling and controversial posting.

  14. otter on February 29th, 2008 5:23 am

    Ranjix, you said that ‘In "normal" engineering, the materials are usually well tested, and the number of "moving pieces" is fairly limited.’ But in the case presented, Feynman stated ‘The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components.’ Anyways, I thought it was a great article and the opposing comments also gave a lot to think about. I guess that in the end the success of every project, be it software or engineering, will depend on striking the right balance between top-down, bottom-up, and management-worker relations.

  15. mentalmodel on March 8th, 2008 11:18 am

    I think what makes the space shuttle example appropriate for software engineering is that both are systems that are fairly new, unknown, and complex. We haven’t evolved a professional discipline dedicated to space shuttle design and manufacture, so a lot of the problems encountered in the development of a reusable re-entry space vehicle are novel. To me, computer science is to software engineering as physics is to space shuttle design. It’s amazing to reflect on how little we know about software development. That said, I’m sure there was no shortage of self-proclaimed experts in space shuttle design inside of NASA at the time of the Challenger disaster. Will a similar event be necessary in order for there to be a reassessment of how software gets made?

  16. AVEng on March 13th, 2008 3:06 am

    I have worked in the avionics field for 15 years, as a system engineer, a software engineer and a technical lead. I’ve worked on the control systems for commercial jets, for small business jets and for UAVs. I’ve worked on collision avoidance systems, on performance/prediction systems and on flight/navigation systems. Richard Feynman’s observations are brilliant and exactly on target. He illustrates perfectly the problems I’ve encountered throughout my entire career. The TQM top-down approach does not work, and never has. Plus, the "UML as blueprint" model is an absolutely catastrophic addendum to this already flawed approach, because not only does it abstract the design even further, in the cases where I’ve seen it employed it utterly fails to consider failure modes. No, the top-down approach is only good for one thing - frameworking. You use a top-down approach to design a system’s skeleton. After that, you immediately switch to a rapid prototyping paradigm to backfill the system’s details. That’s the only approach that works, and ALL systems are built this way, whether they officially recognize it or not.

  17. Reid Peryam on March 14th, 2008 1:09 pm

    A text book example of Déformation professionnelle ;)

  18. Gustavo Duarte on April 2nd, 2008 9:21 am

    @mentalmodel: that’s a good point, and Feynman explicitly states this in the report ("It is built at the edge of, or outside of, previous engineering experience."). But there’s a lot of engineering that happens where boundaries are being pushed, enough that I don’t think it’s fair to say "standard" engineering is about solving previously solved problems. Civil engineering, at least from the outside, is one where it does look like things are pretty formulaic by now - after all we’ve had buildings for a few thousand years (though things aren’t so clear, as per Steve McConnell’s articles above). But in most branches I think the envelope is pushed more often than not. So the techniques for dealing with the new are fundamental to the field of engineering as a whole, and here there’s a lot of common ground with software.

    @AVEng: thanks for the feedback, it’s always interesting to hear from people doing the kinds of systems Feynman talked about. Your take on top/down is exactly what I do in my own projects: use top/down for a general skeleton, and then switch to reality: experimentation and prototyping.

  19. Me, myself and I on June 19th, 2008 12:37 am

    IMO, there’s nothing wrong with a top-down approach at the right place.

    More exactly: you should use a top-down approach to detail the problem to be solved and to chop it into many more smaller problems, and then use a bottom-up approach to solve the individual problems.

    I always make a difference between architecture and design/engineering. Architecture is a direct result of requirements, and IMO transforming requirements into architecture is OK, since it is a safe way to create a vision of a product which matches the customer requirements. Nevertheless, it just creates a vision, one that may be proven unrealistic by engineers.

    However, letting engineers design the whole system bottom-up, starting from whatever they think is right and trying to adapt to each and every customer requirement incrementally, is IMO a bad approach, potentially yielding an extraordinary product from a technical point of view, but which no customer/end user would want to use, and potentially investing a lot more effort than appropriate.

    Of course, it greatly adds to the quality of design if architects are good programmers, and if engineers have direct contact to customers and are given a chance to train in understanding customer and end-user requirements - as a programmer, really putting yourself in an end-user’s shoes is harder than you may think.

    What do you have against UML? It’s a perfectly valid documentation tool, mostly for already written code, even if, once you are used to it, you start by creating your code skeleton in UML - the blueprint. Of course, there are potentially many wrong uses of UML, but IMO it is a wonderful tool for documentation - of course, a bad designer will do a bad design with or without UML, but a good designer will do a good design more easily using UML. Besides, no respectable engineering discipline works without blueprints, therefore I suppose UML is rather a part of the solution than a part of the problem.

    As for prototyping, it is IMO the worst approach. It should be used only when a completely new problem has to be solved, one for which there is nothing related ever done before, for which even requirements need to be defined by prototyping. It is the most expensive solution to drive a SW project. (We just finished such a project, where prototyping was a customer requirement. The product is incomplete, it took at least twice as much time to build, and it is our estimation that by using our traditional approach we could have delivered the same product, with more features, with a three times lower budget.)

  20. Gustavo Duarte on June 19th, 2008 10:28 pm

    @Me, myself and I:

    Thanks for the well written comment.

    You know, these discussions are hard because the terms are so ill-defined. It’s not as if we’re looking at a concise, well-defined description of a process and then analyzing that. With that in mind, here’s my answer to some particular points. Also, the post following this one, Reality-Driven development, addresses some of these points.

    So here we go:

    Yes, you do need some top down analysis to be sure. This point is addressed in the following post.

    Regarding vision, it can both help and hurt. It often means people stick to an idea or plan, and start ignoring reality and failing to adapt. This happens in many spheres of life, far beyond just software. It’s like CEOs whose “visions” get companies stuck in an ill-fated trail, ignoring markets, ignoring new evidence, pursuing grand plans full of risk rather than improving incrementally on the fundamentals. Then blowing the company up.

    In software, what you get is a vision that more often than not:

    1) Comes from people who are technically incompetent, ignorant, and haven’t done software for years

    2) Is set in stone, due to organizational realities. Leading to:

    3) One hell of a mess where programmers need to work around brain-dead architectures.

    Does that mean that architecture is bad? No, of course not. You can do it right, but imho doing it right means realizing that architecture is a _proposed_ way to handle things, a sort of “hmm, let’s start going over here” rather than a detailed map to crapdom.

    However, here the lack of any rigor in definitions bites us. It seems we’re saying something similar, since you say that it “may be proven unrealistic by engineers”. I suspect we’re more in agreement than not regarding this point.

    I completely agree with the cost thing. Check the next post, where I say that proper high-level design and requirements can dramatically cut cost and risk.

    I think architects who are not good programmers should be shot. The ones who don’t program at all should be impaled. This idea of the “non-programming architect” is one of the most stupid things going on in enterprisey development imho.

    I totally agree on the user’s shoes thing. That’s one of the big failures of software development, the lack of attention to this. I think _every_ developer on a team should spend time _living the user’s day_ at some point, as early as possible, truly understanding - and hopefully doing, briefly - the job that will be impacted by the software.

    Regarding UML, I don’t have anything against UML itself, it’s a tool as you point out. Can be used well or poorly. Trouble is the latter is far more frequent. I’ve done much UML over the years, used several tools: Poseidon, Enterprise Architect, Rational stuff, Visio, you name it. I can do diagrams pretty well and pretty fast. So I’d say I’m an “experienced” UML user.

    I’d say this:

    1) UML diagrams often communicate less effectively than a simple block diagram or English text. They’re also expensive to produce in many organizations.

    2) Since people don’t really read the books and the specs, UML is pretty loose. I’ve seen so many non-sensical package, component, and class diagrams, gosh, it’s crazy. Is that UML’s “fault”? Well, yes and no, but we must evaluate UML in terms of its real-world usage, not some ideal UML conforming paradise.

    3) Designing software classes graphically is silly. Whiteboarding with UML to get ideas going? Sure. Sitting down and “designing” software in UML? I think that’s a very bad way to go, often used by the non-coding architect types.

    4) The blueprint of the software engineering discipline is code. For software, code _is_ design. This is a fundamental point for me, something I believe deeply in. Diagrams are aids to _understanding_ the blueprint. It’s like looking at diagrams of an aircraft design that give you a high-level overview of what’s going on, to help you make out the forest. Here UML can indeed help.

    5) Again, your usage of UML sounds very sane to me. Using it as a documentation tool for already written code: perfect. Using it as a tentative skeleton: perfect, though for me it’s more efficient to just use code for that. But the “UML as blueprint” crowd labors under the illusion they can nail down every aspect of a “design”, every class and field and method, and then hand it down to programmers. _That_ is insanity imho.

    Nowadays I use UML in three primary ways: light whiteboarding, as understanding aids in architecture documents (my role is often of “architect”), and as documentation for a built system. But I must say in general I think UML has been more of a problem than a solution for software.

    Regarding the prototyping, I fully agree that decent requirements and good high-level direction at the start can reduce cost.

    This is an interesting discussion, I love to talk about this stuff. Thanks again for the well-made points, I wish we had more time to define things properly and speak more precisely about all this. I hope to start posting more often, then I’ll elaborate on some of these points.
