Of Aviation Crashes and Software Bugs
I just found out that Stephen Colbert’s father and two brothers died in a plane crash on September 11, 1974. Maybe everybody knows this – I’m not sure because I haven’t watched TV in years, so I live in a sort of alternate reality. My only exposure to TV is YouTube clips of Jon Stewart, Colbert, and lots of Dora The Explorer (Jon Stewart is my favorite but Swiper The Fox is a close second, don’t tell my kids though). Now, I may not have TV to keep me informed, but I do read aircraft accident reports and transcripts from cockpit voice recorders. That doesn’t help in small talk with the neighbors, but you read some amazing stuff.
For example, in the accident that killed Colbert’s father, the pilots were chatting about politics and used cars during the landing approach. They ignored their altitude and eventually ran the plane into the ground about 3 miles away from the destination airport. The report by the National Transportation Safety Board (NTSB) states that “both crew members [first officer and captain] expressed strong views and mild aggravation concerning the subjects discussed.” Since the full CVR transcript is not available, we’re free to imagine a Democrat and a Republican arguing amid altitude alerts.
Aviation accidents are both tragic and fascinating; few accidents can be attributed to a single factor and there is usually, well, a series of unfortunate events leading to a crash. The most interesting CVR transcript I’ve read is Aeroperu 603. It covers an entire flight from the moment the airplane took off with its static ports taped over – causing airspeed, altitude, and vertical speed indicators to behave erratically and provide false data – until the airplane inverted into the Pacific Ocean after its left wing touched the sea, concluding a mad, slow descent in which crew members were bombarded with multiple, false, and often conflicting flight alerts. The transcript captures the increasing levels of desperation, the various alerts, and the plentiful cussing throughout the flight (there’s also audio with subtitles). As you read it your brain hammers the question: how do we build stuff so things like this can’t happen?
Static ports covered by duct tape in Aeroperu 603
The immediate cause of the Aeroperu problem was a mistake by a ground maintenance worker who left duct tape over the airplane’s static ports. But there were a number of failures along the way in maintenance procedures, pilot actions, air traffic control, and arguably aircraft design. This is where agencies like the NTSB and their counterparts abroad do their brilliant and noble work. They analyze the ultimate reason behind each error and failure and then issue recommendations to eradicate whole classes of problems. It’s like the five whys of the Toyota Production System coupled with fixes and on steroids. Fixes are deep and broad, never one-off band aids.
Take the Colbert plane crash. You could define the problem as “chatter during landing” and prohibit that. But the NTSB went further: they saw the problem as “lack of professionalism” and issued two recommendations to the FAA with a series of concrete steps towards boosting professionalism in all aspects of flight. Further NTSB analysis and recommendations culminated a few years later in the Sterile Cockpit Rule, which lays down precise rules for critical phases of flight including takeoff, landing, and operations under 10,000 feet. Each aviation accident, error, and causal factor spurs recommendations to prevent it, and anything like it, from ever happening again. Because the solutions are deep, broad, and smart, we have achieved remarkable safety in flight.
In other words, it’s the opposite of what we do in software development and computer security. We programmers like our fixes quick and dirty, yes sirree, “patches” we call them. It doesn’t matter how critical the software is. Until 1997 Sendmail powered 70% of the Internet’s reachable SMTP servers, qualifying it as critical by a reasonable measure (its market share has since decreased). What was the security track record? We had bug after bug after bug, many with disastrous security implications, and all of them fixed with a patch as specific as possible, thereby guaranteeing years of continued new bugs and exploits. Of course this is not as serious as human life, but for software it was pretty damn serious: these were bugs allowing black hats to own thousands of servers remotely.
And what have we learned? If you fast forward a few years, replace “Sendmail” with “WordPress” and “buffer overflow” with “SQL injection/XSS”, cynics might say “nothing.” We have different technologies but the same patch-and-run mindset. I upgraded my blog to WordPress 2.5.1 the other day and boy I feel safe already! Security problems are only one type of bug; the same story plays out for other kinds of problems. It’s a habit we programmers have of not fixing things deeply enough, of blocking the sun with a sieve.
We should instead be fixing whole classes of problems so that certain bugs are hard or impossible to implement. This is easier than it sounds. Dan Bernstein wrote a replacement for Sendmail called qmail and in 1997 offered a $500 reward for anyone who found a security vulnerability in his software. The prize went unclaimed and after 10 years he wrote a paper reviewing his approaches, what worked, and what could be better. He identifies only three ways for us to make true progress:
- Reduce the bug rate per line of code
- Reduce the amount of code
- Reduce trusted code (which is different than least privilege)
This post deals only with the first item above; I hope to write about the other two later on. Reducing the bug rate is a holy grail in programming and qmail was very successful in this area. I’m sure it didn’t hurt that Bernstein is a genius, but his techniques are down to earth:
For many years I have been systematically identifying error-prone programming habits—by reviewing the literature, analyzing other people’s mistakes, and analyzing my own mistakes—and redesigning my programming environment to eliminate those habits. (…)
Most programming environments are meta-engineered to make typical software easier to write. They should instead be meta-engineered to make incorrect software harder to write.
In the 1993 book Writing Solid Code, Steve Maguire gives similar advice:
The most critical requirement for writing bug-free code is to become attuned to what causes bugs. All of the techniques in this book are the result of programmers asking themselves two questions over and over again, year after year, for every bug found in their code:
- How could I have automatically detected this bug?
- How could I have prevented this bug?
For a concrete example, look at SQL Injection. How do you prevent it? If you prevent it by remembering to sanitize each bit of input that goes to the database, then you have not solved the problem; you are using a band-aid with a failure rate – it’s Russian Roulette. But you can truly solve the problem by using an architecture or tools such that SQL Injections are impossible to cause. The Ruby on Rails ActiveRecord does this to some degree. In C# 3.0, a great language in many regards, SQL Injections are literally impossible to express in the language’s built-in query mechanism (LINQ). This is the kind of all-encompassing, solve-it-once-and-for-all solution we must seek.
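Here is a minimal sketch of the difference, using Python’s standard sqlite3 module. The users table, the two helper functions, and the attack string are made up for illustration. Note that a parameterized query only makes the safe path easy; the unsafe path is still expressible, so this is a step toward the goal rather than the “impossible to express” end state.

```python
import sqlite3

def find_user(conn, name):
    # The driver binds `name` as data; it is never spliced into the SQL text.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()

def find_user_unsafe(conn, name):
    # Anti-pattern, shown only for contrast: input is concatenated into the SQL.
    sql = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(sql).fetchall()

# Hypothetical in-memory table for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

attack = "x' OR '1'='1"
print(find_user(conn, attack))         # no rows: the whole string is treated as a name
print(find_user_unsafe(conn, attack))  # every row: the input rewrote the query
```

If every database call in a codebase has to go through something like find_user, the sanitizing step stops being something anyone has to remember.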
It’s important to take a broad look at our programming environments to come up with solutions for preventing bugs. This mindset matters more than specific techniques; we’ve got to be in the habit of going well beyond the first “why”. Why have we wasted hundreds of thousands of man-hours looking for memory leaks, buffer overflows, and dangling pointers in C/C++ code? It wasn’t just because you forgot to free() or you kept a pointer improperly, no. That was a symptom. The reality is that for most projects using C/C++ was the bug; it didn’t just facilitate bugs. We can’t tolerate environments that breed defects instead of preventing them.
Multi-threaded programming is another example of a perverse environment where things are the opposite of what they should be: writing correct threading code is hard (really hard), but writing threading bugs is natural and takes no effort. Any design that expects widespread mastery of concurrency, ordering, and memory barriers as a condition for correctness is doomed from the start. It needs to be fixed so that bug-free code is automatic rather than miraculous.
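One hedged illustration of “automatic rather than miraculous”: let a library own the threads and keep the work itself free of shared mutable state. The sketch below uses Python’s standard concurrent.futures module; word_count, total_words, and the sample strings are hypothetical names chosen for the example, not anything from a real project.

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(text):
    # Pure function: reads its argument and touches no shared state.
    return len(text.split())

def total_words(documents):
    # Workers never share mutable data, so there is nothing to lock.
    # Results are combined in the calling thread after the pool returns them.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return sum(pool.map(word_count, documents))

docs = ("static ports taped over", "sterile cockpit rule", "patch and run")
print(total_words(docs))  # 4 + 3 + 3 = 10
```

There are no locks here to forget and no ordering to get wrong, because the design never gives two threads the same piece of mutable data.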
There are a number of layers that can prevent a bug from infecting your code: software process, tools, programming language, libraries, architecture, unit tests, your own habits, etc. Troubleshooting this whole programming stack, not just code, is how we can add depth and breadth to our fixes and make progress. The particulars depend on what kind of programming you do, but here are some questions that might be worth asking, in the spirit of the questions above, when you find a bug:
- Are you using the right programming language? Does it handle memory for you? Does it help minimize lines of code and duplication? (Here’s a good overall comparison and an interesting empirical study)
- Could a better library or framework have prevented the bug (as in the SQL Injection example above)?
- Can architecture changes prevent that class of bug or mitigate their impact?
- Why did your unit tests fail to catch the bug?
- Could compiler warnings, static analysis, or other tools have found this bug?
- Is it at all possible to avoid explicit threading? If so, shun threads because they’re a bad idea. Otherwise, can you eliminate bugs by isolating the threads (reduce shared state aggressively, use read-only data structures, use as few locks as possible)?
- Is your error-handling strategy simple and consistent? Can you centralize and minimize catch blocks for exceptions?
- Are your class interfaces bug prone? Can you change them to make correct usage obvious, or better yet, incorrect usage impossible?
- Could argument validation have prevented this bug? Assertions?
- Would you have caught this bug if you regularly stepped through newly written code in a debugger while thinking of ways to make the code fail?
- Could software process tools have prevented this bug? Continuous integration, code reviews, programming conventions and so on can help a lot. Can you modify your processes to reduce bug rate?
- Have you read Code Complete and The Pragmatic Programmer?
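To make the questions about bug-prone interfaces and argument validation concrete, here is a small sketch under stated assumptions: Port and connect_string are hypothetical names invented for this example. The idea is to validate once, at construction, and freeze the value so an invalid one cannot exist later. Python does not check the type hints at runtime, so this narrows the room for misuse rather than eliminating it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    number: int

    def __post_init__(self):
        # Validation lives in one place; callers cannot skip it.
        if not 1 <= self.number <= 65535:
            raise ValueError(f"port out of range: {self.number}")

def connect_string(host: str, port: Port) -> str:
    # No range check needed here: a Port that exists is already valid.
    return f"{host}:{port.number}"

smtp = Port(25)
print(connect_string("mail.example.org", smtp))

try:
    Port(70000)
except ValueError as exc:
    print(exc)
```

Because the dataclass is frozen, assigning smtp.number = 0 after construction raises an error instead of quietly producing a bad value.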
Just as airplanes still crash, we’ll always have our bugs, but we could do a lot better by improving our programming ecosystem and habits rather than just fixing the problem of the hour. The outstanding work of the NTSB is great inspiration. I’m still scared of flying though – think of all the software in those planes!
Comments
26 Responses to “Of Aviation Crashes and Software Bugs”
Excellent post as usual.
I would add to your list:
* Have you been effectively trained in the use of the tools at your current job?
* Have you been effectively trained in the business rules of your current job and do you know why you are writing the code you have been asked to write?
In my career the above has almost never happened at any of my jobs. I would even go so far as to say that the ability to function in the absence of the above defines what we call a “good” programmer.
Other industries would never tolerate the absence of specific and relevant on the job training that is common in software development teams.
I checked out the Writing Solid Code page at Amazon, and it looks like it’d be a good read with one unfortunate exception — it seems, from what the Amazon page says, to be very Microsoft-platform centric. Is that an incorrect impression, or would I basically need to start doing more development for Microsoft platforms and with Microsoft tools to be able to make the best use of the book?
Also . . . you didn’t exactly endorse the book, you just quoted it. Is it actually all that good overall?
As for the rest of the essay, I’ve got to say that I find it pretty agreeable. Thanks for a good write-up of the topic.
I’ve often joked that the quickest way to reduce the amount of trusted code is to require all code that runs with privileges to be coded in upper case.
…you know how much programmers and anyone with a clue hates upper case.
I’m telling you… it would work!
Hi there,
Thanks for the comments, I’m glad you guys like the post.
@Andrew: I totally agree. It’s insane, really, the way most companies handle this. Especially because it’s a penny wise pound foolish thing: they waste a lot of money due to this lack of training.
This post sort of focused on security – in that area I see companies scrambling to hire application security people, spending a ton of money to patch broken apps, etc – when they could easily train people on how to write better code, and most programmers would actually enjoy learning.
The point about the business rules is even better though. Gosh, how many projects were horribly off-target because people didn’t have the _faintest_ idea of the business behind the app – not to mention the other way around, how much improvement is missed on because programmers don’t know how they could help out.
Yea, training definitely belongs in the list.
@apotheon: It does look like that from the Amazon page, but it’s not at all Microsoft centric. I guess back in 1993 it was cool to put “Microsoft” on book covers. How times change. hehe.
Now, the book _is_ C-centric. The examples are all C, and some of the information is specific to C. I’d say if you don’t do C programming, you would miss out on some of the book’s value.
With that caveat, I do think it’s an excellent read, full of good advice, well written, and the author is clearly a great programmer.
@anonymoustroll: hahahah. Sounds good to me. Maybe we should write Bernstein and suggest it.
Excellent post. But I think we software engineers have it all wrong – it’s not the implementation that’s wrong, it’s the design that’s wrong:
http://softwareindustrialization.com/TheHumbleSoftwareEngineer.aspx
One of the software development processes that works a little closer to the NTSB/FAA feedback loop would be the one behind the code that launches the space shuttle.
I remember seeing a dead-tree article about that several years ago, but unfortunately I don’t remember the publication or even know if it’s available online.
Ask around… it’s worth the effort to track it down if you’re really interested in how software development can be done right.
@Mitch: thanks for the link, this Alloy stuff from MIT is really interesting. My undergrad was math and I loved analysis and proofs, so I’m pretty interested in this kind of thing. Bernstein himself is a mathematician. However, I hold the standard position that the ‘traditional’ software proofs are impractical, but I also think that there’s likely to be a middle ground where the compiler can do a lot more work for us.
With threading for example, it seems SO likely that you should be able to _state_ the behavior you need with respect to concurrency and let the compiler figure it out, rather than doing lock()s yourself.
So, anyway, some sort of hybrid between full verification and zero verification. I need to read more on this, I’m pretty ignorant, so thanks for the link.
@anonymoustroll: Sounds interesting, thanks for the tip. I’m currently without a way to do good journal / academic searches, but I’m setting one up soon, and I’ll look for this. I have read some about the shuttle software engineering (I wrote a post about Feynman, the Challenger, and software in February), but I’d love to learn more about that stuff.
> Is it at all possible to avoid explicit threading? If so, shun threads because they’re a bad idea.
Threads are not inherently a bad idea, and using them does not cause errors any more than using a relational database causes SQL injection. Just like the solution to SQL injection is a layer of abstraction over the underlying SQL, the solution to threading-related bugs is to use a language or library that abstracts away the details. The Erlang language is an excellent example of this.
[edited at 3:24pm]
@Name: My first response was hurried – I was going to the park with my kids. heheh. I actually completely agree with what you say here. The concept of threading is not inherently busted – it better not be, given the reality of multi-cores. But the way most languages approach it is screwed up; Erlang is an exception to the rule.
Here’s the original:
@Name: it is _far_ more difficult to prevent threading problems in most languages than it is to prevent SQL injections. And the threading API of most languages is inherently a bad idea.
But you’re right, Erlang is a great example of how to go about it. We also need strategies for other languages though, and I suspect using immutable/read-only data structures might be a big part of it.
> It does look like that from the Amazon page, but it’s not at all Microsoft centric. I guess back in 1993 it was cool to put “Microsoft” on book covers How times change. hehe.
Okay, thanks. That’s a relief. I’m moving it from “low” priority on my Amazon wishlist up to “medium”, as a result of that clarification.
> Now, the book _is_ C-centric. The examples are all C, and some of the information is specific to C. I’d say if you don’t do C programming, you would miss out on some of the book’s value.
That’s perfectly fine by me. One of my next major goals with regard to programming projects (by the end of this year, I hope) is to start seriously re-familiarizing myself with C. It’s been a while since I’ve really done anything with C, and I feel the need to get back to it. So: sounds like a good book to add to the queue.
As Dijkstra said, “program testing can be used to show the presence of bugs, but never to show their absence!”
I hope you get a chance to check Alloy out more closely because it is definitely not like “traditional” formal methods:
http://www.sciam.com/article.cfm?id=dependable-software-by-de&print=true
Keep up the good writings!
@apotheon: you’re welcome. Now that you mention you’re particularly interested in C though, there’s another book I would recommend:
Expert C Programming
This one is more popular, it’s a general (advanced) C book and not focused on correct code like Writing Solid Code. But it’s a really good book, very well written, one of the best programming books I’ve read. The guy is a master of C, goes over exactly the trouble spots, and writes in a fun (not dorky) way and mixes in some cool CS examples and folklore. Really cool book, I think all C programmers should read it, and due to the writing style it’s pleasurable to do so.
@Mitch: cool, I’m definitely going to check it out.
After a couple of minutes of google searching (I guess the article isn’t *that* old):
http://www.fastcompany.com/node/28121/print
Thanks for the recommendation. I’ve added that to my Amazon wishlist, too.
enjoyed your article, keep up the good work!
@Mitch: It’s a little bit ironic that you use Alloy as your example of static verification, since it operates under the assumption that counterexamples will be small. Alloy does bounded verification, guaranteeing that your model is correct — for models up to some size. What Alloy does not offer is a wholesale proof of correctness.
Better examples of correct-by-construction programs would be the Coq/Isabelle/etc. crowd, which can guarantee rigorous correctness, as well as some of the fancier systems like Sage (which makes an interesting compromise between static and dynamic guarantees, treating propositions that the theorem prover can’t guarantee as assertions to be checked at run time). Granted, all of those systems still aren’t ready for use as real programming languages, and are entirely unsuited to some tasks. On the other side of the coin, (unbounded!) SAT solvers offer real proof at cheaper and cheaper prices (e.g., Saturn).
Hopefully the success of GC in (now, rather efficiently) eliminating a class of problems will eventually spread to other elements of programming languages and design processes. Tools like Alloy, Coq, and Saturn all have something to contribute.
I feel very much the same way. Let’s face it. Most software sucks. Look at the source code of any open source application and it’s clear they have had more than one developer working on the source.
Something I strive for is such high consistency that it looks like my code was generated. I find consistency helps me keep my train of thought without getting lost in little details.
Architecture is also something usually missing from most PHP software projects. Look at all the open source projects and try to find any kind of consistency in architecture and you’ll be discombobulated in minutes.
http://www.fastcompany.com/magazine/06/writestuff.html
I see someone already posted that link above but it’s always been a motivator for me to write solid software.
Their bugs per line is extremely low and rightfully so…I think I got into the wrong area of software development. I should have tried to get into mission critical systems as opposed to buggy, slow, insecure, bloated carelessly written software.
I liked reading your blog…keep up the good work.
The blog contains nice as well as important tips on aviation safety. I like it, and would like to congratulate the author.
Most computer monitors these days already use LCD technology, and some are LED-LCD.
You mentioned a fear of flying due to the software. I believe statistically very few crashes are caused by faulty software, it is mostly human error!