Closures, Objects, and the Fauna of the Heap

The last post in this series looks at closures, objects, and other creatures roaming beyond the stack. Much of what we’ll see is language neutral, but I’ll focus on JavaScript with a dash of C. Let’s start with a simple C program that reads a song and a band name and outputs them back to the user:

(stackFolly.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <stdio.h>
#include <string.h>

char *read()
{
    char data[64];
    fgets(data, 64, stdin);
    return data;
}

int main(int argc, char *argv[])
{
    char *song, *band;

    puts("Enter song, then band:");
    song = read();
    band = read();

    printf("\n%sby %s", song, band);

    return 0;
}

If you run this gem, here’s what you get (=> denotes program output):

1
2
3
4
5
6
7
./stackFolly
=> Enter song, then band:
The Past is a Grotesque Animal
of Montreal

=> ?ǿontreal
=> by ?ǿontreal

Ayeee! Where did things go so wrong? (Said every C beginner, ever.)

It turns out that the contents of a function’s stack variables are only valid while the stack frame is active, that is, until the function returns. Upon return, the memory used by the stack frame is deemed free and liable to be overwritten in the next function call.

Below is exactly what happens in this case. The diagrams now have image maps, so you can click on a piece of data to see the relevant gdb output (gdb commands are here). As soon as read() is done with the song name, the stack is thus:

At this point, the song variable actually points to the song name. Sadly, the memory storing that string is ready to be reused by the stack frame of whatever function is called next. In this case, read() is called again, with the same stack frame layout, so the result is this:

The band name is read into the same memory location and overwrites the previously stored song name. band and song end up pointing to the exact same spot. Finally, we didn’t even get “of Montreal” output correctly. Can you guess why?

And so it happens that the stack, for all its usefulness, has this serious limitation. It cannot be used by a function to store data that needs to outlive the function’s execution. You must resort to the heap and say goodbye to the hot caches, deterministic instantaneous operations, and easily computed offsets. On the plus side, it works:

The price is you must now remember to free() memory or take a performance hit on a garbage collector, which finds unused heap objects and frees them. That’s the fundamental tradeoff between stack and heap: performance vs. flexibility.

Most languages’ virtual machines take a middle road that mirrors what C programmers do. The stack is used for value types, things like integers, floats and booleans. These are stored directly in local variables and object fields as a sequence of bytes specifying a value (like argc above). In contrast, heap inhabitants are reference types such as strings and objects. Variables and fields contain a memory address that references these objects, like song and band above.

Consider this JavaScript function:

1
2
3
4
5
function fn()
{
    var a = 10;
    var b = { name: 'foo', n: 10 };
}

This might produce the following:

I say “might” because specific behaviors depend heavily on implementation. This post takes a V8-centric approach with many diagram shapes linking to relevant source code. In V8, only small integers are stored as values. Also, from now on I’ll show strings directly in objects to reduce visual noise, but keep in mind they exist separately in the heap, as shown above.

Now let’s take a look at closures, which are simple but get weirdly hyped up and mythologized. Take a trivial JS function:

1
2
3
4
5
function add(a, b)
{
        var c = a + b;
        return c;
}

This function defines a lexical scope, a happy little kingdom where the names a, b, and c have precise meanings. They are the two parameters and one local variable declared by the function. The program might use those same names elsewhere, but within add that’s what they refer to. And while lexical scope is a fancy term, it aligns well with our intuitive understanding: after all, we can quite literally see the bloody thing, much as a lexer does, as a textual block in the program’s source.

Having seen stack frames in action, it’s easy to imagine an implementation for this name specificity. Within add, these names refer to stack locations private to each running instance of the function. That’s in fact how it often plays out in a VM.

So let’s nest two lexical scopes:

1
2
3
4
5
6
7
8
9
function makeGreeter()
{
    return function hi(name) {
        console.log('hi, ' + name);
    }
}

var hi = makeGreeter();
hi('dear reader'); // prints "hi, dear reader"

That’s more interesting. Function hi is built at runtime within makeGreeter. It has its own lexical scope, where name is an argument on the stack, but visually it sure looks like it can access its parent’s lexical scope as well, which it can. Let’s take advantage of that:

1
2
3
4
5
6
7
8
9
function makeGreeter(greeting)
{
    return function greet(name) {
        console.log(greeting + ', ' + name);
    }
}

var heya = makeGreeter('HEYA');
heya('dear reader'); // prints "HEYA, dear reader"

A little strange, but pretty cool. There’s something about it though that violates our intuition: greeting sure looks like a stack variable, the kind that should be dead after makeGreeter() returns. And yet, since greet() keeps working, something funny is going on. Enter the closure:

The VM allocated an object to store the parent variable used by the inner greet(). It’s as if makeGreeter’s lexical scope had been closed over at that moment, crystallized into a heap object for as long as needed (in this case, the lifetime of the returned function). Hence the name closure, which makes a lot of sense when you see it that way. If more parent variables had been used (or captured), the Context object would have more properties, one per captured variable. Naturally, the code emitted for greet() knows to read greeting from the Context object, rather than expect it on the stack.

Here’s a fuller example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
function makeGreeter(greetings)
{
    var count = 0;
    var greeter = {};

    for (var i = 0; i < greetings.length; i++) {
        var greeting = greetings[i];

        greeter[greeting] = function(name) {
            count++;
            console.log(greeting + ', ' + name);
        }
    }

    greeter.count = function() { return count; }

    return greeter;
}

var greeter = makeGreeter(["hi", "hello", "howdy"])
greeter.hi('poppet'); // prints "howdy, poppet"
greeter.hello('darling'); // prints "howdy, darling"
greeter.count(); // returns 2

Well… count() works, but our greeter is stuck in howdy. Can you tell why? What we’re doing with count is a clue: even though the lexical scope is closed over into a heap object, the values taken by the variables (or object properties) can still be changed. Here’s what we have:

There is one common context shared by all functions. That’s why count works. But the greeting is also being shared, and it was set to the last value iterated over, “howdy” in this case. That’s a pretty common error, and the easiest way to avoid it is to introduce a function call to take the closed-over variable as an argument. In CoffeeScript, the do command provides an easy way to do so. Here’s a simple solution for our greeter:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
function makeGreeter(greetings)
{
    var count = 0;
    var greeter = {};

    greetings.forEach(function(greeting) {
        greeter[greeting] = function(name) {
            count++;
            console.log(greeting + ', ' + name);
        }
    });

    greeter.count = function() { return count; }

    return greeter;
}

var greeter = makeGreeter(["hi", "hello", "howdy"])
greeter.hi('poppet'); // prints "hi, poppet"
greeter.hello('darling'); // prints "hello, darling"
greeter.count(); // returns 2

It now works, and the result becomes:

That’s a lot of arrows! But here’s the interesting feature: in our code, we closed over two nested lexical contexts, and sure enough we get two linked Context objects in the heap. You could nest and close over many lexical contexts, Russian-doll style, and you end up with essentially a linked list of all these Context objects.

Of course, just as you can implement TCP over carrier pigeons, there are many ways to implement these language features. For example, the ES6 spec defines lexical environments as consisting of an environment record (roughly, the local identifiers within a block) plus a link to an outer environment record, allowing the nesting we have seen. The logical rules are nailed by the spec (one hopes), but it’s up to the implementation to translate them into bits and bytes.

You can also inspect the assembly code produced by V8 for specific cases. Vyacheslav Egorov has great posts and explains this process along with V8 closure internals in detail. I’ve only started studying V8, so pointers and corrections are welcome. If you know C#, inspecting the IL code emitted for closures is enlightening – you will see the analog of V8 Contexts explicitly defined and instantiated.

Closures are powerful beasts. They provide a succinct way to hide information from a caller while sharing it among a set of functions. I love that they truly hide your data: unlike object fields, callers cannot access or even see closed-over variables. Keeps the interface cleaner and safer.

But they’re no silver bullet. Sometimes an object nut and a closure fanatic will argue endlessly about their relative merits. Like most tech discussions, it’s often more about ego than real tradeoffs. At any rate, this epic koan by Anton van Straaten settles the issue:

The venerable master Qc Na was walking with his student, Anton. Hoping to
prompt the master into a discussion, Anton said “Master, I have heard that
objects are a very good thing - is this true?” Qc Na looked pityingly at
his student and replied, “Foolish pupil - objects are merely a poor man’s
closures.”

Chastised, Anton took his leave from his master and returned to his cell,
intent on studying closures. He carefully read the entire “Lambda: The
Ultimate…” series of papers and its cousins, and implemented a small
Scheme interpreter with a closure-based object system. He learned much, and
looked forward to informing his master of his progress.

On his next walk with Qc Na, Anton attempted to impress his master by
saying “Master, I have diligently studied the matter, and now understand
that objects are truly a poor man’s closures.” Qc Na responded by hitting
Anton with his stick, saying “When will you learn? Closures are a poor man’s
object.” At that moment, Anton became enlightened.

And that closes our stack series. In the future I plan to cover other language implementation topics like object binding and vtables. But the call of the kernel is strong, so there’s an OS post coming out tomorrow. I invite you to subscribe and follow me.

Tail Calls, Optimization, and ES6

In this penultimate post about the stack, we take a quick look at tail calls, compiler optimizations, and the proper tail calls landing in the newest version of JavaScript.

A tail call happens when a function F makes a function call as its final action. At that point F will do absolutely no more work: it passes the ball to whatever function is being called and vanishes from the game. This is notable because it opens up the possibility of tail call optimization: instead of creating a new stack frame for the function call, we can simply reuse F’s stack frame, thereby saving stack space and avoiding the work involved in setting up a new frame. Here are some examples in C and their results compiled with mild optimization:

Simple Tail Calls (tail.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
int add5(int a)
{
        return a + 5;
}

int add10(int a)
{
        int b = add5(a); // not tail
        return add5(b); // tail
}

int add5AndTriple(int a) {
        int b = add5(a); // not tail
        return 3 * add5(a); // not tail, doing work after the call
}

int finicky(int a) {
        if (a > 10) {
                return add5AndTriple(a); // tail
        }

        if (a > 5) {
                int b = add5(a); // not tail
                return finicky(b); // tail
        }

        return add10(a); // tail
}

You can normally spot tail call optimization (hereafter, TCO) in compiler output by seeing a jump instruction where a call would have been expected. At runtime TCO leads to a reduced call stack.

A common misconception is that tail calls are necessarily recursive. That’s not the case: a tail call may be recursive, such as in finicky() above, but it need not be. As long as caller F is completely done at the call site, we’ve got ourselves a tail call. Whether it can be optimized is a different question whose answer depends on your programming environment.

“Yes, it can, always!” is the best answer we can hope for, which is famously the case for Scheme, as discussed in SICP (by the way, if when you program you don’t feel like “a Sorcerer conjuring the spirits of the computer with your spells,” I urge you to read that book). It’s also the case for Lua. And most importantly, it is the case for the next version of JavaScript, ES6, whose spec does a good job defining tail position and clarifying the few conditions required for optimization, such as strict mode. When a language guarantees TCO, it supports proper tail calls.

Now some of us can’t kick that C habit, heart bleed and all, and the answer there is a more complicated “sometimes” that takes us into compiler optimization territory. We’ve seen the simple examples above; now let’s resurrect our factorial from last post:

Recursive Factorial (factorial.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <stdio.h>

int factorial(int n)
{
  int previous = 0xdeadbeef; 

  if (n == 0 || n == 1) {
      return 1;
  }

  previous = factorial(n-1);
  return n * previous;
}

int main(int argc)
{
  int answer = factorial(5);
  printf("%d\n", answer);
}

So, is line 11 a tail call? It’s not, because of the multiplication by n afterwards. But if you’re not used to optimizations, gcc’s result with O2 optimization might shock you: not only it transforms factorial into a recursion-free loop, but the factorial(5) call is eliminated entirely and replaced by a compile-time constant of 120 (5! == 120). This is why debugging optimized code can be hard sometimes. On the plus side, if you call this function it will use a single stack frame regardless of n’s initial value. Compiler algorithms are pretty fun, and if you’re interested I suggest you check out Building an Optimizing Compiler and ACDI.

However, what happened here was not tail call optimization, since there was no tail call to begin with. gcc outsmarted us by analyzing what the function does and optimizing away the needless recursion. The task was made easier by the simple, deterministic nature of the operations being done. By adding a dash of chaos (e.g., getpid()) we can throw gcc off:

Recursive PID Factorial (pidFactorial.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int pidFactorial(int n)
{
        if (1 == n) {
                return getpid(); // tail
        }

        return n * pidFactorial(n-1) * getpid(); // not tail
}

int main(int argc)
{
        int answer = pidFactorial(5);
        printf("%d\n", answer);
}

Optimize that, unix fairies! So now we have a regular recursive call and this function allocates O(n) stack frames to do its work. Heroically, gcc still does TCO for getpid in the recursion base case. If we now wished to make this function tail recursive, we’d need a slight change:

(tailPidFactorial.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int tailPidFactorial(int n, int acc)
{
        if (1 == n) {
                return acc * getpid(); // not tail
        }

        acc = (acc * getpid() * n);
        return tailPidFactorial(n-1, acc); // tail
}

int main(int argc)
{
        int answer = tailPidFactorial(5, 1);
        printf("%d\n", answer);
}

The accumulation of the result is now a loop and we’ve achieved true TCO. But before you go out partying, what can we say about the general case in C? Sadly, while good C compilers do TCO in a number of cases, there are many situations where they cannot do it. For example, as we saw in our function epilogues, the caller is responsible for cleaning up the stack after a function call using the standard C calling convention. So if function F takes two arguments, it can only make TCO calls to functions taking two or fewer arguments. This is one among many restrictions. Mark Probst wrote an excellent thesis discussing Proper Tail Recursion in C where he discusses these issues along with C stack behavior. He also does insanely cool juggling.

“Sometimes” is a rocky foundation for any relationship, so you can’t rely on TCO in C. It’s a discrete optimization that may or may not take place, rather than a language feature like proper tail calls, though in practice the compiler will optimize the vast majority of cases. But if you must have it, say for transpiling Scheme into C, you will suffer.

Since JavaScript is now the most popular transpilation target, proper tail calls become even more important there. So kudos to ES6 for delivering it along with many other significant improvements. It’s like Christmas for JS programmers.

This concludes our brief tour of tail calls and compiler optimization. Thanks for reading and see you next time.

Recursion: Dream Within a Dream

Recursion is magic, but it suffers from the most awkward introduction in programming books. They’ll show you a recursive factorial implementation, then warn you that while it sort of works it’s terribly slow and might crash due to stack overflows. “You could always dry your hair by sticking your head into the microwave, but watch out for intracranial pressure and head explosions. Or you can use a towel.” No wonder people are suspicious of it. Which is too bad, because recursion is the single most powerful idea in algorithms.

Let’s take a look at the classic recursive factorial:

Recursive Factorial - factorial.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <stdio.h>

int factorial(int n)
{
        int previous = 0xdeadbeef;

        if (n == 0 || n == 1) {
                return 1;
        }

        previous = factorial(n-1);
        return n * previous;
}

int main(int argc)
{
        int answer = factorial(5);
        printf("%d\n", answer);
}

The idea of a function calling itself is mystifying at first. To make it concrete, here is exactly what is on the stack when factorial(5) is called and reaches n == 1:

Each call to factorial generates a new stack frame. The creation and destruction of these stack frames is what makes the recursive factorial slower than its iterative counterpart. The accumulation of these frames before the calls start returning is what can potentially exhaust stack space and crash your program.

These concerns are often theoretical. For example, the stack frames for factorial take 16 bytes each (this can vary depending on stack alignment and other factors). If you are running a modern x86 Linux kernel on a computer, you normally have 8 megabytes of stack space, so factorial could handle n up to ~512,000. This is a monstrously large result that takes 8,971,833 bits to represent, so stack space is the least of our problems: a puny integer – even a 64-bit one – will overflow tens of thousands of times over before we run out of stack space.

We’ll look at CPU usage in a moment, but for now let’s take a step back from the bits and bytes and look at recursion as a general technique. Our factorial algorithm boils down to pushing integers N, N-1, … 1 onto a stack, then multiplying them in reverse order. The fact we’re using the program’s call stack to do this is an implementation detail: we could allocate a stack on the heap and use that instead. While the call stack does have special properties, it’s just another data structure at your disposal. I hope the diagram makes that clear.

Once you see the call stack as a data structure, something else becomes clear: piling up all those integers to multiply them afterwards is one dumbass idea. That is the real lameness of this implementation: it’s using a screwdriver to hammer a nail. It’s far more sensible to use an iterative process to calculate factorials.

But there are plenty of screws out there, so let’s pick one. There is a traditional interview question where you’re given a mouse in a maze, and you must help the mouse search for cheese. Suppose the mouse can turn either left or right in the maze. How would you model and solve this problem?

Like most problems in life, you can reduce this rodent quest to a graph, in particular a binary tree where the nodes represent positions in the maze. You could then have the mouse attempt left turns whenever possible, and backtrack to turn right when it reaches a dead end. Here’s the mouse walk in an example maze:

Each edge (line) is a left or right turn taking our mouse to a new position. If either turn is blocked, the corresponding edge does not exist. Now we’re talking! This process is inherently recursive whether you use the call stack or another data structure. But using the call stack is just so easy:

Recursive Maze Solver (maze.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
#include <stdio.h>
#include "maze.h"

int explore(maze_t *node)
{
  int found = 0;

    if (node == NULL) {
        return 0;
    }

    if (node->hasCheese) {
        return 1; // found cheese
    }

  found = explore(node->left) || explore(node->right);
  return found;
}

int main(int argc)
{
        int found = explore(&maze);
}

Below is the stack when we find the cheese in maze.c:13. You can also see the detailed GDB output and commands used to gather data.

This shows recursion in a much better light because it’s a suitable problem. And that’s no oddity: when it comes to algorithms, recursion is the rule, not the exception. It comes up when we search, when we traverse trees and other data structures, when we parse, when we sort: it’s everywhere. You know how pi or e come up in math all the time because they’re in the foundations of the universe? Recursion is like that: it’s in the fabric of computation.

Steven Skienna’s excellent Algorithm Design Manual is a great place to see that in action as he works through his “war stories” and shows the reasoning behind algorithmic solutions to real-world problems. It’s the best resource I know of to develop your intuition for algorithms. Another good read is McCarthy’s original paper on LISP. Recursion is both in its title and in the foundations of the language. The paper is readable and fun, it’s always a pleasure to see a master at work.

Back to the maze. While it’s hard to get away from recursion here, it doesn’t mean it must be done via the call stack. You could for example use a string like RRLL to keep track of the turns, and rely on the string to decide on the mouse’s next move. Or you can allocate something else to record the state of the cheese hunt. You’d still be implementing a recursive process, but rolling your own data structure.

That’s likely to be more complex because the call stack fits like a glove. Each stack frame records not only the current node, but also the state of computation in that node (in this case, whether we’ve taken only the left, or are already attempting the right). Hence the code becomes trivial. Yet we sometimes give up this sweetness for fear of overflows and hopes of performance. That can be foolish.

As we’ve seen, the stack is large and frequently other constraints kick in before stack space does. One can also check the problem size and ensure it can be handled safely. The CPU worry is instilled chiefly by two widespread pathological examples: the dumb factorial and the hideous O(2n) recursive Fibonacci without memoization. These are not indicative of sane stack-recursive algorithms.

The reality is that stack operations are fast. Often the offsets to data are known exactly, the stack is hot in the caches, and there are dedicated instructions to get things done. Meanwhile, there is substantial overhead involved in using your own heap-allocated data structures. It’s not uncommon to see people write something that ends up more complex and less performant than call-stack recursion. Finally, modern CPUs are pretty good and often not the bottleneck. Be careful about sacrificing simplicity and as always with performance, measure.

The next post is the last in this stack series, and we’ll look at Tail Calls, Closures, and Other Fauna. Then it’ll be time to visit our old friend, the Linux kernel. Thanks for reading!

Epilogues, Canaries, and Buffer Overflows

Last week we looked at how the stack works and how stack frames are built during function prologues. Now it’s time to look at the inverse process as stack frames are destroyed in function epilogues. Let’s bring back our friend add.c:

Simple Add Program - add.c
1
2
3
4
5
6
7
8
9
10
11
int add(int a, int b)
{
  int result = a + b;
  return result;
}

int main(int argc)
{
  int answer;
  answer = add(40, 2);
}

We’re executing line 4, right after the assignment of a + b into result. This is what happens:

The first instruction is redundant and a little silly because we know eax is already equal to result, but this is what you get with optimization turned off. The leave instruction then runs, doing two tasks for the price of one: it resets esp to point to the start of the current stack frame, and then restores the saved ebp value. These two operations are logically distinct and thus are broken up in the diagram, but they happen atomically if you’re tracing with a debugger.

After leave runs the previous stack frame is restored. The only vestige of the call to add is the return address on top of the stack. It contains the address of the instruction in main that must run after add is done. The ret instruction takes care of it: it pops the return address into the eip register, which points to the next instruction to be executed. The program has now returned to main, which resumes:

main copies the return value from add into local variable answer and then runs its own epilogue, which is identical to any other. Again the only peculiarity in main is that the saved ebp is null, since it is the first stack frame in our code. In the last step, execution has been returned to the C runtime (libc), which will exit to the operating system. Here’s a diagram with the full return sequence for those who need it.

You now have an excellent grasp of how the stack operates, so let’s have some fun and look at one of the most infamous hacks of all time: exploiting the stack buffer overflow. Here is a vulnerable program:

Vulnerable Program - buffer.c
1
2
3
4
5
6
7
8
9
10
void doRead()
{
        char buffer[28];
        gets(buffer);
}

int main(int argc)
{
        doRead();
}

The code above uses gets to read from standard input. gets keeps reading until it encounters a newline or end of file. Here’s what the stack looks like after a string has been read:

The problem here is that gets is unaware of buffer’s size: it will blithely keep reading input and stuffing data into the stack beyond buffer, obliterating the saved ebp value, return address, and whatever else is below. To exploit this behavior, attackers craft a precise payload and feed it into the program. This is what the stack looks like during an attack, after the call to gets:

The basic idea is to provide malicious assembly code to be executed and overwrite the return address on the stack to point to that code. It is a bit like a virus invading a cell, subverting it, and introducing some RNA to further its goals.

And like a virus, the exploit’s payload has many notable features. It starts with several nop instructions to increase the odds of successful exploitation. This is because the return address is absolute and must be guessed, since attackers don’t know exactly where in the stack their code will be stored. But as long as they land on a nop, the exploit works: the processor will execute the nops until it hits the instructions that do work.

The exec /bin/sh symbolizes raw assembly instructions that execute a shell (imagine for example that the vulnerability is in a networked program, so the exploit might provide shell access to the system). The idea of feeding raw assembly to a program expecting a command or user input is shocking at first, but that’s part of what makes security research so fun and mind-expanding. To give you an idea of how weird things get, sometimes the vulnerable program calls tolower or toupper on its inputs, forcing attackers to write assembly instructions whose bytes do not fall into the range of upper- or lower-case ascii letters.

Finally, attackers repeat the guessed return address several times, again to tip the odds ever in their favor. By starting on a 4-byte boundary and providing multiple repeats, they are more likely to overwrite the original return address on the stack.

Thankfully, modern operating systems have a host of protections against buffer overflows, including non-executable stacks and stack canaries. The “canary” name comes from the canary in a coal mine expression, an addition to computer science’s rich vocabulary. In the words of Steve McConnell:

Computer science has some of the most colorful language of any field. In what other field can you walk into a sterile room, carefully controlled at 68°F, and find viruses, Trojan horses, worms, bugs, bombs, crashes, flames, twisted sex changers, and fatal errors?

Steve McConnell Code Complete 2

At any rate, here’s what a stack canary looks like:

Canaries are implemented by the compiler. For example, GCC’s stack-protector option causes canaries to be used in any function that is potentially vulnerable. The function prologue loads a magic value into the canary location, and the epilogue makes sure the value is intact. If it’s not, a buffer overflow (or bug) likely happened and the program is aborted via __stack_chk_fail. Due to their strategic location on the stack, canaries make the exploitation of stack buffer overflows much harder.

This finishes our journey within the depths of the stack. We don’t want to delve too greedily and too deep. Next week we’ll go up a notch in abstraction to take a good look at recursion, tail calls and other tidbits, probably using Google’s V8. To end this epilogue and prologue talk, I’ll close with a cherished quote inscribed on a monument in the American National Archives:

Journey to the Stack, Part I

Earlier we’ve explored the anatomy of a program in memory, the landscape of how our programs run in a computer. Now we turn to the call stack, the work horse in most programming languages and virtual machines. Along the way we’ll meet fantastic creatures like closures, recursion, and buffer overflows. But the first step is a precise picture of how the stack operates.

The stack is so important because it keeps track of the functions running in a program, and functions are in turn the building blocks of software. In fact, the internal operation of programs is normally very simple. It consists mostly of functions pushing data onto and popping data off the stack as they call each other, while allocating memory on the heap for data that must survive across function calls. This is true for both low-level C software and VM-based languages like JavaScript and C#. A solid grasp of this reality is invaluable for debugging, performance tuning and generally knowing what the hell is going on.

When a function is called, a stack frame is created to support the function’s execution. The stack frame contains the function’s local variables and the arguments passed to the function by its caller. The frame also contains housekeeping information that allows the called function (the callee) to return to the caller safely. The exact contents and layout of the stack vary by processor architecture and function call convention. In this post we look at Intel x86 stacks using C-style function calls (cdecl). Here’s a single stack frame sitting live on top of the stack:

Right away, three CPU registers burst into the scene. The stack pointer, esp, points to the top of the stack. The top is always occupied by the last item that was pushed onto the stack but has not yet been popped off, just as in a real-world stack of plates or $100 bills.

The address stored in esp constantly changes as stack items are pushed and popped, such that it always points to the last item. Many CPU instructions automatically update esp as a side effect, and it’s impractical to use the stack without this register.

In the Intel architecture, as in most, the stack grows towards lower memory addresses. So the “top” is the lowest memory address in the stack containing live data: local_buffer in this case. Notice there’s nothing vague about the arrow from esp to local_buffer. This arrow means business: it points specifically to the first byte occupied by local_buffer because that is the exact address stored in esp.

The second register tracking the stack is ebp, the base pointer or frame pointer. It points to a fixed location within the stack frame of the function currently running and provides a stable reference point (base) for access to arguments and local variables. ebp changes only when a function call begins or ends. Thus we can easily address each item in the stack as an offset from ebp, as shown in the diagram.

Unlike esp, ebp is mostly maintained by program code with little CPU interference. Sometimes there are performance benefits in ditching ebp altogether, which can be done via compiler flags. The Linux kernel is one example where this is done.

Finally, the eax register is used by convention to transfer return values back to the caller for most C data types.

Now let’s inspect the data in our stack frame. These diagram shows precise byte-for-byte contents as you’d see in a debugger, with memory growing left-to-right, top-to-bottom. Here it is:

The local variable local_buffer is a byte array containing a null-terminated ascii string, a staple of C programs. The string was likely read from somewhere, for example keyboard input or a file, and it is 7 bytes long. Since local_buffer can hold 8 bytes, there’s 1 free byte left. The content of this byte is unknown because in the stack’s infinite dance of pushes and pops, you never know what memory holds unless you write to it. Since the C compiler does not initialize the memory for a stack frame, contents are undetermined – and somewhat random – until written to. This has driven some into madness.

Moving on, local1 is a 4-byte integer and you can see the contents of each byte. It looks like a big number, with all those zeros following the 8, but here your intuition leads you astray.

Intel processors are little endian machines, meaning that numbers in memory start with the little end first. So the least significant byte of a multi-byte number is in the lowest memory address. Since that is normally shown leftmost, this departs from our usual representation of numbers. It helps to know that this endian talk is borrowed from Gulliver’s Travels: just as folks in Lilliput eat their eggs starting from the little end, Intel processors eat their numbers starting from the little byte.

So local1 in fact holds the number 8, as in the legs of an octopus. param1, however, has a value of 2 in the second byte position, so its mathematical value is 2 * 256 = 512 (we multiply by 256 because each place value ranges from 0 to 255). Meanwhile, param2 is carrying weight at 1 * 256 * 256 = 65536.

The housekeeping data in this stack frame consists of two crucial pieces: the address of the previous stack frame (saved ebp) and the address of the instruction to be executed upon the function’s exit (the return address). Together, they make it possible for the function to return sanely and for the program to keep running along.

Now let’s see the birth of a stack frame to build a clear mental picture of how this all works together. Stack growth is puzzling at first because it happens in the opposite direction you’d expect. For example, to allocate 8 bytes on the stack one subtracts 8 from esp, and subtraction is an odd way to grow something.

Let’s take a simple C program:

Simple Add Program - add.c
1
2
3
4
5
6
7
8
9
10
11
int add(int a, int b)
{
  int result = a + b;
  return result;
}

int main(int argc)
{
  int answer;
  answer = add(40, 2);
}

Suppose we run this in Linux without command-line parameters. When you run a C program, the first code to actually execute is in the C runtime library, which then calls our main function. The diagrams below show step-by-step what happens as the program runs. Each diagram links to GDB output showing the state of memory and registers. You may also see the GDB commands used and the whole GDB output. Here we go:

Steps 2 and 3, along with 4 below, are the function prologue, which is common to nearly all functions: the current value of ebp is saved to the top of the stack, and then esp is copied to ebp, establishing a new frame. main’s prologue is like any other, but with the peculiarity that ebp is zeroed out when the program starts.

If you were to inspect the stack below argc (to the right) you’d find more data, including pointers to the program name and command-line parameters (the traditional C argv), plus pointers to Unix environment variables and their actual contents. But that’s not important here, so the ball keeps rolling towards the add() call:

After main subtracts 12 from esp to get the stack space it needs, it sets the values for a and b. Values in memory are shown in hex and little-endian format, as you’d see in a debugger. Once parameter values are set, main calls add and it starts running:

Now there’s some excitement! We get another prologue, but this time you can see clearly how the stack frames form a linked list, starting at ebp and going down the stack. This is how debuggers and Exception objects in higher-level languages get their stack traces. You can also see the much more typical catching up of ebp to esp when a new frame is born. And again, we subtract from esp to get more stack space.

There’s also the slightly weird reversal of bytes when the ebp register value is copied to memory. What’s happening here is that registers don’t really have endianness: there are no “growing addresses” inside a register as there are for memory. Thus by convention debuggers show register values in the most natural format to humans: most significant to least significant digits. So the results of the copy in a little-endian machine appear reversed in the usual left-to-right notation for memory. I want the diagrams to provide an accurate picture of what you find in the trenches, so there you have it.

With the hard part behind us, we add:

There are guest register appearances to help out with the addition, but otherwise no alarms and no surprises. add did its job, and at this point the stack action would go in reverse, but we’ll save that for next time.

Anybody who’s read this far deserves a souvenir, so I’ve made a large diagram showing all the steps combined in a fit of nerd pride.

It looks tame once it’s all laid out. Those little boxes help a lot. In fact, little boxes are the chief tool of computer science. I hope the pictures and register movements provide an intuitive mental picture that integrates stack growth and memory contents. Up close, our software doesn’t look too far from a simple Turing machine.

This concludes the first part of our stack tour. There’s some more byte spelunking ahead, and then it’s on to see higher level programming concepts built on this foundation. See you next week.