Quick Note on Diagrams and the Blog

People often ask me what tool I use to make the diagrams in my Software Illustrated posts. I use MS Visio 2007. It has a ‘themes’ feature that allows you to set fill and line options that apply to all the shapes in a diagram, making it faster to produce decent looking things. It still takes a surprising amount of work to get good pictures, but overall I’m pretty happy.

Also, I have tried to use colors to convey meaning. They’re not just for pretty. For example, memory colors follow these conventions across all diagrams:

Legend for memory colors

These colors hold from the earliest post about memory to the latest. This convention is why the post about Intel CPU caches shows a blue index for the virtually indexed L1 cache. So far I’ve written a lot about kernel and x86 internals, but that’s sort of a coincidence. I’m a generalist, not an OS guy; there’s a wide range of CS topics I hope to write about. (All this internals talk though made me want to write Linux kernel code again. I might look for some subsystem or driver to work on. What’s that sleep suppression pill again?)

Finally, in the next couple of months I plan to change my blog template. The new one will have a hand-maintained ‘Archive by Topic’ page to serve as a coherent index to all posts, plus other usability improvements. I hate the current site as far as that goes. I can handle the logic and markup, but if anyone out there is interested in doing a small design/CSS job on this blog, please drop me a line. I also have a quick question. Many people access the site via iPhones and other mobile devices. How does image width impact you? Would it be painful if diagrams were wider than their current 700-pixel limit? I’d appreciate input on this and suggestions in general. Thanks! I’m off to check out the Denver LAMP meetup. Here’s a good song if you’re bored.

Anatomy of a Program in Memory

Memory management is the heart of operating systems; it is crucial for both programming and system administration. In the next few posts I’ll cover memory with an eye towards practical aspects, but without shying away from internals. While the concepts are generic, examples are mostly from Linux and Windows on 32-bit x86. This first post describes how programs are laid out in memory.

Each process in a multi-tasking OS runs in its own memory sandbox. This sandbox is the virtual address space, which in 32-bit mode is always a 4GB block of memory addresses. These virtual addresses are mapped to physical memory by page tables, which are maintained by the operating system kernel and consulted by the processor. Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel:

Kernel/User Memory Split

This does not mean the kernel uses that much physical memory, only that it has that portion of address space available to map whatever physical memory it wishes. Kernel space is flagged in the page tables as exclusive to privileged code (ring 2 or lower), hence a page fault is triggered if user-mode programs try to touch it. In Linux, kernel space is constantly present and maps the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts or system calls at any time. By contrast, the mapping for the user-mode portion of the address space changes whenever a process switch happens:

Process Switch Effects on Virtual Memory

Blue regions represent virtual addresses that are mapped to physical memory, whereas white regions are unmapped. In the example above, Firefox has used far more of its virtual address space due to its legendary memory hunger. The distinct bands in the address space correspond to memory segments like the heap, stack, and so on. Keep in mind these segments are simply a range of memory addresses and have nothing to do with Intel-style segments. Anyway, here is the standard segment layout in a Linux process:

Flexible Process Address Space Layout In Linux

When computing was happy and safe and cuddly, the starting virtual addresses for the segments shown above were exactly the same for nearly every process in a machine. This made it easy to exploit security vulnerabilities remotely. An exploit often needs to reference absolute memory locations: an address on the stack, the address for a library function, etc. Remote attackers must choose this location blindly, counting on the fact that address spaces are all the same. When they are, people get pwned. Thus address space randomization has become popular. Linux randomizes the stack, memory mapping segment, and heap by adding offsets to their starting addresses. Unfortunately the 32-bit address space is pretty tight, leaving little room for randomization and hampering its effectiveness.

The topmost segment in the process address space is the stack, which stores local variables and function parameters in most programming languages. Calling a method or function pushes a new stack frame onto the stack. The stack frame is destroyed when the function returns. This simple design, possible because the data obeys strict LIFO order, means that no complex data structure is needed to track stack contents – a simple pointer to the top of the stack will do. Pushing and popping are thus very fast and deterministic. Also, the constant reuse of stack regions tends to keep active stack memory in the CPU caches, speeding up access. Each thread in a process gets its own stack.

It is possible to exhaust the area mapping the stack by pushing more data than it can fit. This triggers a page fault that is handled in Linux by expand_stack(), which in turn calls acct_stack_growth() to check whether it’s appropriate to grow the stack. If the stack size is below RLIMIT_STACK (usually 8MB), then normally the stack grows and the program continues merrily, unaware of what just happened. This is the normal mechanism whereby stack size adjusts to demand. However, if the maximum stack size has been reached, we have a stack overflow and the program receives a Segmentation Fault. While the mapped stack area expands to meet demand, it does not shrink back when the stack gets smaller. Like the federal budget, it only expands.

Dynamic stack growth is the only situation in which access to an unmapped memory region, shown in white above, might be valid. Any other access to unmapped memory triggers a page fault that results in a Segmentation Fault. Some mapped areas are read-only, hence write attempts to these areas also lead to segfaults.

Below the stack, we have the memory mapping segment. Here the kernel maps contents of files directly to memory. Any application can ask for such a mapping via the Linux mmap() system call (implementation) or CreateFileMapping() / MapViewOfFile() in Windows. Memory mapping is a convenient and high-performance way to do file I/O, so it is used for loading dynamic libraries. It is also possible to create an anonymous memory mapping that does not correspond to any files, being used instead for program data. In Linux, if you request a large block of memory via malloc(), the C library will create such an anonymous mapping instead of using heap memory. ‘Large’ means larger than MMAP_THRESHOLD bytes, 128 kB by default and adjustable via mallopt().
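As a rough illustration of an anonymous mapping, here is a minimal sketch using Python's standard mmap module, where passing -1 as the file descriptor asks for memory that is not backed by any file. The helper name anonymous_scratch is made up for this example, and this goes through Python's wrapper rather than the malloc() threshold logic described above.

```python
import mmap

def anonymous_scratch(size=mmap.PAGESIZE):
    """Create an anonymous mapping (no backing file) of `size` bytes."""
    # fileno=-1 requests memory that does not correspond to any file.
    return mmap.mmap(-1, size)

region = anonymous_scratch()
region[0:5] = b"hello"      # the mapping behaves like a mutable byte buffer
first_bytes = region[0:5]
region.close()
```

The mapping is used for plain program data, which is the role the text gives to anonymous mappings.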

Speaking of the heap, it comes next in our plunge into address space. Like the stack, the heap provides runtime memory allocation; unlike the stack, it is meant for data that must outlive the function doing the allocation. Most languages provide heap management to programs. Satisfying memory requests is thus a joint affair between the language runtime and the kernel. In C, the interface to heap allocation is malloc() and friends, whereas in a garbage-collected language like C# the interface is the new keyword.

If there is enough space in the heap to satisfy a memory request, it can be handled by the language runtime without kernel involvement. Otherwise the heap is enlarged via the brk() system call (implementation) to make room for the requested block. Heap management is complex, requiring sophisticated algorithms that strive for speed and efficient memory usage in the face of our programs’ chaotic allocation patterns. The time needed to service a heap request can vary substantially. Real-time systems have special-purpose allocators to deal with this problem. Heaps also become fragmented, shown below:

Fragmented Heap

Finally, we get to the lowest segments of memory: BSS, data, and program text. Both BSS and data store contents for static (global) variables in C. The difference is that BSS stores the contents of uninitialized static variables, whose values are not set by the programmer in source code. The BSS memory area is anonymous: it does not map any file. If you say static int cntActiveUsers, the contents of cntActiveUsers live in the BSS.

The data segment, on the other hand, holds the contents for static variables initialized in source code. This memory area is not anonymous. It maps the part of the program’s binary image that contains the initial static values given in source code. So if you say static int cntWorkerBees = 10, the contents of cntWorkerBees live in the data segment and start out as 10. Even though the data segment maps a file, it is a private memory mapping, which means that updates to memory are not reflected in the underlying file. This must be the case, otherwise assignments to global variables would change your on-disk binary image. Inconceivable!

The data example in the diagram is trickier because it uses a pointer. In that case, the contents of pointer gonzo – a 4-byte memory address – live in the data segment. The actual string it points to does not, however. The string lives in the text segment, which is read-only and stores all of your code in addition to tidbits like string literals. The text segment also maps your binary file in memory, but writes to this area earn your program a Segmentation Fault. This helps prevent pointer bugs, though not as effectively as avoiding C in the first place. Here’s a diagram showing these segments and our example variables:

ELF Binary Image Mapped Into Memory

You can examine the memory areas in a Linux process by reading the file /proc/pid_of_process/maps. Keep in mind that a segment may contain many areas. For example, each memory mapped file normally has its own area in the mmap segment, and dynamic libraries have extra areas similar to BSS and data. The next post will clarify what ‘area’ really means. Also, sometimes people say “data segment” meaning all of data + bss + heap.
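To make the maps file a bit more concrete, here is a small sketch that parses one line in the /proc/<pid>/maps format (address range, permissions, offset, device, inode, optional path). The MemoryArea type, the parse_maps_line helper, and the sample line are all made up for illustration; the sample is not output from a real process.

```python
from typing import NamedTuple

class MemoryArea(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    path: str

def parse_maps_line(line):
    """Parse one line in the /proc/<pid>/maps format."""
    fields = line.split(None, 5)      # the path, if present, is the 6th field
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return MemoryArea(start, end, fields[1], int(fields[2], 16), path)

# Made-up sample line, for illustration only.
sample = "08048000-0804c000 r-xp 00000000 08:01 1234       /bin/example"
area = parse_maps_line(sample)
print(hex(area.start), hex(area.end), area.perms, area.path)
```

On a Linux machine the same function could be applied to each line of /proc/self/maps; anonymous areas simply come back with an empty path.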

You can examine binary images using the nm and objdump commands to display symbols, their addresses, segments, and so on. Finally, the virtual address layout described above is the “flexible” layout in Linux, which has been the default for a few years. It assumes that we have a value for RLIMIT_STACK. When that’s not the case, Linux reverts to the “classic” layout shown below:

Classic Process Address Space Layout In Linux

That’s it for virtual address space layout. The next post discusses how the kernel keeps track of these memory areas. Coming up we’ll look at memory mapping, how file reading and writing ties into all this and what memory usage figures mean.

Getting Physical With Memory

When trying to understand complex systems, you can often learn a lot by stripping away abstractions and looking at their lowest levels. In that spirit we take a look at memory and I/O ports in their simplest and most fundamental level: the interface between the processor and bus. These details underlie higher level topics like thread synchronization and the need for the Core i7. Also, since I’m a programmer I ignore things EE people care about. Here’s our friend the Core 2 again:

Physical Memory Access

A Core 2 processor has 775 pins, about half of which only provide power and carry no data. Once you group the pins by functionality, the physical interface to the processor is surprisingly simple. The diagram shows the key pins involved in a memory or I/O port operation: address lines, data pins, and request pins. These operations take place in the context of a transaction on the front side bus. FSB transactions go through 5 phases: arbitration, request, snoop, response, and data. Throughout these phases, different roles are played by the components on the FSB, which are called agents. Normally the agents are all the processors plus the northbridge.

We only look at the request phase in this post, in which 2 packets are output by the request agent, who is usually a processor. Here are the juiciest bits of the first packet, output by the address and request pins:

FSB Request Phase, Packet A

The address lines output the starting physical memory address for the transaction. We have 33 bits but they are interpreted as bits 35-3 of an address in which bits 2-0 are zero. Hence we have a 36-bit address, aligned to 8 bytes, for a total of 64GB addressable physical memory. This has been the case since the Pentium Pro. The request pins specify what type of transaction is being initiated; in I/O requests the address pins specify an I/O port rather than a memory address. After the first packet is output, the same pins transmit a second packet in the subsequent bus clock cycle:
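The address arithmetic above can be checked with a short sketch. It only models the numbers given in the text (33 lines carrying bits 35-3, with bits 2-0 implied zero); the function names are invented for this example and nothing here models the actual bus signalling.

```python
ADDRESS_LINES = 33      # lines carrying bits 35..3, per the text
DROPPED_BITS = 3        # bits 2..0 are implied zero (8-byte alignment)

def to_address_lines(physical_address):
    """Value driven on the address lines for an 8-byte aligned address."""
    if physical_address % 8 != 0:
        raise ValueError("address must be aligned to 8 bytes")
    if physical_address >= 1 << (ADDRESS_LINES + DROPPED_BITS):
        raise ValueError("address does not fit in 36 bits")
    return physical_address >> DROPPED_BITS

def from_address_lines(lines):
    """Recover the 36-bit physical address from the 33 line bits."""
    return lines << DROPPED_BITS

addressable_bytes = 1 << (ADDRESS_LINES + DROPPED_BITS)
print(addressable_bytes // 2**30, "GB")   # 64 GB
```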

FSB Request Phase, Packet B

The attribute signals are interesting: they reflect the 5 types of memory caching behavior available in Intel processors. By putting this information on the FSB, the request agent lets other processors know how this transaction affects their caches, and how the memory controller (northbridge) should behave. The processor determines the type of a given memory region mainly by looking at page tables, which are maintained by the kernel.

Typically kernels treat all RAM memory as write-back, which yields the best performance. In write-back mode the unit of memory access is the cache line, 64 bytes in the Core 2. If a program reads a single byte in memory, the processor loads the whole cache line that contains that byte into the L2 and L1 caches. When a program writes to memory, the processor only modifies the line in the cache, but does not update main memory. Later, when it becomes necessary to post the modified line to the bus, the whole cache line is written at once. So most requests have 11 in their length field, for 64 bytes. Here’s a read example in which the data is not in the caches:

Memory Read Sequence Diagram

Some of the physical memory range in an Intel computer is mapped to devices like hard drives and network cards instead of actual RAM memory. This allows drivers to communicate with their devices by writing to and reading from memory. The kernel marks these memory regions as uncacheable in the page tables. Accesses to uncacheable memory regions are reproduced in the bus exactly as requested by a program or driver. Hence it’s possible to read or write single bytes, words, and so on. This is done via the byte enable mask in packet B above.

The primitives discussed here have many implications. For example:

  1. Performance-sensitive applications should try to pack data that is accessed together into the same cache line. Once the cache line is loaded, further reads are much faster and extra RAM accesses are avoided.
  2. Any memory access that falls within a single cache line is guaranteed to be atomic (assuming write-back memory). Such an access is serviced by the processor’s L1 cache and the data is read or written all at once; it cannot be affected halfway by other processors or threads. In particular, 32-bit and 64-bit operations that don’t cross cache line boundaries are atomic.
  3. The front side bus is shared by all agents, who must arbitrate for bus ownership before they can start a transaction. Moreover, all agents must listen to all transactions in order to maintain cache coherence. Thus bus contention becomes a severe problem as more cores and processors are added to Intel computers. The Core i7 solves this by having processors attached directly to memory and communicating in a point-to-point rather than broadcast fashion.
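The first two points both come down to whether an access stays inside one 64-byte line. Here is a minimal sketch of that check, assuming the 64-byte line size mentioned for the Core 2; the helper names are made up for illustration.

```python
CACHE_LINE_SIZE = 64    # bytes, as stated for the Core 2

def cache_line_base(address):
    """Address of the first byte of the cache line containing `address`."""
    return address & ~(CACHE_LINE_SIZE - 1)

def crosses_cache_line(address, size):
    """True if an access of `size` bytes at `address` spans two lines."""
    return cache_line_base(address) != cache_line_base(address + size - 1)

print(crosses_cache_line(0x1000, 8))   # False
print(crosses_cache_line(0x103C, 8))   # True
```

An 8-byte access at offset 60 within a line spills 4 bytes into the next line, so it falls outside the single-line case described in point 2.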

These are the highlights of physical memory requests; the bus will surface again later in connection with locking, multi-threading, and cache coherence. The first time I saw FSB packet descriptions I had a huge “ahhh!” moment so I hope someone out there gets the same benefit. In the next post we’ll go back up the abstraction ladder to take a thorough look at virtual memory.

Programming and the Recession

Back in April of last year I wrote about job numbers and trends for programming languages. Since the newspapers tell me we’re all doomed to the soup line in the near future, I decided to compare the job numbers from last year to what we have now. Here’s the result:

Job Losses Per Language

All the numbers are from Dice.com, a rough measure to be sure, but useful nonetheless. The average decrease in the number of jobs was 40%, which seems pretty bad. I don’t know what the supply side looks like, but I imagine we now have more job seekers as well. Here is the % decrease in number of jobs, by programming language:

% decrease per language

Interesting to see how Python and Ruby held out a bit better, while Perl declined the most. But despite a rocky short term, the overall picture for software engineers looks great according to the US Bureau of Labor Statistics:

  • Computer software engineers are one of the occupations projected to grow the fastest and add the most new jobs over the 2006-16 decade.
  • Excellent job prospects are expected for applicants with at least a bachelor’s degree in computer engineering or computer science and with practical work experience. (…)

Employment change. Employment of computer software engineers is projected to increase by 38 percent over the 2006 to 2016 period, which is much faster than the average for all occupations. This occupation will generate about 324,000 new jobs over the projections decade, one of the largest employment increases of any occupation.

Not bad huh? Back in 2001 I pestered the BLS economists about this, asking them what they thought offshoring effects would be and so on. They came across as truly bullish on programming, which makes sense to me. The degree to which society depends on computers and programmers will only grow from here. Meanwhile, there is a natural barrier to entry when it comes to programming. When explaining the folly of projects that aim to develop software to replace programmers, Scott Westfall put it this way:

Programmers think more logically. Working through if-then-else conditions is a core capability for any programmer. While working with business teams on requirements, I have often run across cases where the same ability was lacking. (…)

Programmers have a superior ability to analyze problems and come up with solutions. They excel at analyzing preconditions, sequences of events, and outcomes. Certainly, this is a key skill in programming, but it is also useful in troubleshooting and business case analysis.(…)

While people typically think of programmers as coders, whose main talent lies in writing the arcane syntax of programming languages. I think that their main talent lies in their ability to analyze, troubleshoot, and solve problems. Code is just the physical manifestation that culminates the thought process of the programmer.(…)

I see two major consequences of this. First, the supply of programmers is constrained because the work requires a fair bit of aptitude that cannot be replaced by training. Second, programmers have a lot of professional options due to these skills, which further hurts supply. I think the economics is in our favor and we’re still lucky to be programmers, though we must be careful during the recession. What do you say? How does it look out there?

Update: you guys have brought up a number of points about the ‘methodology’ behind the Dice.com job numbers. For example, there are seasonal effects on hiring, so it would have been better to compare the two same months. Also, there may be a drop in the usage of Dice.com itself, rather than a drop in the number of available jobs. Besides, many good companies and applicants have turned away from Dice because of poor results for both sides. That is all true. I look at the Dice.com figures as a rough metric. But as a large tech jobs site I think Dice reflects the market at large, albeit imperfectly. A drop of 40% is significant enough that I find it likely it’s a real phenomenon.

Daddy, why does it smell like fire?

So on Sunday morning my daughter runs up to me and asks me why it “smells like fire.”

“FIRE?”

“Yes fire, weee, let’s find the fire! Fire fire fire, woohoo!”

Should I stop reading my book? Run? Panic? Let the house burn down, collect the insurance money, take a hippie name and start a farm?

After a couple of minutes we narrow down the fire smell to my office. Oh shit! I start smelling the backs of my computers, the power supply fans. “GOOD LORD NO!!! NOT MY COMPUTER, NO NO, BURN MY WIFE’S COMPUTER INSTEAD!!!”

Alas, it was my computer. But at least not the main computer, so I count my blessings. At this point I don’t even know what the hell burned, though the computer is dead (turns on, PSU fan comes up, nothing else). I open up the bugger and start looking for some blackened component but find none. No signs of damage. I start smelling all over the motherboard, hard drives, processor, snorting up all sorts of carcinogens and shortening my life by at least a couple years, but it’s all uniformly stinky as far as my nose can tell.

I always buy good power supplies, this one was a ~500W HIPER, sure looked high quality to me. You know, HIPER quality. I’ve never had anything like this before, so not sure what to think. I wonder if all components are damaged. Is this a PSU failure?

Cache: a place for concealment and safekeeping

This post shows briefly how CPU caches are organized in modern Intel processors. Cache discussions often lack concrete examples, obfuscating the simple concepts involved. Or maybe my pretty little head is slow. At any rate, here’s half the story on how a Core 2 L1 cache is accessed:

Selecting an L1 cache set (row)

The unit of data in the cache is the line, which is just a contiguous chunk of bytes in memory. This cache uses 64-byte lines. The lines are stored in cache banks or ways, and each way has a dedicated directory to store its housekeeping information. You can imagine each way and its directory as columns in a spreadsheet, in which case the rows are the sets. Then each cell in the way column contains a cache line, tracked by the corresponding cell in the directory. This particular cache has 64 sets and 8 ways, hence 512 cells to store cache lines, which adds up to 32KB of space.

In this cache’s view of the world, physical memory is divided into 4KB physical pages. Each page has 4KB / 64 bytes == 64 cache lines in it. When you look at a 4KB page, bytes 0 through 63 within that page are in the first cache line, bytes 64-127 in the second cache line, and so on. The pattern repeats for each page, so the 3rd line in page 0 is different than the 3rd line in page 1.

In a fully associative cache any line in memory can be stored in any of the cache cells. This makes storage flexible, but it becomes expensive to search for cells when accessing them. Since the L1 and L2 caches operate under tight constraints of power consumption, physical space, and speed, a fully associative cache is not a good trade off in most scenarios.

Instead, this cache is set associative, which means that a given line in memory can only be stored in one specific set (or row) shown above. So the first line of any physical page (bytes 0-63 within a page) must be stored in row 0, the second line in row 1, etc. Each row has 8 cells available to store the cache lines it is associated with, making this an 8-way associative set. When looking at a memory address, bits 11-6 determine the line number within the 4KB page and therefore the set to be used. For example, physical address 0x800010a0 has 000010 in those bits so it must be stored in set 2.

But we still have the problem of finding which cell in the row holds the data, if any. That’s where the directory comes in. Each cached line is tagged by its corresponding directory cell; the tag is simply the number for the page where the line came from. The processor can address 64GB of physical RAM, so there are 64GB / 4KB == 2^24 of these pages and thus we need 24 bits for our tag. Our example physical address 0x800010a0 corresponds to page number 524,289. Here’s the second half of the story:

Finding cache line by matching tags
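To make the arithmetic concrete, here is a small sketch that splits the example physical address into the three fields discussed so far: the tag (4KB page number), the set index (bits 11-6), and the offset within the line (bits 5-0). The function name is invented for this example.

```python
LINE_BITS = 6    # 64-byte lines -> bits 5..0 are the offset within the line
SET_BITS = 6     # 64 sets       -> bits 11..6 select the set

def split_l1_address(physical_address):
    """Return (tag, set_index, line_offset) for the example L1 cache."""
    line_offset = physical_address & ((1 << LINE_BITS) - 1)
    set_index = (physical_address >> LINE_BITS) & ((1 << SET_BITS) - 1)
    tag = physical_address >> (LINE_BITS + SET_BITS)   # 4KB page number
    return tag, set_index, line_offset

tag, set_index, line_offset = split_l1_address(0x800010a0)
print(tag, set_index, line_offset)   # 524289 2 32
```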

Since we only need to look in one set of 8 ways, the tag matching is very fast; in fact, electrically all tags are compared simultaneously, which I tried to show with the arrows. If there’s a valid cache line with a matching tag, we have a cache hit. Otherwise, the request is forwarded to the L2 cache, and failing that to main system memory. Intel builds large L2 caches by playing with the size and quantity of the ways, but the design is the same. For example, you could turn this into a 64KB cache by adding 8 more ways. Then increase the number of sets to 4096 and each way can store 256KB. These two modifications would deliver a 4MB L2 cache. In this scenario, you’d need 18 bits for the tags and 12 for the set index; the physical page size used by the cache is equal to its way size.
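The sizes and bit counts in this paragraph follow from the number of sets, the number of ways, and the line size. Here is a sketch of that calculation, assuming 64-byte lines and 36-bit physical addresses as in the rest of the post; cache_geometry is a hypothetical helper, not anything from Intel documentation.

```python
def cache_geometry(sets, ways, line_size=64, address_bits=36):
    """Derive sizes and address field widths for a set-associative cache.

    Assumes `sets` and `line_size` are powers of two.
    """
    offset_bits = (line_size - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    return {
        "way_size": sets * line_size,
        "total_size": sets * ways * line_size,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
    }

l1 = cache_geometry(sets=64, ways=8)       # the 32KB L1 described earlier
l2 = cache_geometry(sets=4096, ways=16)    # the 4MB L2 scenario
print(l1)
print(l2)
```

For the L2 scenario this gives a 256KB way, a 12-bit set index, and an 18-bit tag, matching the figures in the text.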

If a set fills up, then a cache line must be evicted before another one can be stored. To avoid this, performance-sensitive programs try to organize their data so that memory accesses are evenly spread among cache lines. For example, suppose a program has an array of 512-byte objects such that some objects are 4KB apart in memory. Fields in these objects fall into the same lines and compete for the same cache set. If the program frequently accesses a given field (e.g., the vtable by calling a virtual method), the set will likely fill up and the cache will start thrashing as lines are repeatedly evicted and later reloaded. Our example L1 cache can only hold the vtables for 8 of these objects due to set size. This is the cost of the set associativity trade-off: we can get cache misses due to set conflicts even when overall cache usage is not heavy. However, due to the relative speeds in a computer, most apps don’t need to worry about this anyway.
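Set conflicts are easy to see in a toy simulation. The sketch below uses the example L1 geometry (64 sets, 8 ways, 64-byte lines) and assumes least-recently-used replacement, which is an assumption of this example and not something the text states about real hardware. It repeatedly touches one field in objects spaced 4KB apart.

```python
from collections import OrderedDict

LINE_SIZE, NUM_SETS, WAYS = 64, 64, 8

def count_misses(addresses):
    """Simulate the example L1 geometry; LRU replacement is an assumption."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        cache_set = sets[line % NUM_SETS]
        if line in cache_set:
            cache_set.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache_set) == WAYS:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses

def sweep(num_objects, rounds=10, stride=4096):
    """Touch the same field of objects spaced `stride` bytes apart."""
    return [i * stride for _ in range(rounds) for i in range(num_objects)]

print(count_misses(sweep(8)))   # 8: everything fits in one set
print(count_misses(sweep(9)))   # 90: every access misses
```

With eight conflicting objects only the first pass misses. Add a ninth and, under the LRU assumption, every single access becomes a miss even though the rest of the cache is empty.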

A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the paging unit to obtain the physical page address used for the cache tags. By contrast, the set index comes from the least significant bits of the linear address and is used without translation (bits 11-6 in our example). Hence the L1 cache is physically tagged but virtually indexed, helping the CPU to parallelize lookup operations. Because the L1 way is never bigger than an MMU page, a given physical memory location is guaranteed to be associated with the same set even with virtual indexing. L2 caches, on the other hand, must be physically tagged and physically indexed because their way size can be bigger than MMU pages. But then again, by the time a request gets to the L2 cache the physical address was already resolved by the L1 cache, so it works out nicely.

Finally, a directory cell also stores the state of its corresponding cached line. A line in the L1 code cache is either Invalid or Shared (which means valid, really). In the L1 data cache and the L2 cache, a line can be in any of the 4 MESI states: Modified, Exclusive, Shared, or Invalid. Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2 cache. These states will play a part in later posts about threading, locking, and that kind of stuff. Next time we’ll look at the front side bus and how memory access really works. This is going to be memory week.

Update: Dave brought up direct-mapped caches in a comment below. They’re basically a special case of set-associative caches that have only one way. In the trade-off spectrum, they’re the opposite of fully associative caches: blazing fast access, lots of conflict misses.

Tagging comments with Must/Should/May

Internet protocols are described by RFCs – request for comments – issued by the Internet Engineering Task Force. The words “must”, “should”, and “may” are used often in these documents to describe what hosts need to do in various situations. As such, the words themselves are defined in RFC 2119.

Back when I did Unix network programming I started to use these words for tagging code comments. Sort of like “hack:” or “todo:” comments, but with a built-in priority. This works great for me, as I can then search for “MUST:”, “SHOULD:”, and “MAY:” tags in the code and see the stuff prioritized. “MUST:” flags unshippable issues, “SHOULD:” is serious business and should be near zero, and “MAY:” is for possible refactorings and low priority stuff. Ideally all tags are temporary of course, as issues are resolved one way or another.
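Searching for the tags is simple enough to script. Here is a minimal sketch of a scanner that lists tagged comments in priority order; the regular expression, the find_tags helper, and the sample snippet are all made up for illustration, and a plain grep would do the same job.

```python
import re

PRIORITY = {"MUST": 0, "SHOULD": 1, "MAY": 2}
TAG_PATTERN = re.compile(r"\b(MUST|SHOULD|MAY):\s*(.*)")

def find_tags(source_text):
    """Return (tag, line_number, note) tuples sorted by tag priority."""
    found = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        match = TAG_PATTERN.search(line)
        if match:
            found.append((match.group(1), line_number, match.group(2).strip()))
    return sorted(found, key=lambda item: (PRIORITY[item[0]], item[1]))

# Made-up snippet, only to exercise the scanner.
example = """\
int fd = open_socket();   /* SHOULD: retry on EINTR */
// MAY: fold this into the helper above
send_all(fd, buf);        // MUST: check the return value
"""
for tag, line_number, note in find_tags(example):
    print(f"{tag:6} line {line_number}: {note}")
```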

In general I believe in writing expressive code rather than comments, but these tags have come in handy. Plus they’re fun – there’s something to be said for cultivating the quirky traditions of computing.

Jon Postel

Counting Infinity

Infinity is a fascinating idea and it is behind some of the most beautiful results in mathematics. Much of this beauty is accessible to everyone, brought to us by the brilliant Georg Cantor using simple arguments that require little math. I hope to show some of this goodness to people who haven’t seen it before. Here we go.

Imagine a shepherd tending a flock of a few dozen sheep. In the morning, the sheep get out of the farm to do sheepish things. In the evening, the shepherd wants to make sure he got all his sheep back. One problem: he can’t count. What’s a shepherd to do? One solution would be to keep track of the sheep as they go out. For example, he could throw a pebble into a bucket for each sheep leaving the farm. In the evening, a pebble goes out for each returned sheep. At any given time, the number of pebbles in the bucket is exactly equal to that of outstanding sheep. You can picture the sets of sheep and pebbles like so:

Bijection of sheep and pebbles

Each sheep is paired with a distinct pebble and there are no leftovers on either side. In set theory this is a bijection. Even though the sheep have not been counted, we know the sets have the same number of elements. They are equivalent in a way, not because you eat pebbles or throw sheep, though I suppose you could, but because of the bijection. If you accept this premise, which is reasonable enough, then you’re in for some fun. Interesting things happen when we use bijections to compare sets nobody can count: the infinite sets. For example, let’s compare the natural numbers (1, 2, 3…) to the even natural numbers (2, 4, 6…). It sounds like we should have fewer even numbers, but lo and behold, we get this:

Bijection of naturals and even naturals

Every natural number has been paired with a distinct even number. No leftovers. Strangely, these sets are equivalent. It turns out many sets are equivalent to the natural numbers; we call these sets denumerable. For example, the set of Turing Machines is denumerable because there are infinitely many machines and they can each be fully described by a distinct natural number. The integers are yet another example:

Bijection of naturals and integers
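Both pairings can be written down as simple functions. The even-number pairing is n to 2n. For the integers, the sketch below uses one common pairing (1 to 0, 2 to 1, 3 to -1, 4 to 2, and so on), which is an assumption and may be ordered differently from the figure. The inverse function is there to show that no integer is left over.

```python
def natural_to_even(n):
    """Pair the natural number n (1, 2, 3, ...) with the even number 2n."""
    return 2 * n

def natural_to_integer(n):
    """One common pairing: 1->0, 2->1, 3->-1, 4->2, 5->-2, ..."""
    return n // 2 if n % 2 == 0 else -(n // 2)

def integer_to_natural(z):
    """Inverse of natural_to_integer, showing nothing is left over."""
    return 2 * z if z > 0 else 1 - 2 * z

print([natural_to_even(n) for n in range(1, 6)])      # [2, 4, 6, 8, 10]
print([natural_to_integer(n) for n in range(1, 8)])   # [0, 1, -1, 2, -2, 3, -3]
```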

Some people see these results as trivial: their intuitive notion is that any “infinity” must be the same. Others see a paradox. How can a part of something be equivalent to the whole? Let’s try to give everyone a paradox by finding a set that is not denumerable. How about the rationals? Surely there must be more fractions than natural numbers! Right? Cantor probably thought so, until he discovered a procedure to enumerate the rationals:

Counting Rationals

Starting at the top and following the pattern, this procedure hits each rational number once. As it moves along, we can associate successive natural numbers with each hit. Here are the first few pairings:

Natural and Rational Pairings
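If you would rather see the enumeration run, here is a rough Python sketch of the same idea. It makes a few assumptions of its own: it only covers the positive rationals, it walks every diagonal in the same direction instead of zigzagging, and it skips fractions that are not in lowest terms so that no number is listed twice. The order may therefore differ from the picture, but the point is the same: every fraction sits on some finite diagonal.

```python
from fractions import Fraction
from itertools import count, islice
from math import gcd


def enumerate_rationals():
    """Yield every positive rational exactly once, one diagonal at a time."""
    for total in count(2):                  # numerator + denominator on this diagonal
        for numerator in range(1, total):
            denominator = total - numerator
            if gcd(numerator, denominator) == 1:    # skip repeats such as 2/2
                yield Fraction(numerator, denominator)


def position_of(target: Fraction) -> int:
    """Return the natural number paired with a positive rational."""
    if target <= 0:
        raise ValueError("this sketch only enumerates positive rationals")
    for n, q in enumerate(enumerate_rationals(), start=1):
        if q == target:
            return n


first_nine = list(islice(enumerate_rationals(), 9))
```

The loop in `position_of` always terminates for a positive fraction, because a fraction a/b in lowest terms lives on the diagonal where numerator plus denominator equals a + b, and only finitely many fractions come before it.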

It is pretty clear that the arrangement above covers all of the rationals, but the crucial point is that Cantor’s pattern will reach any given rational in a finite number of steps. The zigzag is important: if we go off in a single direction (say, across the first row to the right) then we get stuck in the infinitude of that row alone, and there would be rationals left over. The zigzag delivers us from that evil. This argument does not produce an equation for a bijection, but since it builds a listing that contains every rational number, it shows they are denumerable. Conversely, if a set is denumerable its elements can be put into a listing where each element is paired with a natural number.

(As an aside, the 1999 paper Recounting the rationals establishes a precise bijection by using a tree to generate all fractions in reduced form. The tree has all sorts of magical properties, cool stuff. It’s an accessible paper, plus Brent Yorgey wrote a great walkthrough of it.)
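For the curious, here is a small sketch of that tree as I understand it from the paper: the root is 1/1, and a node a/b has a/(a+b) as its left child and (a+b)/b as its right child. Reading the tree level by level, left to right, gives the listing. Treat the code as an illustration under that assumption, not as the paper's own presentation; the names are made up for this example.

```python
from collections import deque
from itertools import islice
from math import gcd


def tree_pairs():
    """Yield (numerator, denominator) pairs level by level, left to right."""
    queue = deque([(1, 1)])                 # the root of the tree is 1/1
    while queue:
        a, b = queue.popleft()
        yield a, b
        queue.append((a, a + b))            # left child:  a / (a + b)
        queue.append((a + b, b))            # right child: (a + b) / b


def is_reduced(a: int, b: int) -> bool:
    """True when a/b is already in lowest terms."""
    return gcd(a, b) == 1


first_eight = list(islice(tree_pairs(), 8))
top_levels = list(islice(tree_pairs(), 1023))   # the first 10 levels of the tree
```

Nothing in the code ever reduces a fraction, yet every pair it produces is already in lowest terms and no pair shows up twice, which is a good part of the tree's charm.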

All of our sets so far have been denumerable, except for the sheep, which were finite. At this point, Cantor might have wondered if maybe there is only one infinity. But then again, if you build a number line using the sets we have seen so far, you end up with holes. The naturals are mere dots on the number line; most fractions fall in a hole:

A Hole In the Naturals

The rationals, on the other hand, are dense. Between any two given rationals, there is another rational (an infinite number of them actually). Yet the rational number line also has holes, which are the irrational numbers. Below is a famous irrational number, on whose account its discoverer was allegedly drowned at sea, shown along the rational number line:

A Hole In the Rationals

Irrational numbers cannot be expressed as a fraction; moreover, when written out as numbers using positional notation (say, as decimal numbers) their digits never settle into any kind of periodic pattern. Fractions, by contrast, eventually settle into a pattern when written out (e.g., 0.333…, 0.25000…). All of the rationals plus all of the irrational numbers make up the real numbers. And those guys form a number line free of holes. Any point in the number line is a real number; any crazy sequence of digits you come up with is a real number. It’s the continuum. You might get the impression that irrational numbers are rare, oddballs among the familiar rationals, but that’s not the case at all. The sparse number line above is an attempt at visualizing that fact, in addition to showcasing my superior preschool MS Paint skillz. The reals are a vast ocean of irrationals punctuated here and there by a fraction, and we’ll see why.
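The claim that fractions always settle into a pattern is easy to check with long division. When dividing by b, each step leaves a remainder between 0 and b - 1, so after at most b steps some remainder must repeat, and from then on the digits repeat too. Here is a minimal Python sketch of that bookkeeping, assuming a non-negative numerator and a positive denominator; the function name and return shape are just for illustration.

```python
def decimal_expansion(numerator: int, denominator: int):
    """Return (integer_part, prefix_digits, repeating_digits) of a fraction."""
    integer_part, remainder = divmod(numerator, denominator)
    digits = []
    seen = {}                               # remainder -> index of the digit it produces
    while remainder not in seen:
        seen[remainder] = len(digits)
        digit, remainder = divmod(remainder * 10, denominator)
        digits.append(str(digit))
    start = seen[remainder]                 # the repeating block begins here
    return integer_part, "".join(digits[:start]), "".join(digits[start:])


one_third = decimal_expansion(1, 3)         # 0.333...
one_quarter = decimal_expansion(1, 4)       # 0.25000...
one_seventh = decimal_expansion(1, 7)
```

For 1/4 the repeating block is just "0", which is the "pattern of zeros" case. An irrational number can never be produced this way, since its digits never fall into a repeating block.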

So now Cantor pits the real numbers against the naturals. He shows that even a section of the reals – the interval between 0 and 1 – is not denumerable. He uses a proof by contradiction, which is a common way to show that something cannot be true. It goes like this: suppose there is a way to enumerate the reals between 0 and 1. Then there is a listing containing every real number in that interval, each one associated with a different natural number. It might look like this:

Cantor's diagonal

The actual numbers shown above are not important, since the complete list is infinite and has every real number in the interval. Some of the reals up there are clearly rationals and have settled into a periodic pattern (0.250… and 0.500…). The others look pretty random; they could be irrational, or maybe they haven’t settled into their periodic patterns yet. Either way, for each number the digits go on indefinitely, rationals in a pattern (possibly of zeros) and irrationals without ever settling into one.

Notice how the numbers in red form a diagonal, which is itself infinite. Let’s build a number using that diagonal. Going down the list, for each row we pick a digit that is different from the digit in red. For example, we could add 1 to each digit (9 becomes 0, no carry). You can use any rule as long as the digit is different. Here are the first few digits we get by adding 1 to the red diagonal digits:

Diagonal paradox

The digits of this number p go on infinitely; it is a real number between 0 and 1. But due to how we have defined p, it is different from every number in the list: it differs from the first number in the first digit, from the second number in the second digit, and in general from the nth number in the nth digit. Even though the list is infinite, this number is not in it! But this list was supposed to have every real number in the interval, so we have a contradiction. Hence our initial assumption must be false; the reals are not denumerable.
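You can play with the construction on a finite corner of a listing. The sketch below is only an illustration: the rows are digits I made up (they are not the ones in the picture), and a program can only ever look at finitely many rows and digits, whereas the real argument runs down the whole infinite diagonal. It uses the same "add 1, 9 becomes 0" rule.

```python
def diagonal_number(rows):
    """Build the first len(rows) digits of p from the diagonal of the listing."""
    return "".join(str((int(row[i]) + 1) % 10) for i, row in enumerate(rows))


# Hypothetical listing: digits after the decimal point of seven reals in (0, 1).
listing = [
    "2500000",
    "5000000",
    "1415926",
    "7182818",
    "3333333",
    "1428571",
    "4142135",
]

p = diagonal_number(listing)

# p disagrees with row i at digit i, so it cannot be equal to any row.
differs_everywhere = all(p[i] != row[i] for i, row in enumerate(listing))
```

However the rows are chosen or reordered, `differs_everywhere` comes out true, because the rule never leaves a diagonal digit unchanged.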

The notion of different infinities was a revolutionary one in mathematics. It is a stunning result which ignited the mother of all mathematician flame wars. This was actually the second proof Cantor offered for the non-denumerability of the continuum. The first proof is also valid, but slightly more technical. Cantor’s diagonal argument, however, is amazingly simple and highly original, as brilliant as its conclusion. It went on to feature prominently in other intellectual landmarks. Alan Turing used it while proving that a Turing Machine cannot predict the behavior of other machines (specifically, predict whether another machine is “circle-free”, a question closely related to what is now known as the Halting Problem). Gödel used it in his Incompleteness Theorem, discussed in Gödel, Escher, Bach. Diagonalization is crucial in recursion theory and complexity theory.

In order to capture these new concepts, Cantor proposed cardinality as a measure of how numerous a set is. A finite set’s cardinality is simply how many elements it has, say 49 for our set of sheep. But the cardinality of an infinite set is expressed by a transfinite number. To make them more esoteric, Cantor used the Hebrew letter aleph for these numbers. By definition, the naturals took aleph zero, ℵ₀, for their cardinality. Every denumerable set has cardinality ℵ₀. The cardinality of the reals was defined as c, and Cantor spent years trying to discover whether c = ℵ₁, that is, whether the cardinality of the reals is the ‘next bigger one’ after the naturals. So that explains the name in Mona Lisa Overdrive and also the handle for Aleph One, who wrote the classic paper on stack buffer overflows during the Phrack renaissance.

This concludes our first foray into the infinities. There’s way too much to talk about. If there’s interest I’ll write a follow-up to see where else infinity leads us. Iterative rather than upfront blogging :) Meanwhile, Journey Through Genius is a great book that does for many theorems, including Cantor’s, what I did in this post. Only much better. Thanks for reading!