The previous post explained how computers boot up right up to the point where the boot loader, after stuffing the kernel image into memory, is about to jump into the kernel entry point. This last post about booting takes a look at the guts of the kernel to see how an operating system starts life. Since I have an empirical bent I’ll link heavily to the sources for Linux kernel 188.8.131.52 at the Linux Cross Reference. The sources are very readable if you are familiar with C-like syntax; even if you miss some details you can get the gist of what’s happening. The main obstacle is the lack of context around some of the code, such as when or why it runs or the underlying features of the machine. I hope to provide a bit of that context. Due to brevity (hah!) a lot of fun stuff – like interrupts and memory – gets only a nod for now. The post ends with the highlights for the Windows boot.
At this point in the Intel x86 boot story the processor is running in real-mode, is able to address 1 MB of memory, and RAM looks like this for a modern Linux system:
RAM contents after boot loader is done
The kernel image has been loaded to memory by the boot loader using the BIOS disk I/O services. This image is an exact copy of the file in your hard drive that contains the kernel, e.g. /boot/vmlinuz-2.6.22-14-server. The image is split into two pieces: a small part containing the real-mode kernel code is loaded below the 640K barrier; the bulk of the kernel, which runs in protected mode, is loaded after the first megabyte of memory.
The action starts in the real-mode kernel header pictured above. This region of memory is used to implement the Linux boot protocol between the boot loader and the kernel. Some of the values there are read by the boot loader while doing its work. These include amenities such as a human-readable string containing the kernel version, but also crucial information like the size of the real-mode kernel piece. The boot loader also writes values to this region, such as the memory address for the command-line parameters given by the user in the boot menu. Once the boot loader is finished it has filled in all of the parameters required by the kernel header. It’s then time to jump into the kernel entry point. The diagram below shows the code sequence for the kernel initialization, along with source directories, files, and line numbers:
Architecture-specific Linux Kernel Initialization
The early kernel start-up for the Intel architecture is in file arch/x86/boot/header.S. It’s in assembly language, which is rare for the kernel at large but common for boot code. The start of this file actually contains boot sector code, a left over from the days when Linux could work without a boot loader. Nowadays this boot sector, if executed, only prints a “bugger_off_msg” to the user and reboots. Modern boot loaders ignore this legacy code. After the boot sector code we have the first 15 bytes of the real-mode kernel header; these two pieces together add up to 512 bytes, the size of a typical disk sector on Intel hardware.
After these 512 bytes, at offset 0×200, we find the very first instruction that runs as part of the Linux kernel: the real-mode entry point. It’s in header.S:110 and it is a 2-byte jump written directly in machine code as 0x3aeb. You can verify this by running hexdump on your kernel image and seeing the bytes at that offset – just a sanity check to make sure it’s not all a dream. The boot loader jumps into this location when it is finished, which in turn jumps to header.S:229 where we have a regular assembly routine called start_of_setup. This short routine sets up a stack, zeroes the bss segment (the area that contains static variables, so they start with zero values) for the real-mode kernel and then jumps to good old C code at arch/x86/boot/main.c:122.
main() does some house keeping like detecting memory layout, setting a video mode, etc. It then calls go_to_protected_mode(). Before the CPU can be set to protected mode, however, a few tasks must be done. There are two main issues: interrupts and memory. In real-mode the interrupt vector table for the processor is always at memory address 0, whereas in protected mode the location of the interrupt vector table is stored in a CPU register called IDTR. Meanwhile, the translation of logical memory addresses (the ones programs manipulate) to linear memory addresses (a raw number from 0 to the top of the memory) is different between real-mode and protected mode. Protected mode requires a register called GDTR to be loaded with the address of a Global Descriptor Table for memory. So go_to_protected_mode() calls setup_idt() and setup_gdt() to install a temporary interrupt descriptor table and global descriptor table.
We’re now ready for the plunge into protected mode, which is done by protected_mode_jump, another assembly routine. This routine enables protected mode by setting the PE bit in the CR0 CPU register. At this point we’re running with paging disabled; paging is an optional feature of the processor, even in protected mode, and there’s no need for it yet. What’s important is that we’re no longer confined to the 640K barrier and can now address up to 4GB of RAM. The routine then calls the 32-bit kernel entry point, which is startup_32 for compressed kernels. This routine does some basic register initializations and calls decompress_kernel(), a C function to do the actual decompression.
decompress_kernel() prints the familiar “Decompressing Linux…” message. Decompression happens in-place and once it’s finished the uncompressed kernel image has overwritten the compressed one pictured in the first diagram. Hence the uncompressed contents also start at 1MB. decompress_kernel() then prints “done.” and the comforting “Booting the kernel.” By “Booting” it means a jump to the final entry point in this whole story, given to Linus by God himself atop Mountain Halti, which is the protected-mode kernel entry point at the start of the second megabyte of RAM (0×100000). That sacred location contains a routine called, uh, startup_32. But this one is in a different directory, you see.
The second incarnation of startup_32 is also an assembly routine, but it contains 32-bit mode initializations. It clears the bss segment for the protected-mode kernel (which is the true kernel that will now run until the machine reboots or shuts down), sets up the final global descriptor table for memory, builds page tables so that paging can be turned on, enables paging, initializes a stack, creates the final interrupt descriptor table, and finally jumps to to the architecture-independent kernel start-up, start_kernel(). The diagram below shows the code flow for the last leg of the boot:
Architecture-independent Linux Kernel Initialization
start_kernel() looks more like typical kernel code, which is nearly all C and machine independent. The function is a long list of calls to initializations of the various kernel subsystems and data structures. These include the scheduler, memory zones, time keeping, and so on. start_kernel() then calls rest_init(), at which point things are almost all working. rest_init() creates a kernel thread passing another function, kernel_init(), as the entry point. rest_init() then calls schedule() to kickstart task scheduling and goes to sleep by calling cpu_idle(), which is the idle thread for the Linux kernel. cpu_idle() runs forever and so does process zero, which hosts it. Whenever there is work to do – a runnable process – process zero gets booted out of the CPU, only to return when no runnable processes are available.
But here’s the kicker for us. This idle loop is the end of the long thread we followed since boot, it’s the final descendent of the very first jump executed by the processor after power up. All of this mess, from reset vector to BIOS to MBR to boot loader to real-mode kernel to protected-mode kernel, all of it leads right here, jump by jump by jump it ends in the idle loop for the boot processor, cpu_idle(). Which is really kind of cool. However, this can’t be the whole story otherwise the computer would do no work.
At this point, the kernel thread started previously is ready to kick in, displacing process 0 and its idle thread. And so it does, at which point kernel_init() starts running since it was given as the thread entry point. kernel_init() is responsible for initializing the remaining CPUs in the system, which have been halted since boot. All of the code we’ve seen so far has been executed in a single CPU, called the boot processor. As the other CPUs, called application processors, are started they come up in real-mode and must run through several initializations as well. Many of the code paths are common, as you can see in the code for startup_32, but there are slight forks taken by the late-coming application processors. Finally, kernel_init() calls init_post(), which tries to execute a user-mode process in the following order: /sbin/init, /etc/init, /bin/init, and /bin/sh. If all fail, the kernel will panic. Luckily init is usually there, and starts running as PID 1. It checks its configuration file to figure out which processes to launch, which might include X11 Windows, programs for logging in on the console, network daemons, and so on. Thus ends the boot process as yet another Linux box starts running somewhere. May your uptime be long and untroubled.
The process for Windows is similar in many ways, given the common architecture. Many of the same problems are faced and similar initializations must be done. When it comes to boot one of the biggest differences is that Windows packs all of the real-mode kernel code, and some of the initial protected mode code, into the boot loader itself (C:\NTLDR). So instead of having two regions in the same kernel image, Windows uses different binary images. Plus Linux completely separates boot loader and kernel; in a way this automatically falls out of the open source process. The diagram below shows the main bits for the Windows kernel:
Windows Kernel Initialization
The Windows user-mode start-up is naturally very different. There’s no /sbin/init, but rather Csrss.exe and Winlogon.exe. Winlogon spawns Services.exe, which starts all of the Windows Services, and Lsass.exe, the local security authentication subsystem. The classic Windows login dialog runs in the context of Winlogon.
This is the end of this boot series. Thanks everyone for reading and for feedback. I’m sorry some things got superficial treatment; I’ve gotta start somewhere and only so much fits into blog-sized bites. But nothing like a day after the next; my plan is to do regular “Software Illustrated” posts like this series along with other topics. Meanwhile, here are some resources:
- The best, most important resource, is source code for real kernels, either Linux or one of the BSDs.
- Intel publishes excellent Software Developer’s Manuals, which you can download for free.
- Understanding the Linux Kernel is a good book and walks through a lot of the Linux Kernel sources. It’s getting outdated and it’s dry, but I’d still recommend it to anyone who wants to grok the kernel. Linux Device Drivers is more fun, teaches well, but is limited in scope. Finally, Patrick Moroney suggested Linux Kernel Development by Robert Love in the comments for this post. I’ve heard other positive reviews for that book, so it sounds worth checking out.
- For Windows, the best reference by far is Windows Internals by David Solomon and Mark Russinovich, the latter of Sysinternals fame. This is a great book, well-written and thorough. The main downside is the lack of source code.
The previous post described motherboards and the memory map in Intel computers to set the scene for the initial phases of boot. Booting is an involved, hacky, multi-stage affair – fun stuff. Here’s an outline of the process:
An outline of the boot sequence
Things start rolling when you press the power button on the computer (no! do tell!). Once the motherboard is powered up it initializes its own firmware – the chipset and other tidbits – and tries to get the CPU running. If things fail at this point (e.g., the CPU is busted or missing) then you will likely have a system that looks completely dead except for rotating fans. A few motherboards manage to emit beeps for an absent or faulty CPU, but the zombie-with-fans state is the most common scenario based on my experience. Sometimes USB or other devices can cause this to happen: unplugging all non-essential devices is a possible cure for a system that was working and suddenly appears dead like this. You can then single out the culprit device by elimination.
If all is well the CPU starts running. In a multi-processor or multi-core system one CPU is dynamically chosen to be the bootstrap processor (BSP) that runs all of the BIOS and kernel initialization code. The remaining processors, called application processors (AP) at this point, remain halted until later on when they are explicitly activated by the kernel. Intel CPUs have been evolving over the years but they’re fully backwards compatible, so modern CPUs can behave like the original 1978 Intel 8086, which is exactly what they do after power up. In this primitive power up state the processor is in real mode with memory paging disabled. This is like ancient MS-DOS where only 1 MB of memory can be addressed and any code can write to any place in memory – there’s no notion of protection or privilege.
Most registers in the CPU have well-defined values after power up, including the instruction pointer (EIP) which holds the memory address for the instruction being executed by the CPU. Intel CPUs use a hack whereby even though only 1MB of memory can be addressed at power up, a hidden base address (an offset, essentially) is applied to EIP so that the first instruction executed is at address 0xFFFFFFF0 (16 bytes short of the end of 4 gigs of memory and well above one megabyte). This magical address is called the reset vector and is standard for modern Intel CPUs.
The motherboard ensures that the instruction at the reset vector is a jump to the memory location mapped to the BIOS entry point. This jump implicitly clears the hidden base address present at power up. All of these memory locations have the right contents needed by the CPU thanks to the memory map kept by the chipset. They are all mapped to flash memory containing the BIOS since at this point the RAM modules have random crap in them. An example of the relevant memory regions is shown below:
Important memory regions during boot
The CPU then starts executing BIOS code, which initializes some of the hardware in the machine. Afterwards the BIOS kicks off the Power-on Self Test (POST) which tests various components in the computer. Lack of a working video card fails the POST and causes the BIOS to halt and emit beeps to let you know what’s wrong, since messages on the screen aren’t an option. A working video card takes us to a stage where the computer looks alive: manufacturer logos are printed, memory starts to be tested, angels blare their horns. Other POST failures, like a missing keyboard, lead to halts with an error message on the screen. The POST involves a mixture of testing and initialization, including sorting out all the resources – interrupts, memory ranges, I/O ports – for PCI devices. Modern BIOSes that follow the Advanced Configuration and Power Interface build a number of data tables that describe the devices in the computer; these tables are later used by the kernel.
After the POST the BIOS wants to boot up an operating system, which must be found somewhere: hard drives, CD-ROM drives, floppy disks, etc. The actual order in which the BIOS seeks a boot device is user configurable. If there is no suitable boot device the BIOS halts with a complaint like “Non-System Disk or Disk Error.” A dead hard drive might present with this symptom. Hopefully this doesn’t happen and the BIOS finds a working disk allowing the boot to proceed.
The BIOS now reads the first 512-byte sector (sector zero) of the hard disk. This is called the Master Boot Record and it normally contains two vital components: a tiny OS-specific bootstrapping program at the start of the MBR followed by a partition table for the disk. The BIOS however does not care about any of this: it simply loads the contents of the MBR into memory location 0x7c00 and jumps to that location to start executing whatever code is in the MBR.
Master Boot Record
The specific code in the MBR could be a Windows MBR loader, code from Linux loaders such as LILO or GRUB, or even a virus. In contrast the partition table is standardized: it is a 64-byte area with four 16-byte entries describing how the disk has been divided up (so you can run multiple operating systems or have separate volumes in the same disk). Traditionally Microsoft MBR code takes a look at the partition table, finds the (only) partition marked as active, loads the boot sector for that partition, and runs that code. The boot sector is the first sector of a partition, as opposed to the first sector for the whole disk. If something is wrong with the partition table you would get messages like “Invalid Partition Table” or “Missing Operating System.” This message does not come from the BIOS but rather from the MBR code loaded from disk. Thus the specific message depends on the MBR flavor.
Boot loading has gotten more sophisticated and flexible over time. The Linux boot loaders Lilo and GRUB can handle a wide variety of operating systems, file systems, and boot configurations. Their MBR code does not necessarily follow the “boot the active partition” approach described above. But functionally the process goes like this:
- The MBR itself contains the first stage of the boot loader. GRUB calls this stage 1.
- Due to its tiny size, the code in the MBR does just enough to load another sector from disk that contains additional boostrap code. This sector might be the boot sector for a partition, but could also be a sector that was hard-coded into the MBR code when the MBR was installed.
- The MBR code plus code loaded in step 2 then read a file containing the second stage of the boot loader. In GRUB this is GRUB Stage 2, and in Windows Server this is c:\NTLDR. If step 2 fails in Windows you’d get a message like “NTLDR is missing”. The stage 2 code then reads a boot configuration file (e.g., grub.conf in GRUB, boot.ini in Windows). It then presents boot choices to the user or simply goes ahead in a single-boot system.
- At this point the boot loader code needs to fire up a kernel. It must know enough about file systems to read the kernel from the boot partition. In Linux this means reading a file like “vmlinuz-2.6.22-14-server” containing the kernel, loading the file into memory and jumping to the kernel bootstrap code. In Windows Server 2003 some of the kernel start-up code is separate from the kernel image itself and is actually embedded into NTLDR. After performing several initializations, NTDLR loads the kernel image from file c:\Windows\System32\ntoskrnl.exe and, just as GRUB does, jumps to the kernel entry point.
There’s a complication worth mentioning (aka, I told you this thing is hacky). The image for a current Linux kernel, even compressed, does not fit into the 640K of RAM available in real mode. My vanilla Ubuntu kernel is 1.7 MB compressed. Yet the boot loader must run in real mode in order to call the BIOS routines for reading from the disk, since the kernel is clearly not available at that point. The solution is the venerable unreal mode. This is not a true processor mode (I wish the engineers at Intel were allowed to have fun like that), but rather a technique where a program switches back and forth between real mode and protected mode in order to access memory above 1MB while still using the BIOS. If you read GRUB source code, you’ll see these transitions all over the place (look under stage2/ for calls to real_to_prot and prot_to_real). At the end of this sticky process the loader has stuffed the kernel in memory, by hook or by crook, but it leaves the processor in real mode when it’s done.
We’re now at the jump from “Boot Loader” to “Early Kernel Initialization” as shown in the first diagram. That’s when things heat up as the kernel starts to unfold and set things in motion. The next post will be a guided tour through the Linux Kernel initialization with links to sources at the Linux Cross Reference. I can’t do the same for Windows but I’ll point out the highlights.
[Update: cleared up discussion of NTLDR.]
I’m going to write a few posts about computer internals with the goal of explaining how modern kernels work. I hope to make them useful to enthusiasts and programmers who are interested in this stuff but don’t have experience with it. The focus is on Linux, Windows, and Intel processors. Internals are a hobby for me, I have written a fair bit of kernel-mode code but haven’t done so in a while. This first post describes the layout of modern Intel-based motherboards, how the CPU accesses memory and the system memory map.
To start off let’s take a look at how an Intel computer is wired up nowadays. The diagram below shows the main components in a motherboard and dubious color taste:
Diagram for modern motherboard. The northbridge and southbridge make up the chipset.
As you look at this, the crucial thing to keep in mind is that the CPU doesn’t really know anything about what it’s connected to. It talks to the outside world through its pins but it doesn’t care what that outside world is. It might be a motherboard in a computer but it could be a toaster, network router, brain implant, or CPU test bench. There are three main ways by which the CPU and the outside communicate: memory address space, I/O address space, and interrupts. We only worry about motherboards and memory for now.
In a motherboard the CPU’s gateway to the world is the front-side bus connecting it to the northbridge. Whenever the CPU needs to read or write memory it does so via this bus. It uses some pins to transmit the physical memory address it wants to write or read, while other pins send the value to be written or receive the value being read. An Intel Core 2 QX6600 has 33 pins to transmit the physical memory address (so there are 233 choices of memory locations) and 64 pins to send or receive data (so data is transmitted in a 64-bit data path, or 8-byte chunks). This allows the CPU to physically address 64 gigabytes of memory (233 locations * 8 bytes) although most chipsets only handle up to 8 gigs of RAM.
Now comes the rub. We’re used to thinking of memory only in terms of RAM, the stuff programs read from and write to all the time. And indeed most of the memory requests from the processor are routed to RAM modules by the northbridge. But not all of them. Physical memory addresses are also used for communication with assorted devices on the motherboard (this communication is called memory-mapped I/O). These devices include video cards, most PCI cards (say, a scanner or SCSI card), and also the flash memory that stores the BIOS.
When the northbridge receives a physical memory request it decides where to route it: should it go to RAM? Video card maybe? This routing is decided via the memory address map. For each region of physical memory addresses, the memory map knows the device that owns that region. The bulk of the addresses are mapped to RAM, but when they aren’t the memory map tells the chipset which device should service requests for those addresses. This mapping of memory addresses away from RAM modules causes the classic hole in PC memory between 640KB and 1MB. A bigger hole arises when memory addresses are reserved for video cards and PCI devices. This is why 32-bit OSes have problems using 4 gigs of RAM. In Linux the file /proc/iomem neatly lists these address range mappings. The diagram below shows a typical memory map for the first 4 gigs of physical memory addresses in an Intel PC:
Memory layout for the first 4 gigabytes in an Intel system.
Actual addresses and ranges depend on the specific motherboard and devices present in the computer, but most Core 2 systems are pretty close to the above. All of the brown regions are mapped away from RAM. Remember that these are physical addresses that are used on the motherboard buses. Inside the CPU (for example, in the programs we run and write), the memory addresses are logical and they must be translated by the CPU into a physical address before memory is accessed on the bus.
The rules for translation of logical addresses into physical addresses are complex and they depend on the mode in which the CPU is running (real mode, 32-bit protected mode, and 64-bit protected mode). Regardless of the translation mechanism, the CPU mode determines how much physical memory can be accessed. For example, if the CPU is running in 32-bit mode, then it is only capable of physically addressing 4 GB (well, there is an exception called physical address extension, but ignore it for now). Since the top 1 GB or so of physical addresses are mapped to motherboard devices the CPU can effectively use only ~3 GB of RAM (sometimes less – I have a Vista machine where only 2.4 GB are usable). If the CPU is in real mode, then it can only address 1 megabyte of physical RAM (this is the only mode early Intel processors were capable of). On the other hand, a CPU running in 64-bit mode can physically access 64GB (few chipsets support that much RAM though). In 64-bit mode it is possible to use physical addresses above the total RAM in the system to access the RAM regions that correspond to physical addresses stolen by motherboard devices. This is called reclaiming memory and it’s done with help from the chipset.
That’s all the memory we need for the next post, which describes the boot process from power up until the boot loader is about to jump into the kernel. If you’d like to learn more about this stuff, I highly recommend the Intel manuals. I’m big into primary sources overall, but the Intel manuals in particular are well written and accurate. Here are some:
- Datasheet for Intel G35 Chipset documents a representative chipset for Core 2 processors. This is the main source for this post.
- Datasheet for Intel Core 2 Quad-Core Q6000 Sequence is a processor datasheet. It documents each pin in the processor (there aren’t that many actually, and after you group them there’s really not a lot to it). Fascinating stuff, though some bits are arcane.
- The Intel Software Developer’s Manuals are outstanding. Far from arcane, they explain beautifully all sorts of things about the architecture. Volumes 1 and 3A have the good stuff (don’t be put off by the name, the “volumes” are small and you can read selectively).
- Pádraig Brady suggested that I link to Ulrich Drepper’s excellent paper on memory. It’s great stuff. I was waiting to link to it in a post about memory, but the more the merrier.