How The Kernel Manages Your Memory

After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory. Here is gonzo again:

Linux kernel mm_struct

Linux processes are implemented in the kernel as instances of task_struct, the process descriptor. The mm field in task_struct points to the memory descriptor, mm_struct, which is an executive summary of a program’s memory. It stores the start and end of memory segments as shown above, the number of physical memory pages used by the process (rss stands for Resident Set Size), the amount of virtual address space used, and other tidbits. Within the memory descriptor we also find the two workhorses for managing program memory: the set of virtual memory areas and the page tables. Gonzo’s memory areas are shown below:

Kernel memory descriptor and memory areas

Each virtual memory area (VMA) is a contiguous range of virtual addresses; these areas never overlap. An instance of vm_area_struct fully describes a memory area, including its start and end addresses, flags to determine access rights and behaviors, and the vm_file field to specify which file is being mapped by the area, if any. A VMA that does not map a file is anonymous. Each memory segment above (e.g., heap, stack) corresponds to a single VMA, with the exception of the memory mapping segment. This is not a requirement, though it is usual in x86 machines. VMAs do not care which segment they are in.

A program’s VMAs are stored in its memory descriptor both as a linked list in the mmap field, ordered by starting virtual address, and as a red-black tree rooted at the mm_rb field. The red-black tree allows the kernel to search quickly for the memory area covering a given virtual address. When you read file /proc/pid_of_process/maps, the kernel is simply going through the linked list of VMAs for the process and printing each one.
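To make the lookup concrete, here is a minimal user-space sketch in Python rather than kernel code. It parses lines in the /proc/pid/maps format into a sorted list and searches it with a binary search, which stands in for the red-black tree. The Vma class, the sample addresses, and the /bin/gonzo path are made up for illustration. The lookup mirrors the kernel's find_vma() in one respect worth knowing: it returns the first area that ends above the address, so the caller still has to check that the area actually starts at or below it.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Vma:
    start: int   # first address in the area (inclusive)
    end: int     # first address past the area (exclusive)
    perms: str   # e.g. "r-xp", as printed in /proc/<pid>/maps
    path: str    # backing file or label; "" for an unnamed anonymous area


# Hypothetical sample in the /proc/<pid>/maps format (addresses are made up).
SAMPLE = """\
08048000-0804c000 r-xp 00000000 08:01 1234 /bin/gonzo
0804c000-0804d000 rw-p 00003000 08:01 1234 /bin/gonzo
0804d000-0806e000 rw-p 00000000 00:00 0 [heap]
bffdf000-c0000000 rw-p 00000000 00:00 0 [stack]
"""


def parse_maps(text):
    """Parse maps-style lines into a list of Vma sorted by start address."""
    vmas = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        lo, hi = (int(x, 16) for x in fields[0].split("-"))
        path = fields[5].strip() if len(fields) > 5 else ""
        vmas.append(Vma(lo, hi, fields[1], path))
    return sorted(vmas, key=lambda v: v.start)


def find_vma(vmas, addr):
    """Return the first VMA whose end is above addr, or None.

    The result may start above addr, so callers that need a VMA
    covering addr must also check vma.start <= addr.
    """
    ends = [v.end for v in vmas]  # sorted, because VMAs never overlap
    i = bisect.bisect_right(ends, addr)
    return vmas[i] if i < len(vmas) else None
```

On a Linux box you could feed it the real thing with parse_maps(open("/proc/self/maps").read()).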

In Windows, the EPROCESS block is roughly a mix of task_struct and mm_struct. The Windows analog to a VMA is the Virtual Address Descriptor, or VAD; they are stored in an AVL tree. You know what the funniest thing about Windows and Linux is? It’s the little differences.

The 4GB virtual address space is divided into pages. x86 processors in 32-bit mode support page sizes of 4KB, 2MB, and 4MB. Both Linux and Windows map the user portion of the virtual address space using 4KB pages. Bytes 0-4095 fall in page 0, bytes 4096-8191 fall in page 1, and so on. The size of a VMA must be a multiple of page size. Here’s 3GB of user space in 4KB pages:
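The page arithmetic above is just shifting and masking. A small Python sketch, with helper names of my own choosing, shows how an address splits into a page number and an offset, and how many 4KB pages fit in 3GB of user space:

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT  # 4096 bytes


def page_number(addr):
    """Index of the 4KB page that contains addr."""
    return addr >> PAGE_SHIFT


def page_offset(addr):
    """Byte offset of addr within its page."""
    return addr & (PAGE_SIZE - 1)


def page_align_up(size):
    """Round size up to a multiple of the page size, as a VMA size must be."""
    return (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


USER_SPACE = 3 << 30                    # 3GB of user space
USER_PAGES = USER_SPACE // PAGE_SIZE    # number of 4KB pages in it
```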

4KB Pages Virtual User Space

The processor consults page tables to translate a virtual address into a physical memory address. Each process has its own set of page tables; whenever a process switch occurs, page tables for user space are switched as well. Linux stores a pointer to a process’ page tables in the pgd field of the memory descriptor. To each virtual page there corresponds one page table entry (PTE) in the page tables, which in regular x86 paging is a simple 4-byte record shown below:
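As a rough illustration of the translation, here is a toy model of regular two-level x86 paging in Python. In that scheme the top 10 bits of a virtual address index the page directory, the next 10 bits index a page table, and the low 12 bits are the offset within the page. The nested dictionaries are a stand-in for the real in-memory tables, and the model ignores flags entirely.

```python
def split_address(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    return vaddr >> 22, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF


def translate(page_directory, vaddr):
    """Walk a toy two-level page table; return the physical address or None.

    page_directory maps a directory index to a page table, and each page
    table maps a table index to the physical base address of a 4KB frame.
    """
    dir_idx, tbl_idx, offset = split_address(vaddr)
    page_table = page_directory.get(dir_idx)
    if page_table is None:
        return None  # no page table covers this 4MB region
    frame_base = page_table.get(tbl_idx)
    if frame_base is None:
        return None  # no mapping for this page
    return frame_base | offset
```

For example, with page_directory = {32: {0x49: 0x1F000}}, the virtual address 0x08049123 translates to physical address 0x1F123.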

x86 Page Table Entry (PTE) for 4KB page

Linux has functions to read and set each flag in a PTE. Bit P tells the processor whether the virtual page is present in physical memory. If clear (equal to 0), accessing the page triggers a page fault. Keep in mind that when this bit is zero, the kernel can do whatever it pleases with the remaining fields. The R/W flag stands for read/write; if clear, the page is read-only. Flag U/S stands for user/supervisor; if clear, then the page can only be accessed by the kernel. These flags are used to implement the read-only memory and protected kernel space we saw before.

Bits D and A are for dirty and accessed. A dirty page has had a write, while an accessed page has had a write or read. Both flags are sticky: the processor only sets them; they must be cleared by the kernel. Finally, the PTE stores the starting physical address that corresponds to this page, aligned to 4KB. This naive-looking field is the source of some pain, for it limits addressable physical memory to 4GB. The other PTE fields are for another day, as is Physical Address Extension.
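To see how these fields pack into 4 bytes, here is a short Python sketch that decodes a PTE for a 4KB page using the standard x86 bit positions. The constant and function names are mine, not the kernel's.

```python
PTE_PRESENT = 1 << 0   # P
PTE_RW = 1 << 1        # R/W
PTE_USER = 1 << 2      # U/S
PTE_ACCESSED = 1 << 5  # A
PTE_DIRTY = 1 << 6     # D
PTE_FRAME_MASK = 0xFFFFF000  # top 20 bits: 4KB-aligned physical address


def decode_pte(pte):
    """Break a 32-bit PTE for a 4KB page into its main fields."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_RW),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
        "frame_address": pte & PTE_FRAME_MASK,
    }


# 20 bits of frame number times 4KB per frame is where the 4GB limit comes from.
MAX_PHYSICAL = (1 << 20) * 4096
```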

A virtual page is the unit of memory protection because all of its bytes share the U/S and R/W flags. However, the same physical memory could be mapped by different pages, possibly with different protection flags. Notice that execute permissions are nowhere to be seen in the PTE. This is why classic x86 paging allows code on the stack to be executed, making it easier to exploit stack buffer overflows (it’s still possible to exploit non-executable stacks using return-to-libc and other techniques). This lack of a PTE no-execute flag illustrates a broader fact: permission flags in a VMA may or may not translate cleanly into hardware protection. The kernel does what it can, but ultimately the architecture limits what is possible.
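The mismatch between VMA permissions and hardware flags can be sketched in a few lines of Python. This is a deliberately simplified model for a user-space page under classic x86 paging, not the kernel's actual translation table: the only knobs are Present, R/W and U/S, so execute permission has nowhere to go and any present page is readable.

```python
PTE_PRESENT = 1 << 0
PTE_RW = 1 << 1
PTE_USER = 1 << 2


def pte_flags_for(perms):
    """Map a VMA permission string such as "r-x" to classic x86 PTE flag bits.

    Simplified model: ignores copy-on-write and assumes a user-space page.
    """
    if perms == "---":
        return 0  # no access at all: leave the page not present
    flags = PTE_PRESENT | PTE_USER
    if "w" in perms:
        flags |= PTE_RW
    return flags
```

Note that "r--" and "r-x" come out identical, which is exactly why a readable stack is also an executable one in this scheme.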

Virtual memory doesn’t store anything; it simply maps a program’s address space onto the underlying physical memory, which is accessed by the processor as a large block called the physical address space. While memory operations on the bus are somewhat involved, we can ignore that here and assume that physical addresses range from zero to the top of available memory in one-byte increments. This physical address space is broken down by the kernel into page frames. The processor doesn’t know or care about frames, yet they are crucial to the kernel because the page frame is the unit of physical memory management. Both Linux and Windows use 4KB page frames in 32-bit mode; here is an example of a machine with 2GB of RAM:

Physical Address Space

In Linux each page frame is tracked by a descriptor and several flags. Together these descriptors track the entire physical memory in the computer; the precise state of each page frame is always known. Physical memory is managed with the buddy memory allocation technique, hence a page frame is free if it’s available for allocation via the buddy system. An allocated page frame might be anonymous, holding program data, or it might be in the page cache, holding data stored in a file or block device. There are other exotic page frame uses, but we’ll leave them alone for now. Windows has an analogous Page Frame Number (PFN) database to track physical memory.
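For a feel of the buddy technique, here is the core arithmetic of a textbook binary buddy scheme in Python. This is a sketch of the general technique rather than Linux's implementation: blocks hold 2^order contiguous page frames, and a block's buddy is found by flipping a single bit of its starting page frame number. The function names are mine.

```python
PAGE_SIZE = 4096


def block_bytes(order):
    """Size in bytes of a block of 2**order page frames."""
    return PAGE_SIZE << order


def buddy_pfn(pfn, order):
    """Starting page frame number of the buddy of the block at pfn."""
    return pfn ^ (1 << order)


def merged_pfn(pfn, order):
    """Start of the order+1 block formed when a block and its buddy merge."""
    return pfn & ~(1 << order)
```

When a block is freed and its buddy is also free, the two merge into one block of the next order, which is how large contiguous runs of frames get rebuilt.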

Let’s put together virtual memory areas, page table entries and page frames to understand how this all works. Below is an example of a user heap:

Physical Address Space

Blue rectangles represent pages in the VMA range, while arrows represent page table entries mapping pages onto page frames. Some virtual pages lack arrows; this means their corresponding PTEs have the Present flag clear. This could be because the pages have never been touched or because their contents have been swapped out. In either case access to these pages will lead to page faults, even though they are within the VMA. It may seem strange for the VMA and the page tables to disagree, yet this often happens.

A VMA is like a contract between your program and the kernel. You ask for something to be done (memory allocated, a file mapped, etc.), the kernel says “sure”, and it creates or updates the appropriate VMA. But it does not actually honor the request right away; it waits until a page fault happens to do real work. The kernel is a lazy, deceitful sack of scum; this is the fundamental principle of virtual memory. It applies in most situations, some familiar and some surprising, but the rule is that VMAs record what has been agreed upon, while PTEs reflect what has actually been done by the lazy kernel. These two data structures together manage a program’s memory; both play a role in resolving page faults, freeing memory, swapping memory out, and so on. Let’s take the simple case of memory allocation:

Example of demand paging and memory allocation

When the program asks for more memory via the brk() system call, the kernel simply updates the heap VMA and calls it good. No page frames are actually allocated at this point and the new pages are not present in physical memory. Once the program tries to access the pages, the processor page faults and do_page_fault() is called. It searches for the VMA covering the faulted virtual address using find_vma(). If found, the permissions on the VMA are also checked against the attempted access (read or write). If there’s no suitable VMA, no contract covers the attempted memory access and the process is punished by Segmentation Fault.

When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame.

Things could have been different. The PTE for a swapped out page, for example, has 0 in the Present flag but is not blank. Instead, it stores the swap location holding the page contents, which must be read from disk and loaded into a page frame by do_swap_page() in what is called a major fault.
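The whole sequence can be condensed into a toy model. The Python sketch below is an illustration of the logic only, with class names and return labels of my own invention; it is nothing like the real do_page_fault(). Growing the heap touches only the VMA, and frames are handed out when a fault arrives: a blank PTE gets a fresh frame, a swap PTE gets its contents brought back, and an address with no VMA gets a segfault.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12


@dataclass
class Vma:
    start: int
    end: int
    writable: bool = True


@dataclass
class AddressSpace:
    vmas: list = field(default_factory=list)
    # virtual page number -> ("frame", n) or ("swap", slot); missing = blank PTE
    ptes: dict = field(default_factory=dict)
    next_frame: int = 0

    def _alloc_frame(self):
        frame = self.next_frame
        self.next_frame += 1
        return frame

    def brk_grow(self, heap, new_end):
        """Grow the heap: only the contract (VMA) changes, no frames are used."""
        heap.end = new_end

    def handle_fault(self, addr, write):
        """Resolve an access to addr and report what had to be done."""
        vma = next((v for v in self.vmas if v.start <= addr < v.end), None)
        if vma is None or (write and not vma.writable):
            return "segv"  # no contract covers this access
        vpn = addr >> PAGE_SHIFT
        pte = self.ptes.get(vpn)
        if pte is None:  # blank PTE: page never mapped
            self.ptes[vpn] = ("frame", self._alloc_frame())
            return "anonymous"
        if pte[0] == "swap":  # not present, but not blank either
            self.ptes[vpn] = ("frame", self._alloc_frame())
            return "swap-in"
        return "present"
```

Calling brk_grow() leaves ptes empty; only the first handle_fault() on a new page allocates anything.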

This concludes the first half of our tour through the kernel’s user memory management. In the next post, we’ll throw files into the mix to build a complete picture of memory fundamentals, including consequences for performance.

Comments

113 Responses to “How The Kernel Manages Your Memory”

  1. jpm on February 3rd, 2009 11:56 pm

    Another awesome post!

  2. Robert on February 4th, 2009 12:47 am

    Hi

    Wonderful post!

    Got a question regarding memory in Windows though. Though not stated in your post, I was wondering if you could help.

    In Windows, every process has its page directory located at virtual addr 0xC0300000 (or page table at 0xC0000000). However, the top 2GB (or 1GB) of virtual address space (0x80000000 or 0xC0000000 and above onwards) for each process actually maps to the same physical space because that is the kernel mem space (which is shared and identical amongst all processes). So how can each individual process actually have different page directory/tables?

    I’m sure i have an invalid logic somewhere, so kindly explain to me in details, if possible. Thanks

  3. Robert on February 4th, 2009 12:49 am

    Sorry if that question was out of topic….
    :)

  4. Gustavo Duarte on February 4th, 2009 1:07 am

    @jpm: thanks!

    @Robert: definitely on topic. :)

    The trick here is that the CPU actually wants a _physical_ address for the page directory, and _that_ address does change for each process.

    Specifically, the _physical_ address of the page directory must be loaded into a register called CR3. When the kernel is about to switch to a process, say notepad.exe, it will load the _physical_ address of notepad’s page directory into CR3. When it switches to Firefox, a different physical address will go into CR3.

    If you run the Windows kernel debugger (kd.exe) and type !process, it’ll print the ‘DirBase’ for the currently running process, which is the physical address of the page directory. It’ll be different for every process.

    The Windows kernel must then update the system page tables to make 0xC0300000 point to the physical address of the new page directory as well. The PTE for that address (0xC0300000) is not marked global, so that when CR3 is refreshed, its contents are purged from the TLB in case the processor had cached the old physical address for it.

    Hope this helps. Feel free to ask follow up questions, though the next answer might take a while because I’m UTC-7 and it’s 1am ;)

  5. web development on February 4th, 2009 4:50 am

    Great thanks for yet another good article. Nice work.

  6. Ryan on February 4th, 2009 5:43 am

    Great post. Thank you for the detail and the time you spent on the diagrams.

  7. Claudio Cardozo on February 4th, 2009 8:05 am

    Nice article, nice diagrams! it’s so clear! tks!

  8. Bruce Markham on February 4th, 2009 8:44 am

    Gustavo, I’ve been reading your posts for several weeks now. As a rising open-source operating system developer, I’ve found you to be very enlightening on topics that others are cryptic on.

    Keep up the great work!

  9. The Thing King : Gustavo Duarte on February 4th, 2009 10:13 am

    [...] hope the previous post explained virtual memory adequately, but I must admit I held back a much better explanation, which [...]

  10. Programming Examples: Linux Theory on February 4th, 2009 10:53 am

    [...] http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory page tags: todo-linux [...]

  11. Software Quality Digest – 2009-02-04 | No bug left behind on February 4th, 2009 12:51 pm

    [...] How The Kernel Manages Your Memory – Also by Gustavo with a follow-up about the related memory management by the kernel [...]

  12. john on February 4th, 2009 5:30 pm

    Hi Gustavo, I have been reading your internals blogs. The diagrams included in these blogs look really nice and clear. What tool did you use to generate these diagrams (.png image files)?

  13. Robert on February 4th, 2009 7:47 pm

    Hi Gustavo,

    Thanks for the prompt reply (for post no.2). That is indeed clear and concise. I’ve got afew follow up questions, hope you dont mind! ;)

    For Windows, since 0xC0300000 / 0xC0000000 differs for each process, so am i right to say that the kernel mem space (upper 2GB) for all processes are actually not identical? My impression was that the kernel mem space for all processes are mapped to the same physical addr… perhaps not for Windows but for Linux only (as quoted from “Anatomy of a Program in Memory” -> In Linux, kernel space is constantly present and mapped to the same physical memory in all processes.)

    Another question for Windows: should say, for example, during runtime process A loads a kernel driver (eg SysInternals tools), will this driver code actually be in the kernel address space of other processes as well? Will the stack data of the kernel also be the same across all processes?

    Thanks for the help~!

  14. wartalker on February 4th, 2009 8:07 pm

    wonderful!!

  15. links for 2009-02-04 « My Weblog on February 4th, 2009 9:01 pm

    [...] How The Kernel Manages Your Memory : Gustavo Duarte (tags: linux architecture cs) [...]

  16. Arthur on February 5th, 2009 1:12 am

    Great article. I’m a little bit curious about the page size. You mention that “Both Linux and Windows map the user portion of the virtual address space using 4KB pages.” — what are some situations where bigger page sizes (2MB, 4MB) are used (if any)?

  17. Gustavo Duarte on February 5th, 2009 2:18 am

    Thank you all for the feedback.

    @John: It’s MS Visio 2007, here’s more info:

    http://duartes.org/gustavo/blog/post/quick-note-on-diagrams-and-the-blog

    @Robert: you’re welcome. Here are the answers on the other questions. Windows internals is a bit tougher due to the lack of source code, but I think this is pretty accurate.

    1. You’re 100% right – kernel space in Windows is not identical for every process. Not only is there this page directory issue, but also in Windows ‘session space’ is mapped in kernel space. So processes in different sessions (ie, processes on different Remote Desktop connections) will have different session spaces, hence differences in their kernel space mappings. Two processes in the same session though would have the same session space.

    In x86 Linux the mappings for kernel space _are_ identical in all the processes. They are built during boot (see here) and stored in swapper_pg_dir (that’s an extern, not the actual declaration). Since it’s Linux, processes are forked (copied) when they’re being started, so the copy_process() function is called, which copies the memory descriptor, which eventually clones physical addresses of page tables in swapper_pg_dir into the new process’ page tables. Wheew! The result is simple though: shared kernel space for everyone.

    Regarding the driver, yes, visible to everybody.

    Regarding stack, no. Each thread has its own kernel-mode stack. But this can be arranged by manipulating the stack pointer to different virtual addresses, so no page table hacking needed. Having per-thread kernel stacks is important, because when a thread goes to sleep in the middle of a system call (say, waiting on disk) its state must be preserved, and that’s basically the stack (and registers, which get pushed onto the stack).

    @Arthur: good catch :)

    It turns out that Linux maps kernel space using 4MB pages. Windows also does it, depending on the amount of memory in the box. This is mainly for performance, to save entries in the TLB cache, which caches the physical address associated with a page. By using 4MB pages, the kernel uses up fewer TLB entries, and dirties fewer of them when handling syscalls and interrupts.

  18. sesedi on February 5th, 2009 5:04 am

    Nice article, and I really want to know what program do you use to draw these amazing diagrams?

  19. Nix on February 5th, 2009 7:07 am

    sesedi: this comes up in almost every post, I guess people like the diagrams ;) it’s Visio.

    (You might want to try Inkscape at some point. It’s far nicer to use than the dire dia.)

  20. Gustavo Duarte on February 5th, 2009 9:32 am

    @Nix: thanks for the suggestion. I’ll give it a shot.

  21. dev-interview on February 5th, 2009 11:41 am

    Nice article and excellent writing style.

  22. How The Kernel Manages Your Memory | www.pwnage.ro on February 5th, 2009 12:35 pm

    [...] How The Kernel Manages Your Memory [...]

  23. triton on February 5th, 2009 10:02 pm

    Excellent article. I really enjoy reading them. Thanks!

    Can’t wait for Part 2. :)

  24. Gustavo Duarte on February 5th, 2009 11:29 pm

    thank you guys for the feedback.

    @triton: I’m getting pretty fast at cranking these out, I should have some time over the weekend ;)

  25. Weekly linkdump #162 - max - блог разработчиков on February 6th, 2009 12:28 am

    [...] Lots of letters, numbers and pretty pictures about Linux internals, How The Kernel Manages Your Memory : Gustavo Duarte [...]

  26. macosx on February 9th, 2009 2:09 am

    Well done. I have learned a lot here. Keep it up!

  27. siddharth on February 10th, 2009 12:30 am

    Hi Gustavo,
    Man ,you are just too good.Your diagrams and explanation have cleared many of my doubts regarding virtual memory,memory mapping.
    Looking forward to reading many more articles in the future.

    Thanks
    Siddharth

  28. Page Cache, the Affair Between Memory and Files : Gustavo Duarte on February 10th, 2009 11:53 pm

    [...] we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often [...]

  29. Gustavo Duarte on February 11th, 2009 8:52 am

    @siddharth: cool, I’m glad it helped!

  30. Majid on February 12th, 2009 8:45 pm

    Hi Gustavo,

    your articles are interesting indeed.
    I have a problem, when a kernel is loaded and enabled paging, all memory addressing is per virtual memory addressing among kernel itself. So when the kernel calls it’s functions or uses it’s memory area (variables and etc) it must follow new addressing model (virtual), so some things must do in boot loader like load kernel on, say, 0x100000 physical memory address but, say, at 0xC0000000 as virtual address, now if kernel switch to paging, it must map 0x100000 to 0xC0000000 to keep on it’s running. How can I doing it? and how manage kernel’s memory? and when paging is enabled we must yet use CS, DS and SS for code, data and stack positions respectively?
    Can you explain it in detail, please?

    Thanks a lot for your attention.

  31. Gustavo Duarte on February 12th, 2009 11:17 pm

    Majid,

    That’s a great question. I’ll give you some pointers, but I’m short on time.

    During the boot process, the kernel builds page tables before enabling virtual memory in the CPU. The initial paging tables map two virtual ranges onto the first 8 megabytes of physical memory. The code is here. The first range is called the identity mapping and it maps virtual addresses 0-8MB (not sure on the 8MB, but something like that) onto physical addresses 0-8MB. The second virtual range maps another ~8MB starting at virtual address 0xc0000000 onto the 0-8MB of physical memory.

    That way, immediately after the paging is enabled, things are still working OK thanks to the identity mapping. The kernel can then jump to the 0xc0000000 addresses and it will also work thanks to the second range. It then calls a function called ‘zap_low_mappings’ or something like that to blow away the identity mapping. Everything is up and running as far as paging goes by then.

    Regarding cs, ds, and ss, they are already active even before paging is turned on. But they use ‘flat model’ so they are basically transparent. They all start at 0 and span 4GB, so all logical addresses are the same as linear addresses, it’s as if segmentation does not exist. These are set up BEFORE paging is enabled though, they are independent of paging.

    Hope this helps. I’m going to be away for a few days, but feel free to post more questions and I’ll eventually get to them.

  32. Majid on February 13th, 2009 3:42 am

    Thanks a lot Gustavo, your explain is really clear and fluent whereas it is concise.
    That is very useful for me.
    Thanks again for your attention and helps.

  33. Majid on February 17th, 2009 8:50 pm

    Hi Gustavo,

    Excuse me for my (probably) bad english.
    When we want to allocate new page frame, we must fill related page table’s entry, so we must access to page table memory address with it’s virtual address. What is your idea to do that? create an entry to link itself, so we can access it as if normal page, or another idea?

    Thanks

  34. Gustavo Duarte on February 18th, 2009 8:54 am

    Hi Majid,

    When the kernel allocates memory to store a page table, it asks the allocator for memory that is always mapped by the kernel space (in x86 Linux’s case, this means physical memory between 0 and 896MB, which is mapped starting at 0xc0000000). Hence page tables for processes can always be reached with kernel-space virtual addresses.

    The pgd pointer in the mm_struct points to the first level of page tables (the page directory), so the kernel can just use the pointer to access it. To reach other levels of the page table, the kernel may have to calculate which linear address corresponds to the physical address of the page table, but this is easy to do because the first 896MB are mapped continuously in kernel space, so it’s a simple operation to go from physical address to virtual address.
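The conversion described in the reply above is a fixed offset. A minimal Python sketch, assuming the 3GB/1GB split with kernel space starting at 0xc0000000 and the first 896MB of physical memory mapped linearly (the function names are mine):

```python
PAGE_OFFSET = 0xC0000000   # start of kernel space in the 3GB/1GB split
LOWMEM_LIMIT = 896 << 20   # physical memory mapped linearly: 0 to 896MB


def phys_to_virt(paddr):
    """Kernel virtual address for a physical address in the linear mapping."""
    if not 0 <= paddr < LOWMEM_LIMIT:
        raise ValueError("physical address is outside the linear mapping")
    return paddr + PAGE_OFFSET


def virt_to_phys(vaddr):
    """Physical address behind a kernel virtual address in the linear mapping."""
    if not PAGE_OFFSET <= vaddr < PAGE_OFFSET + LOWMEM_LIMIT:
        raise ValueError("virtual address is outside the linear mapping")
    return vaddr - PAGE_OFFSET
```

So a page table sitting at physical address 0x00100000 is reachable by the kernel at 0xC0100000.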

  35. Ya-tou & me » Blog Archive » Page Cache, the Affair Between Memory and Files on February 19th, 2009 1:38 am

    [...] we looked at how the kernel manages virtual memory for a user process, but files and I/O were left out. This post covers the important and often [...]

  36. littlekernel on March 3rd, 2009 7:32 am

    Hi, I like your blog! you reply to Majid in #34:
    “(in x86 Linux’s case, this means physical memory between 0 and 896MB,
    which is mapped starting at 0xc0000000).”

    I have a question: In order to save physical memory, kernel and process
    may map same virtual address to one physical address. For example:

    0xc500 0000 -> 0x0500 0000
    0x4567 0000 -> 0x0500 0000

    Assume that kernel modifies the contents at virtual address 0x4567 0000,
    that is, the position at physical address 0x0500 0000, then the user
    process will see this change!

    Now that kernel maps the physical memory between 0 and 896MB to 3G+
    virtual address space , where the user process will map its virtual
    address? I mean, the position of its corresponding physical address.

    the first 896MB physical memory belongs to kernel, right?
    then where a user process map its virtual address?
    If the process and the kernel share the first 896MB physical memory,
    How they avoid colliding?

    this question haunting me for a long time, thank you for your help!

  37. littlekernel on March 3rd, 2009 7:34 am

    sorry ,
    Assume that kernel modifies the contents at virtual address 0x4567 0000,
    should be
    Assume that kernel modifies the contents at virtual address 0xc500 0000,

  38. sxg on March 11th, 2009 1:50 am

    Hi Gustavo Duarte,
    Nice post, very much informative! For the first time in my life, I understand the internals of the memory management, thank you! And I got a question here: where and how does the linux kernel maintains the physical page frames? Can you give some information about physical page frames’ initialization and maintenance in the kernel?

  39. Gustavo Duarte on March 23rd, 2009 6:39 pm

    @littlekernel,

    Sorry about the delay.

    So, the kernel does _not_ take the first 896MB of physical memory. It only takes the last 1GB of _virtual space_ to use it to map whatever physical memory it wants. In Linux’s case, the kernel uses this last gigabyte to map the first 896MB of _physical_ memory and then uses the remaining 128MB of virtual address space to map things dynamically and for other special-purpose mappings.

  40. Gustavo Duarte on March 23rd, 2009 6:45 pm

    @sxg: you’re welcome.

    Regarding the page frame stuff, they’re kept in an array of structs of type struct page. The array itself is called mem_map. In computers where RAM access is uniform to all processors (uniform memory access) we usually have a single huge mem_map array with all the page structs in it. This is the case for most Intel processors prior to the Core i7. AMD processors, and Intel now with the Core i7, are NUMA – non uniform memory access, where processors are attached directly to memory. In this situation you have multiple mem_map arrays to describe the physical memory attached to each processor.

    I’ll write more about this topic, but I’m in a bit of a hiatus right now, I’ll post shortly on what’s going on.

  41. Rohit Banga on March 25th, 2009 2:16 am

    great set of posts!
    i want to write a kernel module to print the page table of a particular process and access physical memory address to read the value of a variable of another process. i have played a little with kernel modules (like changing value of runnable to 0 for init_task, changing pid of all processes…resulting in flooding of kernel error messages ), but i am unable to proceed with this program. could you help me?

  42. Gustavo Duarte on March 26th, 2009 5:34 pm

    @Rohit: I can help some. However, it’s a bad time for me right now, because I started writing a blog / publishing engine to run the blog on, so that’s taking all my free time. Feel free to contact me directly via email or here via the comments.

  43. iceui on April 12th, 2009 4:12 pm

    Great post!
    I’m working on something protection-related and have a question: how to set one area (say, 4kB in size, or 4MB, or 8MB, etc. in kernel space) to be read-only or non-executable or non-writable for kernel itself? Thanks.

  44. Gustavo Duarte on April 14th, 2009 1:46 pm

    iceui: Are you thinking of it in terms of the Linux kernel API for these things, or in terms of what such a page table entry would look like in x86 processors?

  45. iceui on April 14th, 2009 11:55 pm

    I’m thinking of using my driver in kernel space to do this…similar to kernel API…to protect other driver/API to read/write/exec some kernel area.

    As you point out, Linux maps kernel space using 4MB pages. Does it mean Kernel virtual space has, say, 1GB/4MB==256 pages? Where are they? Can we manipulate those kernel page table entries in terms of R/W, D, A, P, etc? http://static.duartes.org/img/blogPosts/x86PageTableEntry4KB.png

    We also know Linux maps kernel virtual address 0xC0000000 to physical address 0x00000000…seems only one kernel page mapping available and we cannot modify any R/W, D, A, etc.?

    I do appreciate if you have time to clarify my doubts. Thanks!

  46. CNN on April 15th, 2009 12:11 am

    Gustavo Duarte, we are eager to read your posts about the dll.

  48. deepak v m on April 16th, 2009 5:46 am

    sir i want source code for static,stack and heap memory allocation done in graphics (whole source code) please send it to my mail .
    i requeat you please to send me the same. ok thanking you

  49. Johnny on April 23rd, 2009 12:54 am

    I really like your graphical representations, your pictures are even better than some books, Great!

  50. Narayanan on April 26th, 2009 12:05 am

    hi Gustavo,

    Nice work. please continue your posting.

    I need clarification on swappable and now swappable pages in the memory, I ve heard that certain pages are not swappable. can you please explain on that..? .

    Are you going to post on Linux Block I/O .? I m very interested on it.

  51. Gustavo Duarte on April 26th, 2009 1:18 am

    @Narayanan: thanks for the feedback. I’ll definitely continue posting. To be honest, I took a little detour writing software to post _with_. That’s why the blog has been quiet. I’ll give an update on this stuff soon, but in short, there will definitely be more content.

  52. Joel on June 18th, 2009 6:37 am

    Hi Duarte,
    I have read and re-read your articles several times, they’re just like story-telling, Thank you.

    I have a burning question: The kernel has a page structure that describes a physical page in memory. But this description does not including the physical address of the page. How is this retrieved? (I understand pte structures have the physical address data but how we reference a pte structure from a page structure?)

    Also I understand that mem_map is a global variable that is a list of all pages in physical memory (correct me if I’m wrong). But, what use this array if we can’t figure out the physical address of a page?

    Thanks!
    Joel

  53. Ronaldo on July 14th, 2009 2:25 am

    I have read your post. Its excellent.

    I am trying to understand the concept of virtual memory. But however hard I try.. I still dont get it. My bsaic question is as follows:

    Virtual address from 0x00000000 to 0x00300000 are mapped to physical addresses 0x00000000 to 0x00300000. ie we have an identity mapping, by means of page tables some where in physical memory whose virtual addresses are not known.

    Now my physical frame allocator says that a page in physical memory starting at addresses 0x00400000 is free ie page frame 1024 is free.

    How can I now access this page

  54. naveen on July 24th, 2009 3:24 am

    Great thanks Gustavo Duarte…this article is excellent..

    Can you please give us some reference to good websites or blogs that deals with in depth about Kernel core.

  55. srikanth on August 2nd, 2009 8:23 pm

    HI Gustavo Duarte,

    I do not why you are taking so much effort to explain such excellent articles. I really appreciate your efforts.

    Thank you very much …….

  56. Rohit Banga on August 27th, 2009 3:04 am

    Hello
    Gustavo, are you done with your blog engine? Doing it alone!
    It seems you are writing well-engineered code; that is why it is taking so much time.

  57. Atul Deshmukh on September 9th, 2009 7:05 am

    Hi Gustavo,
    I must appreciate your great work. Actually, putting things together with examples and source code is exactly what one hopes for while studying Linux kernel internals.
    Being a Linux freak, I started to read your article on memory management and couldn’t stop myself from reading further. Great work indeed.

  58. Atul Deshmukh on September 9th, 2009 7:13 am

    Hi Gustavo,
    I agree with Joel.
    The high memory stuff, in which page descriptors are allocated instead of pages since those pages don’t have a corresponding logical kernel address, is missing from your article.
    I hope you will come up with it.
    Anyway, kudos for your great work.

  59. bnousilal on October 10th, 2009 8:37 pm

    hi sir………..
    nice work,really fantastic and awesome work, please continue your posting…..

  60. Shammi on November 1st, 2009 11:13 pm

    Hello Gustavo,

    I recently found this blog. It is very helpful for me to understand some basic things. I really appreciate your time for this.

    I have some basic doubts, which I think you can easily answer.

    1) If we have the same shared library (e.g. libc.so) loaded by two or more processes, will the shared library’s VM address be identical or different in each process? As far as I know, it will be loaded into physical memory only once.

    2) Is the entire VM address space of a process created during process creation, or is it also created dynamically? E.g. if I am creating a process p1 (which uses lots of shared libraries), is the whole VM address space (say it uses 2GB) allocated dynamically or statically during process creation?

    Can you please address my queries?

    Regards,
    Shammi

  61. Joel Fernandes on November 23rd, 2009 12:21 am

    Hi Shammi,

    There are different sections that are required for every process, and a separate VM area struct is created for each of them. VM addresses are allocated for these sections by default. Access to any other virtual address results in a fault. The process of course has all of its virtual memory to itself (no one else can take it), but to make more of it accessible dynamically you can do it in several ways: malloc to allocate memory on the heap, grow the heap with the brk system call, or memory map a file.

    Hope this answers your question.
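
    The two lower-level routes mentioned above, growing the heap and mapping memory, can be sketched in user-space C. This is only an illustration assuming a Linux/glibc system; the helper names map_anonymous and grow_heap are made up for this example, and the sketch maps anonymous memory rather than a file to stay short.

```c
#define _DEFAULT_SOURCE          /* expose sbrk() and MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of anonymous, zero-filled memory.
 * The kernel creates (or extends) a VMA to cover the new range.
 * Returns NULL on failure. */
static inline void *map_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: grow the heap segment by `len` bytes by moving
 * the program break. Returns the old break (the start of the new
 * region), or NULL on failure. */
static inline void *grow_heap(size_t len)
{
    void *old_break = sbrk((intptr_t)len);
    return old_break == (void *)-1 ? NULL : old_break;
}
```

    In both cases only virtual addresses are handed out at first; page frames are allocated later, when the pages are first touched and fault in.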

  62. Joel Fernandes on November 23rd, 2009 12:27 am

    Hi Shammi,

    Answering your first question, the shared library virtual addresses could be different (shared libraries are compiled as position-independent code so they can be relocated anywhere in virtual memory) – the dynamic loader takes care of resolving symbols at runtime. The shared library’s physical memory is allocated only once and shared by all other processes (think about how many different copies of the C standard library there would be otherwise :) ).

  63. Lin yujing on December 16th, 2009 2:02 am

    Wonderful!
    I have a question: there are only 20 bits for the physical address. How can that represent a physical memory address?

    Thank you!

  64. leal on December 30th, 2009 10:53 pm

    Hi Lin,
    The kernel manages physical memory in units of 4KB, which is 2^12 bytes, so the remaining 20 bits are enough to address every one of these 4KB units on a 32-bit platform.
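
    The arithmetic in this reply can be written out as a short C sketch. It assumes 32-bit physical addresses and 4KB pages (no PAE); the names PAGE_SHIFT, frame_to_phys and phys_to_frame are chosen for illustration and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                    /* 4KB pages: 2^12 bytes each */
#define FRAME_BITS (32u - PAGE_SHIFT)     /* 20 bits left for the frame number */

/* Physical address at which a given page frame starts. */
static inline uint32_t frame_to_phys(uint32_t frame_number)
{
    assert(frame_number < (1u << FRAME_BITS));   /* must fit in 20 bits */
    return frame_number << PAGE_SHIFT;
}

/* Number of the page frame that contains a given physical address. */
static inline uint32_t phys_to_frame(uint32_t phys_addr)
{
    return phys_addr >> PAGE_SHIFT;
}
```

    With these definitions, frame 1024 starts at 0x00400000, and the highest 32-bit address, 0xFFFFFFFF, lies in frame 0xFFFFF, the largest value 20 bits can hold.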

  65. Interesting Reading… – The Blogs at HowStuffWorks on January 11th, 2010 11:09 am

    [...] How The Kernel Manages Your Memory – “After examining the virtual address layout of a process, we turn to the kernel and its mechanisms for managing user memory…” [...]

  66. sagar bansal on March 4th, 2010 2:18 pm

    I need to know the exact procedure for what happens when a page is selected to be swapped out. Where is the entry made in the PTE? At a low level, what struct is used?
    Please help me out.

  67. Aman on April 4th, 2010 1:42 pm

    Hi Gustavo!

    I have a few points of confusion that are really bothering me:

    - Files get mapped and unmapped in the “memory map segment” of a process. Even worse, different portions of one file can get mapped at different locations in the mmap segment. Now consider that our memory map segment has consumed 1MB of virtual address space, and a file segment from 1k – 20k gets unmapped. Now we have a fragmented mmap region. How is fragmentation in the memory map segment handled?
    - One base pointer of the mmap segment is present in the picture, but there is no pointer to the top of the mmap segment. How do we know how much of the mmap segment’s virtual memory has been consumed?

    Regards,
    Aman

  68. Mandar on April 14th, 2010 3:37 am

    Have a question very similar to that by “littlekernel”. When Linux maps its kernel virtual address space (starting from PAGE_OFFSET) to 896 MB of physical memory, it essentially creates page tables that map all the page frames in this 896 MB area, correct? Of course, the kernel itself consumes only a few megabytes of physical memory, but because all the addressable RAM is mapped into the 1 GB kernel virtual address space, how do we map 3 GB of user virtual address space? Every page frame referenced by a process page table (via one of the first 768 entries in the page global directory) will also have a corresponding entry in the kernel page table (the remaining 256 or so entries) because that page frame has already been mapped into the kernel address space during boot time. Am I missing something?

  69. eko didik widianto on June 6th, 2010 4:22 am

    Thank you very much.

    Keep on your good work.

  70. Sakthikumar Rajaram on June 10th, 2010 5:18 am

    Hi Gustavo,

    I am Sakthikumar. I am working at Motorola.

    I have a doubt.

    You said a driver written in Windows will be visible to all the user processes.

    Is it possible for that driver to access the details of all the user processes (data, heap, stack, BSS, memory mapping) and encrypt them before they are sent to RAM page frames?

    Waiting for your reply.

    Regards
    Sakthikumar

  71. Hugho on July 2nd, 2010 9:57 am

    Hello,

    Thanks for explaining PDEs. But I have a problem which I still can’t solve. I wrote a kernel driver for accessing the PDEs and PTEs at 0xC0600000 (Win 7 PAE) and dumped the memory with a small kernel routine. The result is that there is nothing but zeros at that address, and even at 0xC0000000. With LiveKd & WinDbg (local kernel debug) I get the correct entries at this memory location. Would you please tell me what I’m doing wrong? Thanks for any answers!

    Here’s the src of my kernel driver:

    include
    #include "memory.h"

    #define _PDTP_MASK 0xC0000000
    #define _PDE_MASK 0x3FE00000
    #define _PTE_MASK 0x001FF000
    #define _POFF_MASK 0x00000FFF

    #define _PDTP_SHIFT 30
    #define _PDE_SHIFT 21
    #define _PTE_SHIFT 12

    #define _SYSTEM_PDP_BASE 0xC0300000
    #define _SYSTEM_PDP_BASE_PAE 0xC0600000

    #define _PTE_SIZE 8

    #pragma pack(1)
    typedef struct
    {
    unsigned int Valid : 1;
    unsigned int Dirty1 : 1;
    unsigned int Owner : 1;
    unsigned int WriteThrough : 1;
    unsigned int CacheDisable : 1;
    unsigned int Accessed : 1;
    unsigned int Dirty : 1;
    unsigned int LargePage : 1;
    unsigned int Global : 1;
    unsigned int CopyOnWrite : 1;
    unsigned int Unused : 1;
    unsigned int Write : 1;
    ULONG PageFrameNumber : 26;
    } __MMPTE, *__PMMPTE;

    #pragma pack()

    static ULONG64 PagePTE;
    static PHYSICAL_ADDRESS PhysAddr;
    static ULONG64 *pllu, llu;
    static PMDL pMyMDL= NULL;

    void DumpMem( UCHAR* pos, ULONG Size)
    {
    ULONG i, a;

    for( i= 0; i <= Size; i+=16)
    {
    DbgPrint("0x%08lX:\t", (ULONG) pos );
    for(a=0; a > (ULONG) _PDTP_SHIFT) * (ULONG)_PTE_SIZE;
    PDTP= ((ULONG) temp + (ULONG) + 0xC0600000 );
    DbgPrint( "Temp:\t0x%08lx\n", temp );
    DbgPrint( "PageDirTablePointerEntry:\t0x%08lx\n", PDTP );
    _asm
    {
    cli
    };
    DumpMem( 0xC0600000, 32 );
    _asm
    {
    sti
    };
    PDTPE= (__PMMPTE) PDTP;
    pllu= (PULONG64) PDTP;
    pMyMDL= MmCreateMdl( NULL,
    0xC0000000,
    4096 );
    MmBuildMdlForNonPagedPool( pMyMDL );
    pMyMDL->MdlFlags = pMyMDL->MdlFlags | MDL_MAPPED_TO_SYSTEM_VA;
    pllu= (PULONG64) MmMapLockedPages( pMyMDL, KernelMode );
    DbgPrint("Content:\t0x%llx\n", *pllu);
    DumpMem( (UCHAR *) 0xC0000000, 32 );
    MmUnmapLockedPages( pllu, pMyMDL );
    IoFreeMdl( pMyMDL );

    // if( PDTPE->Valid != 1 )
    // {
    // DbgPrint( "is not in memory\n" );
    // return (status= STATUS_UNSUCCESSFUL );
    // }
    DbgPrint("is in memory\n");

    return status;
    }

    void OnUnload( IN PDRIVER_OBJECT DriverObject )
    {
    }

    NTSTATUS DriverEntry( PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath )
    {
    int i;
    NTSTATUS status;
    DriverObject->DriverUnload= OnUnload;
    GetPagePTEAddress( 0x8e221000 );
    for( i= 0; i < 100000; i++ );
    status= STATUS_SUCCESS;
    return status;
    }

    In my dump I get only zeros at 0xC0600000, but with LiveKd I get
    the correct page table entries!

    ||0:lkd> dc 0xC0600000
    c0600000 a8fed867 00000000 94235867 00000000 g.......gX#.....
    c0600010 0783c867 00000000 0d1f4867 00000000 g.......gH......
    c0600020 2c89a867 00000000 306a4867 00000000 g..,....gHj0....
    c0600030 00000000 00000000 00000000 00000000 ................
    c0600040 00000000 00000000 00000000 00000000 ................
    c0600050 00000000 00000000 373d6867 00000000 ........gh=7....
    c0600060 42b93867 00000000 1aca3867 00000000 g8.B....g8......
    c0600070 00000000 00000000 00000000 00000000 ................

  72. Hugho Deferell on July 2nd, 2010 10:43 am

    Hello,

    why did you delete my post?

  73. Hugho Deferell on July 2nd, 2010 10:50 am

    Hmmm, Ok, I post it again :-|

    I wrote a kernel driver for accessing the page directory / page tables on Windows 7 (PAE, 32-bit). In this driver I put a small memdump routine for dumping memory, e.g. at 0xC0600000. But as a result I got ONLY zeros in the dump at 0xC0600000 and 0xC0000000. LiveKd & WinDbg (local kernel debug) show me the correct result. What am I doing wrong?

  74. duncan on December 7th, 2010 7:15 am

    How can I find all of a process’s memory areas?
    I run Firefox, and then I execute a system call to get its virtual addresses:

    for_each_process() -> mm -> mmap -> vm_start & vm_end -> a loop to get the start and end of every entry...
    But how do I get this process’s address in kernel space?

  75. Volatilitux : Physical memory analysis of Linux systems | Segmentation fault on December 7th, 2010 5:09 pm

    [...] extracting the virtual memory map of each one. I learned a lot reading the kernel source code, some blog posts, and finally the challenge solutions when they got released. And I understood the biggest problem [...]

  76. Volatilitux : Physical memory analysis of Linux systems | Linux-backtrack.com on January 31st, 2011 1:29 pm

    [...] extracting the virtual memory map of each one. I learned a lot reading the kernel source code, some blog posts, and finally the challenge solutions when they got released. And I understood the biggest problem [...]

  77. Linux x86 ShellCodes – 101 « "xcdx80" on March 3rd, 2011 7:04 am

    [...] está no endereço 0 de memória. A forma como resolvemos isso é linkando nossas instruções com “a realidade do sistema”, para tal usamos o comando [...]

  78. Cómo el kernel maneja tu memoria | unblogged on March 16th, 2011 7:34 am

    [...] post que a los nardos tech (y no tanto) les va a parecer genial: cómo administra el kernel la memoria en tu máquina. Explicación simple y [...]

  79. Bagaimanakah Kernel Mengatur Memory | Log Belajar on March 16th, 2011 8:55 am
  80. Matt Joiner on March 16th, 2011 8:26 pm

    Your articles kick ass!! Keep it up.

  81. Linux Kernel 2.6.38 Released with Best Improvement : Ranjith Siji – Programming the Web on March 17th, 2011 3:15 am

    [...] the Iomega iConnect running Linux Debian (diogogomes.com)Linux 2.6.38 Released (linux.slashdot.org)How The Kernel Manages Your Memory (duartes.org) [...]

  82. Nitin on March 17th, 2011 11:56 pm

    Great article !

  83. Sysadmin Sunday #23 « Boxed Ice Blog on March 20th, 2011 8:59 am

    [...] How The Kernel Manages Your Memory [...]

  84. A collection of articles on virtual memory « kwoz online on March 22nd, 2011 8:53 am

    [...] Here is a rather interesting article on virtual memory from operating system’s point of view. It appears to link to other posts on similar topics, including processor privileges. Low-level but could be handy. [...]

  85. Linux x86 Shellcoding – 101 « "xcdx80" on April 16th, 2011 2:39 pm

    [...] address assigned to our Entry Point "00000000". To solve this we must "link" our instructions with "Linux reality", we can do this using the "ld" [...]

  86. مجله هکر ماهیانه – شماره ۱۲ – می ۲۰۱۱ on May 15th, 2011 8:46 pm

    [...] اگر لازمه توضیح بدم، یعنی به این مقاله نیازی ندارید (: – کرنل چگونه حافظه را مدیریت می‌کند. – چرا هیچگاه نباید بپرسید که آیا اجازه دارید کد را [...]

  87. Prabagaran on May 29th, 2011 11:05 am

    Awesome Awesome post

  88. Rajat on July 21st, 2011 11:05 pm

    Hi Gustavo,

    In Linux, I’m curious about the case when there is only 512MB of actual RAM. As I understood it, the kernel space’s 896MB is always statically physically mapped and is unswappable. How does the kernel map 896MB onto 512MB if it is supposed to be non-swappable?

    Thanks!

  89. Mars on September 27th, 2011 4:47 am

    Really awesome!! I like this article and the way you described.
    Thanks a lot. :)

  90. How The Kernel Manages Your Memory : Gustavo Duarte | FRANCESCO DI FUSCO on October 31st, 2011 1:46 am

    [...] How The Kernel Manages Your Memory : Gustavo Duarte. [...]

  91. ajeet on November 5th, 2011 7:46 am

    what the code of creating a project the lazy buddy algorithm in unix

  92. Setzer on November 30th, 2011 9:51 pm

    Great article Gustavo!!! Just curious, what software did you use to draw the diagrams?

  93. Ganesh on December 4th, 2011 11:31 am

    Very nice. I loved the explanation.

    You are my Guru

  94. Raghu on January 13th, 2012 5:47 pm

    I have a basic question about virtual to physical address translation in the kernel. The virtual addresses are split 3:1 between userspace and kernel. So, when the virtual address is in the first 3GB, the page tables of that userspace process are used to find the corresponding physical address. But when we are in the kernel and want to access a kernel virtual address (in the 4th GB), do we still go through paging, or do we always use the identity map (__pa and __va)? If we go through paging, can you explain how the kernel itself can find the physical address for its virtual address? Does the cr3 register change on the context switch from user to kernel?

    Thanks,
    Raghu

  95. Amit on February 6th, 2012 11:17 am

    Wonderful!
    The best article I’ve found… thank you.

  96. rakesh on March 13th, 2012 3:12 am

    i am great fan of ur articles, thx for making me understand all this stuff, grt work

  97. Robert on April 14th, 2012 6:59 pm

    Another awesome post! I like it and thanks for your effort in making these wonderful

  98. Gaurav on May 4th, 2012 2:07 am

    Hi Gustavo

    The links to subscribe to your blog are not working currently. Instead clicking on it starts showing the source-code of web-page.

  99. Gustavo Duarte on May 4th, 2012 5:17 am

    @Gaurav: fixed now, thanks for letting me know.

  100. Facundo on May 14th, 2012 9:28 pm

    Gustavo, it’s awesome. An incredible post. Thank you!

  101. Thang Le on June 3rd, 2012 7:24 pm

    Hi Gustavo,

    In this article, you mentioned that the processor consults page tables to translate a virtual address into a physical memory address.
    But with the first 20 bits in the PTE, we can only find the respective page frame number in physical memory. So I want to know where we get the “offset” within a page frame in physical memory.

    Regards,
    Thang Le

  102. pisces on June 11th, 2012 3:36 pm

    @Thang Le:
    The offset is what you write in your program: the low 12 bits of the virtual address.
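
    As a small C sketch of this answer: the top 20 bits of the PTE give the base of the page frame, and the low 12 bits of the virtual address are carried over unchanged as the offset. It assumes 32-bit x86 without PAE, 4KB pages, and a PTE that is present; the function name translate and the sample values are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT  12u
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)    /* 0x00000FFF */

/* Build a physical address from a PTE and the virtual address being
 * accessed. The PTE's low 12 bits hold flags and are discarded here. */
static inline uint32_t translate(uint32_t pte, uint32_t vaddr)
{
    uint32_t frame_base = pte & ~OFFSET_MASK;    /* top 20 bits: frame base */
    uint32_t offset     = vaddr & OFFSET_MASK;   /* low 12 bits: offset in page */
    return frame_base | offset;
}
```

    For example, a PTE of 0x12345067 and a virtual address of 0x08048ABC give the physical address 0x12345ABC.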

  103. Thang Le on June 13th, 2012 7:29 am

    @pisces: Thanks for your answer

    Thang Le

  104. Mallesh Koujalagi on June 25th, 2012 12:19 am

    Great article. I got a clear picture of how Linux manages memory. Thanks for such a brief post.

  105. สุดยอด blogger และนักเขียนไอทีในความคิดของผม « Soowoi's Blog on June 26th, 2012 6:56 pm
  106. Daniel on July 22nd, 2012 12:21 am

    Great article, thanks a lot. One question remains, though: Why can’t I Flattr this?

  107. Jagannath Pattar on October 8th, 2012 1:29 am

    Fantastic presentation of memory management. Lovely and explanatory figures which help you understand the concept neatly.

  108. Handling Heap AllocationsVeda Solutions on February 6th, 2013 12:29 am

    [...] When a VMA is found the kernel must handle the fault by looking at the PTE contents and the type of VMA. In our case, the PTE shows the page is not present. In fact, our PTE is completely blank (all zeros), which in Linux means the virtual page has never been mapped. Since this is an anonymous VMA, we have a purely RAM affair that must be handled by do_anonymous_page(), which allocates a page frame and makes a PTE to map the faulted virtual page onto the freshly allocated frame. see here for detailed information on How kernel Memory [...]

  109. Deekshitha on March 20th, 2013 11:52 pm

    I got a clear picture of how memory is distributed and how addresses are fetched from the table.
    It would be helpful if you could describe how the GOT (Global Offset Table) is laid out and how it is accessed.

  110. murugank on March 21st, 2013 6:07 pm

    Great Article..

  111. Thang Le on March 28th, 2013 3:51 am

    One thing confuses me: how is it managed if the page size is 2MB/4MB and the frame size is still 4KB?

    Thanks
    Thang Le

  112. Simon on April 21st, 2013 9:39 pm

    Hi, thanks very much for your awesome posts!
    I have one question.
    Within one process, every thread has its own stack.
    What is the layout of the threads’ stacks relative to the process’s stack?
    Looking forward to your reply!

    Many thx!

  113. emmanuel on May 3rd, 2013 11:06 am

    I can recommend.
