HOT-ADD MEMORY ARCHITECTURE AND DESIGN


Version 1.0

Revision History

Revision  Comments
0.1       Initial Draft
0.2       Added MHP driver information
0.3       Added Packaging information
0.4       Minor changes
0.5       Moved ACPI to Appendix
0.6       Minor fixes
0.7       Minor fixes
1.0       Final version

1.0    Introduction

This is an architecture and high-level design document for the Linux Hot-Add Memory feature. The aim of the project is to add support to the Linux kernel so that physical memory can be added to a system without rebooting the machine. This project will not support hot-removing or replacing physical memory.

This document details the changes needed in various Linux subsystems to add this feature. It defines the interfaces, lists the important data structures that will be modified, and states the dependencies on the BIOS and hardware.

1.1    Glossary

HAM  Hot-Add Memory 
MHP  (Hardware specific) Memory Hot Plug Driver 
PAE  Physical Address Extensions. The x86 architecture is limited by the design of its addressing modes and  page tables to accessing 4 GB of virtual and physical memory. This is a hard  limit. PAE mode allows the processor to address 64 GB of physical memory via the page tables, but does not change the size of virtual address space. 
VM  Virtual Memory subsystem 

1.2    Components

This section describes HAM and the Linux Virtual Memory (VM) subsystem - the two major components in the hot-add memory design. The current Linux VM implementation is described along with its important data structures.

1.2.1    HAM Module

The main purpose of the HAM driver module is to act as an interface between   the Memory HotPlug Driver (MHP) and the VM subsystem.

1.2.2    Linux VM

The following subsections describe the important data structures in the Linux VM and how they are linked together.

Memory zones:

Linux divides the physical memory into three zones:
DMA  0 - 16 MB 
Normal  16 - 896 MB 
Highmem  > 896 MB 

A structure of type zone_t is associated with each zone.  These are the important fields of the zone_t structure :

typedef struct zone_struct {
        unsigned long    free_pages;
        unsigned long    pages_min, pages_low, pages_high;

        /*
         * free areas of different sizes
         */
        free_area_t        free_area[MAX_ORDER];

        /*
         * Discontig memory support fields.
         */
        struct pglist_data    *zone_pgdat;
        struct page        *zone_mem_map;
        unsigned long    zone_start_paddr;
        unsigned long    zone_start_mapnr;
} zone_t;

This structure contains the zone size, number of free pages and dirty pages, pointer to free list, etc. It also includes a pointer to the contiguous memory area containing the zone, and a pointer to the first page of the zone within a memory map of the area.

Contiguous page data:
A set of contiguous physical memory pages is represented by a structure of type pg_data_t. These are some of the important fields in the pg_data_t structure:

typedef struct pglist_data {
        zone_t            node_zones[MAX_NR_ZONES];
        zonelist_t            node_zonelists[GFP_ZONEMASK+1];
        int            nr_zones;
        struct page        *node_mem_map;
        unsigned long        node_start_paddr;
        unsigned long        node_start_mapnr;
        unsigned long        node_size;
        int            node_id;
        struct pglist_data        *node_next;
} pg_data_t;

Structures describing the three zones are part of pg_data_t. It also contains the starting physical address of the area, its size, a pointer to the memory map of the area and a pointer to the next entry in the linked list. A global variable, contig_page_data, points to the first entry in the list.
In the current Linux VM implementation (i386 architecture), physical memory from 0 to maximum physical address is represented by a single pg_data_t structure. Even if the memory is physically discontiguous or if some of the addresses are not usable (for example, used by the BIOS), multiple structures are not used.
The hot-add memory design will use multiple pg_data_t structures placed in a linked list using the node_next field (design details in section 2.2).

Memory map:
Each physical memory page is represented by a structure of type mem_map_t.

typedef struct page {
        struct list_head        list;       /* ->mapping has some page lists. */
        struct address_space    *mapping;   /* The inode (or ...) we belong to. */
        unsigned long           index;      /* Our offset within mapping. */
        struct buffer_head      *buffers;   /* Buffer maps us to a disk block. */
        void                    *virtual;   /* Kernel virtual address (NULL if not kmapped). */
        struct zone_struct      *zone;      /* Memory zone we are in. */
        ...
} mem_map_t;

Data tracked by this structure include address space mapping of the page, reference count, page aging, flags (clean/dirty/reserved etc.) and a pointer to the zone to which the page belongs. For each contiguous memory area initialized during boot, a corresponding array of mem_map_t structures is allocated and initialized. All the unusable pages are marked as reserved, so that they can never be allocated. The global variable mem_map points to the first entry of the memory map corresponding to contig_page_data.
Since the i386 architecture uses a single pg_data_t structure, a single array of mem_map_t structures is used.

2.0    High Level Architecture & Design

This section describes the design details of the various subsystems involved in implementing the hot-add feature. It also lists the global data structures, routines, and macros that will be modified to support the added memory.

2.1    Linux Hot-Add Memory (HAM) Driver

As mentioned in Section [1.2.1], the main purpose of the HAM driver module is to act as an interface between the Memory HotPlug Driver (MHP) and the VM subsystem.

2.1.1    Linux Hot-Add Memory (HAM) module

The HAM driver will be a standalone loadable module. Its main purpose is to interface with the Memory Hot-Plug device driver module (MHP) and the VM subsystem, and to facilitate the induction of new memory regions into the system pool by calling into the VM subsystem. The HAM module will be initialized along with the other memory device drivers at boot time. The terms "memory range" and "memory region" mean the same thing and are used interchangeably in this section. Also, note that the actual memory devices will be recognized and operated on by the MHP driver; the HAM module acts as an intermediary between the MHP driver and the VM subsystem and provides some control over the memory ranges.

External Interface
The HAM module will provide the following external interfaces.

/proc/ham/status

This file can be used to check on the status of the installed memory ranges in the system. On a read, the output will be in the following format.

<address-range-1>  <attribute> <status: ENABLED/FAILED>
<address-range-2>  <attribute> <status: ENABLED/FAILED>

ENABLED - The memory range is present and has been integrated into the system VM.
FAILED - The memory range is present, but a failure was reported by the VM during integration.
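
For example, a read of /proc/ham/status on a system with two hot-added ranges might produce output like the following (the addresses and attribute shown are purely illustrative):

0x0000000100000000-0x000000013fffffff  RW  ENABLED
0x0000000140000000-0x000000017fffffff  RW  FAILED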

/dev/ham
The HAM module provides ioctl support through this device. The following ioctls are supported:

HAM_INTEGRATE_MEMORY
This command can be used to re-enable failed memory ranges; integration is re-attempted for each range in the FAILED state.

HAM_GET_NUM_REGIONS
This command returns the number of memory regions dynamically added to the system. Note that this number reflects both the enabled and failed ranges.

HAM_GET_REGIONS
This command returns the attributes of memory regions as an array.

HAM_ADD_MEMORY
This command will not be supported in a production system. This will be used for testing purposes.
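
As an illustration, a user-space utility might exercise these ioctls roughly as follows. The ioctl command definitions, the region-attribute structure layout and the header name used here are assumptions made for this sketch; the actual definitions would come from the header exported with the HAM module.

/* Illustrative only: the ham_ioctl.h header, the struct ham_region layout
 * and the ioctl command numbers are assumed, not defined by this document. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "ham_ioctl.h"                /* hypothetical HAM ioctl definitions */

int main(void)
{
        struct ham_region *regions;   /* hypothetical attribute structure */
        int fd, i, num;

        fd = open("/dev/ham", O_RDWR);
        if (fd < 0) {
                perror("open /dev/ham");
                return 1;
        }

        /* How many regions (enabled + failed) have been hot-added? */
        if (ioctl(fd, HAM_GET_NUM_REGIONS, &num) < 0) {
                perror("HAM_GET_NUM_REGIONS");
                close(fd);
                return 1;
        }

        /* Fetch the attributes of every region as an array. */
        regions = calloc(num, sizeof(*regions));
        if (regions && ioctl(fd, HAM_GET_REGIONS, regions) == 0)
                for (i = 0; i < num; i++)
                        printf("region %d: start=%llx size=%llx\n", i,
                               (unsigned long long)regions[i].start_address,
                               (unsigned long long)regions[i].size);

        /* Retry integration of any ranges that previously failed. */
        ioctl(fd, HAM_INTEGRATE_MEMORY);

        free(regions);
        close(fd);
        return 0;
}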

Interface to VM subsystem

The HAM module will make use of an interface provided by the VM subsystem to indicate the addition of new memory ranges. It is the responsibility of the VM subsystem to check if the whole or part of the memory has already been added (during E820 initialization).  The following parameters will be passed to the interface routine:

Start address (64 bits)
Size             (64 bits)
Read/Write Flag (8 bits)

The interface routine will be called for each memory range. The interface routine must return a success or failure code for each invocation. Trying to add a memory range that is already integrated into the system must return a success return code.
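
A possible C-level shape of this interface, consistent with the parameters above (the name matches the hotadd_mem_init() routine described in section 2.2, but the exact name and types are illustrative, not mandated by this design):

/* Sketch of the VM-side interface; name and types are illustrative. */
int hotadd_mem_init(u64 start_address,    /* 64-bit physical start address */
                    u64 size,             /* 64-bit size of the range      */
                    u8  rw_flag);         /* 8-bit read/write attribute    */
/* Returns 0 on success (including ranges that are already integrated),
 * and a non-zero value on failure. */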

Interface to memory driver module (MHP)

The MHP driver module notifies the HAM module through an interface implemented and exported by the HAM module. This also means that the HAM module must be loaded before the MHP driver module [to resolve symbols]. The parameters passed to the HAM module by the MHP driver are:

Start address (64 bits)
Size             (64 bits)
Read/Write Flag (8 bits)

This interface will return 0 on success and a non-zero value on failure. The MHP driver can check the return value to determine whether hot-added memory was successfully integrated into the system.

The following section describes the method of loading the HAM and MHP modules. It also details the means by which the MHP driver can dynamically obtain the interface exported by HAM module for hot-adding memory.

Loading the MHP and HAM modules

The MHP driver needs to call the HAM interface function to add the memory. The problem is that the HAM module may not be loaded when the MHP driver is loaded; since the MHP driver refers to a function in that module, insmod of the MHP driver will fail if HAM is not loaded. We also need to prevent the HAM module from being unloaded while the MHP driver is loaded.

There are two alternatives to solve this:

  1. Have a script load the HAM module first and then the MHP driver.
     Cons: This requires the user to run a specific script at boot time, and the script must be supplied as part of the MHP driver package.

  2. Use an OS-specific mechanism through which the HAM and MHP drivers communicate and which does not require the HAM module to be loaded first.
     One such mechanism on Linux is inter-module data exchange using the inter_module_*() functions. This also shields the name of the HAM function that actually implements the hot-add: even if the function name changes in the HAM module, the MHP driver does not have to be modified.

Interface description

The HAM module will use the inter_module_register() facility provided by the Linux kernel to export the interface function, registering it with the following code:

#define HAM_HOT_ADD    "ham_hot_add"
inter_module_register(HAM_HOT_ADD, THIS_MODULE, (void *)ham_hot_add);

The MHP driver needs to use the following code segment in its init_module() to load the HAM module and obtain the interface:

#define HAM_HOT_ADD     "ham_hot_add"
typedef int (*ham_interface_t)(unsigned long long, unsigned long long, unsigned char);
ham_interface_t ham_hot_add_func = NULL;

ham_hot_add_func = (ham_interface_t)inter_module_get_request(HAM_HOT_ADD, "ham");
if (ham_hot_add_func == NULL) {
  printk("Error! Could not load HAM module! \n");
  return(-1);
}

inter_module_get_request() will load the HAM module (if it is not already loaded) and return the interface function. Once this is done, the HAM module's use count is held until the token is released, so the HAM module cannot be unloaded until the MHP driver is unloaded first.

The cleanup_module() will contain:
inter_module_put(HAM_HOT_ADD);

And the interface function can be used by calling:

(*ham_hot_add_func)(start, size, attributes);

The function will return 0 on success and a non zero value otherwise.

Since this function can sleep, it should not be called from an interrupt context.

Data Structures & Functions

A memory range will be represented by the following structure:

typedef struct ham_range {
        struct list_head    list;           /* linked list of memory ranges */
        u64                 start_address;  /* start of memory range */
        u64                 size;           /* size of memory range */
        u8                  attribute;      /* read/write */
        u8                  status;         /* present/enabled/failed */
} HAM_RANGE;

Global Variables

static struct proc_dir_entry *ham_proc_root;

This represents the proc directory entry for /proc/ham.

static struct proc_dir_entry *ham_proc_status;

This variable represents the proc directory entry for /proc/ham/status.

static struct list_head ham_list;

This represents the linked list of all the hot-added memory ranges in the system.

static  rwlock_t  ham_list_lock;

This is a reader/writer spinlock that is used to serialize transactions between memory notifications and /proc/ham/status reads for the memory device list, ham_list.

Functions

ham_init():

This function performs the main initialization for the HAM module. It creates the /proc/ham/status file entry, initializes ham_list, and allocates other data structures. It also registers the HAM module as a character device driver. Once the module is loaded, the /dev/ham device node can be created by reading /proc/devices to obtain the major number assigned to the ham device.
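
A minimal sketch of what ham_init() might look like on a 2.4 kernel, assuming a ham_fops file_operations structure that wires up ham_ioctl(); the error handling and use of a dynamically assigned major number are illustrative choices, not requirements of this design:

/* Illustrative sketch of ham_init(); details are not mandated by this design. */
static int ham_major;

static int __init ham_init(void)
{
        /* Create /proc/ham and /proc/ham/status. */
        ham_proc_root = proc_mkdir("ham", NULL);
        if (!ham_proc_root)
                return -ENOMEM;
        ham_proc_status = create_proc_read_entry("status", 0444, ham_proc_root,
                                                 ham_proc_read_status, NULL);

        /* Initialize the list of memory ranges and its lock. */
        INIT_LIST_HEAD(&ham_list);
        rwlock_init(&ham_list_lock);

        /* Register /dev/ham as a character device (dynamic major number). */
        ham_major = register_chrdev(0, "ham", &ham_fops);
        if (ham_major < 0) {
                remove_proc_entry("status", ham_proc_root);
                remove_proc_entry("ham", NULL);
                return ham_major;
        }

        /* Export the hot-add entry point for the MHP driver. */
        inter_module_register(HAM_HOT_ADD, THIS_MODULE, (void *)ham_hot_add);
        return 0;
}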

ham_exit():

This function is called when the HAM module is unloaded. Module unload is not currently supported, so this function will return an error. Once supported, it can be used to clean up and de-allocate the HAM driver resources, delete the /proc file entries, and un-register the character device driver.

ham_proc_read_status():

This function is called when a read is done on /proc/ham/status to display the status of the memory ranges. The list of memory ranges is walked and the status of each range is printed. Mutual exclusion for ham_list is provided by the reader/writer lock, ham_list_lock.

ham_ioctl():

This function provides the ioctl() support for /dev/ham . It supports the ioctl() calls described above.

ham_integrate_memory():

This function calls the VM interface with the memory range parameters. The VM interface function returns a success or failure code; on failure, the status of the memory range is set to FAILED.

ham_add_memory():

This function is called by ham_hot_add() to implement addition of new memory ranges. The input parameters are start address, size and read/write flags. It creates a new HAM_RANGE structure to hold the memory range and adds the memory range to the list. It then calls ham_integrate_memory() to integrate the new memory range.
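
A sketch of how ham_add_memory() might assemble a HAM_RANGE and hand it to ham_integrate_memory(); the HAM_STATUS_* values, the error codes and the exact signature of ham_integrate_memory() are assumptions made for this illustration:

/* Illustrative sketch; HAM_STATUS_* values and error codes are assumed. */
static int ham_add_memory(u64 start, u64 size, u8 attribute)
{
        HAM_RANGE *range;

        range = kmalloc(sizeof(*range), GFP_KERNEL);
        if (!range)
                return -ENOMEM;

        range->start_address = start;
        range->size          = size;
        range->attribute     = attribute;
        range->status        = HAM_STATUS_PRESENT;

        /* Add the new range to the global list under the writer lock. */
        write_lock(&ham_list_lock);
        list_add_tail(&range->list, &ham_list);
        write_unlock(&ham_list_lock);

        /* Hand the range to the VM; mark it FAILED if integration fails. */
        if (ham_integrate_memory(range) != 0) {
                range->status = HAM_STATUS_FAILED;
                return -EIO;
        }

        range->status = HAM_STATUS_ENABLED;
        return 0;
}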

2.2 Linux VM

On receiving a "new memory device" indication from the HAM module, the VM performs a sequence of steps, each of which is implemented by one of the functions described below.

These functions will be added to the VM subsystem:

hotadd_mem_init():

This is the interface function called to integrate hot-added memory into the system.

Note: The page structures needed to describe the newly added memory take up a significant amount of memory. On a running system, it is not possible to obtain large amounts of contiguous memory (either physical memory or a virtual address range). Several alternatives are considered in section [4.1]. The current approach is to reserve sufficient virtual address space during startup and map parts of the newly added memory into this range. A check in this function ensures that the reserved virtual address range is not exhausted.

hotadd_mem_bootstrap():

This function initializes part of the newly added memory. The data structures needed to represent the hot-added memory are stored in this part. The amount of memory required is approximately:

(size of struct page) x (number of pages added)

hotadd_init_pgdat():

This function creates and initializes data structures to represent the hot-added memory. It also adds the new memory pages to the free list.

hotadd_init_done():
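
Taken together, the intended calling sequence of these routines can be sketched as follows. The signatures, the pfn conversions and the exact division of work between the routines (in particular for hotadd_init_done()) are assumptions made for illustration only:

/* Rough sketch of the VM-side hot-add flow; all details are illustrative. */
int hotadd_mem_init(u64 start, u64 size, u8 rw_flag)
{
        pg_data_t *pgdat;
        unsigned long start_pfn = start >> PAGE_SHIFT;
        unsigned long nr_pages  = size  >> PAGE_SHIFT;

        /* Map the head of the new memory into the reserved virtual address
         * range and place the new page structures there. */
        if (hotadd_mem_bootstrap(start_pfn, nr_pages) != 0)
                return -ENOMEM;

        /* Build a new pg_data_t (zones, mem_map) for the range, link it into
         * the global pg_data_t list and add the pages to the free lists. */
        pgdat = hotadd_init_pgdat(start_pfn, nr_pages);
        if (!pgdat)
                return -ENOMEM;

        /* Finish up, e.g. update global counters such as num_physpages and
         * totalram_pages (the exact role of this routine is not specified). */
        hotadd_init_done(pgdat);
        return 0;
}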

2.2.1    Data Structures and Global Variables

The newly added memory is represented by the following data structures:
- a structure of type pg_data_t, which represents a contiguous area of memory
- a bitmap of size corresponding to the size of added memory
- an array of 'page' structures (one for each page of added memory)

Allocating memory for these data structures from the existing free pool will limit the size of added memory. Instead, only pg_data_t will be allocated using kmalloc() and the array of page structures will come from the newly added physical memory.

The following global variables will be modified to reflect the newly added physical memory:
 
Variable         Description
num_physpages    Total physical memory in pages, including reserved pages
numnodes         Number of pg_data_t structures in the global linked list (on a NUMA system, this represents the number of CPU nodes)
totalram_pages   Total physical memory, except pages reserved at boot time
highend_pfn      Highest page frame number in the HIGHMEM zone
max_mapnr        Maximum page frame number
totalhigh_pages  Total memory in the HIGHMEM zone

The following existing VM functions and macros will be modified:

Function: free_area_init_core()
File: mm/page_alloc.c

Called at boot time to initialize memory management data structures. Parts of this function will be moved to init_pgdat(), so that it can be used to initialize data structures needed for hot-added memory.

Function: setup_arch()
File: arch/i386/kernel/setup.c

Does architecture specific initialization, including setting memory related parameters. Code will be added to reserve virtual address space needed for hot-add operation.

Function: alloc_pages()
File: include/linux/mm.h

Allocates requested number of pages. This function assumes that a single memory region exists in the system and tries to allocate memory from that region. It will be modified to check free pages in the required zone in all memory regions.

Macro: VMALLOC_START
File: include/asm-i386/pgtable.h

Defines the starting virtual address to be used by vmalloc() . It will be changed to account for the reserved virtual address range.
 

The following macros assume that only a single memory region exists. They will be modified to handle multiple memory regions; a sketch of the common change follows the list below.

Macro: page_to_phys()
File: include/asm-i386/io.h

Returns the physical address corresponding to a page structure. Assumes that only a single contiguous memory region exists.

Macro: mk_pte()
File: include/asm-i386/pgtable.h

Creates a PTE entry corresponding to a page structure.

Macro: VALID_PAGE()
File: include/asm-i386/page.h

Verifies that a given pointer refers to a valid page structure.

Macro: pte_page()
File: include/asm-i386/pgtable-2level.h (non-PAE mode)
           include/asm-i386/pgtable-3level.h (PAE mode)

Returns the page structure corresponding to a PTE entry.

Macro: BAD_RANGE()
File: mm/page_alloc.c

Verifies that a given page structure belongs to a valid zone.
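
The common theme in generalizing these macros is replacing the assumption of a single global mem_map array with a lookup across the pg_data_t list. A simplified, illustrative helper is shown below; it is not the actual patch, only the shape of the change (the helper name is hypothetical):

/* Illustrative helper: find the pg_data_t whose mem_map contains 'page'.
 * Macros such as VALID_PAGE() or page_to_phys() could be rewritten in terms
 * of such a lookup instead of assuming the single global mem_map. */
static inline pg_data_t *page_to_pgdat(struct page *page)
{
        pg_data_t *pgdat;

        /* contig_page_data is the head of the pg_data_t list (section 1.2.2). */
        for (pgdat = &contig_page_data; pgdat; pgdat = pgdat->node_next) {
                if (page >= pgdat->node_mem_map &&
                    page <  pgdat->node_mem_map + pgdat->node_size)
                        return pgdat;
        }
        return NULL;            /* not a valid page structure */
}

/* For example, VALID_PAGE(page) could become: (page_to_pgdat(page) != NULL) */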

2.2.2    Source Files and Build Environment

The existing Linux build environment will be used to compile and build the Linux VM. The new functions and macros will be added to the relevant existing files in the linux/mm and linux/arch/i386/mm directories of the kernel tree. The following files are expected to be modified in the kernel tree:

Documentation/Configure.help
arch/i386/config.in
arch/i386/kernel/setup.c
arch/i386/mm/Makefile
arch/i386/mm/fault.c
arch/i386/mm/mem_hotadd.c     [New file]
arch/i386/mm/init.c
include/asm-i386/mem_hotadd.h   [New file]
include/asm-i386/io.h
include/asm-i386/page.h
include/asm-i386/pgtable-2level.h
include/asm-i386/pgtable-3level.h
include/asm-i386/pgtable.h
include/asm-i386/pci.h
include/linux/mm.h
include/linux/mmzone.h
mm/page_alloc.c

2.2.3   Kernel Buffers

This module will not explicitly grow any kernel buffers after memory is hot-added to the system. To optimize performance, the user can tune the relevant kernel variables after hot-adding memory by using the /proc interface.
 

2.2.4   External Interface

/proc/meminfo is the interface that reports the total, used and free memory available on the system along with other statistics. This interface uses kernel global variables such as totalram_pages and totalhigh_pages and functions such as nr_free_pages() to report memory usage to the user. As mentioned in section [Data Structures and Global Variables], the VM subsystem will update all the relevant global variables, functions and macros once memory is hot-added. The /proc interface will work correctly after this is done.
 

3.0   Tools and Utilities

Tools like sar and vmstat use the /proc interface to report memory statistics. These tools will continue to work as is after adding the new memory.
 

4.0    Limitations and Assumptions


4.1   Physical memory limitation

The array of page structures, required to represent physical memory, takes up a lot of space (~16MB for every Gigabyte of memory). A machine with 64GB of RAM needs 1GB for storing page structures. This is impossible in Linux under default configuration since the address space of the kernel is only 1GB. So the total RAM that can be supported is less than the CPU-specific limit of 64GB.

The page structures should be in a contiguous physical or virtual address range within the low memory region (< 1 GB). On a running system, it is highly unlikely that 16 MB or more of free contiguous memory is available in this region. The alternatives considered are:

1. Put the data structures in the newly added memory itself.
   Even though this solves the issue of physical memory availability, a contiguous virtual address range may still not be available.
   Cons: This approach also requires increasing the number of permanent kmap entries, thus reducing the address space available for vmalloc and/or kmalloc.

2. Go through the complete list of low memory pages to obtain the maximum available memory (i.e. do not use kmalloc/vmalloc).
   Cons:
   Availability of free memory is not guaranteed.
   Huge performance hit.
   Even when memory is available, a contiguous virtual address range may still not be available.

3. Use a 4 MB page size, thus reducing the required number of page structures.
   Cons:
   Need to maintain a separate set of data structures and allocator.
   Limited usability.
   Complex to implement.

4. Reserve virtual address space in the low memory range at boot time.
   With this approach, we can reserve space for the hot-added page structures at boot time. Only a kernel virtual address range is reserved; it will be used to store the page structures once memory is hot-added. Reserving 512 MB of kernel virtual space makes it possible to add 32 GB of physical memory.
   Cons: The virtual memory range available to the kernel is reduced if we reserve the range at boot time.

5. Decrease the value of the PAGE_OFFSET macro in the kernel. This will increase the virtual memory space available to the kernel.
   Cons: This will reduce the user virtual address space available to applications.

After considering all the options, we have chosen to implement option 4. We will reserve sufficient kernel virtual address space when the system is booted after installing the hot-add package. For details, see section [Data Structures and Global Variables].
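
The sizing behind option 4 follows directly from the per-page overhead: with 4 KB pages, 1 GB of memory is 262,144 pages, and at roughly 64 bytes per struct page that is about 16 MB of page structures per GB, so a 512 MB virtual reservation covers about 32 GB of hot-added memory. A small sketch of the calculation (the 64-byte size of struct page is an approximation and depends on the kernel version and configuration):

/* Approximate sizing of the boot-time virtual address reservation. */
#define PAGE_SIZE_BYTES         4096ULL
#define PAGE_STRUCT_BYTES         64ULL   /* approximate sizeof(struct page) */

static unsigned long long reserve_bytes(unsigned long long hotadd_bytes)
{
        unsigned long long nr_pages = hotadd_bytes / PAGE_SIZE_BYTES;

        return nr_pages * PAGE_STRUCT_BYTES;   /* e.g. 32 GB -> ~512 MB */
}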

5.0   Dependencies

The Hot-add memory feature implementation in Linux will depend on the hardware.

5.1  Hardware Requirements

6.0   Miscellaneous

This section lists some performance implications of running a PAE-enabled kernel in Linux and details the results of the benchmarks we have run.
 

6.1   PAE Support in Linux

6.1.1    Impact of enabling PAE on Linux VM performance

Addressing physical memory above 4 GB on a 32-bit Intel Pentium processor requires a Linux kernel running with PAE mode enabled. To enable PAE mode, the kernel needs to be recompiled with the CONFIG_HIGHMEM and CONFIG_HIGHMEM64G flags enabled. A PAE-enabled kernel uses three-level page tables for VM address translation instead of the two-level page tables used by a non-PAE kernel. This can cause some performance impact, which can be observed by running a benchmark suite that stresses the VM subsystem.

The benchmark suite chosen to highlight this performance impact is UnixBench available at  http://www.tux.org/pub/tux/benchmarks/System/unixbench/ .

The table illustrates the difference in performance between a non-PAE kernel and a PAE-enabled kernel. The specifications of the test system are as follows:

Processor: 2-way SMP Pentium-III 1 GHz
Memory: 1 GB RAM
Model:  Compaq ProLiant DL380
Distribution: RedHat 7.1
Kernel:  2.4.13
 
 
Test Name                       Non-PAE Kernel    PAE Kernel
Dhrystone 2                          180.6           180.5
Double-precision Whetstone            97.7            97.6
Execl Throughput                     274.8           250.1
File Copy (1024 bufsize)             328.3           318.9
File Copy (256 bufsize)              373.4           366.8
File Copy (4096 bufsize)             295.2           292.1
Pipe Throughput                      397.8           387.4
Process Creation                     391.5           300.7
Shell Scripts (8 concurrent)         610.2           576.8
System Call Overhead                 311.5           292.6
FINAL SCORE                          296.2           280.0

From the final scores (296.2 versus 280.0), there is a performance degradation of about 5.5% (the usual range is 3-6%). It can also be observed that the largest difference is in "Process Creation", which is noticeably worse with PAE because the 'density' of PAE page tables is half that of non-PAE page tables (i.e., twice as much has to be copied).

6.2   Packaging and Release

The following modules and kernel subsystems are the components that will be released in the Hot-Add memory package:
 
 
Module            Installation mode            Comments
HAM module        Source RPM                   This will be a loadable module.
VM modifications  Installed as a source patch  Installation of this patch will require a kernel rebuild and system reboot.

Different flavors of Linux can be supported as long as the following requirements are met