Revision History

Version | Description |
0.1 | Initial Draft |
0.2 | Added MHP driver information |
0.3 | Added packaging information |
0.4 | Minor changes |
0.5 | Moved ACPI to Appendix |
0.6 | Minor fixes |
0.7 | Minor fixes |
1.0 | Final version |
This document details the changes needed in various Linux sub-systems to add this feature. It defines the interfaces, lists the important data structures that will be modified, and states the dependencies on the BIOS and hardware.
HAM | Hot-Add Memory |
MHP | (Hardware specific) Memory Hot Plug Driver |
PAE | Physical Address Extensions. The x86 architecture is limited by the design of its addressing modes and page tables to accessing 4 GB of virtual and physical memory. This is a hard limit. PAE mode allows the processor to address 64 GB of physical memory via the page tables, but does not change the size of virtual address space. |
VM | Virtual Memory subsystem |
1.2.1 HAM Module
The main purpose of the HAM driver module is to act as an interface between the Memory HotPlug Driver (MHP) and the VM subsystem.
1.2.2 Linux VM
The following subsections describe the important data structures in the Linux VM and how they are linked together.
Memory zones:
Linux divides the physical memory into three zones:
DMA | 0 - 16 MB |
Normal | 16 - 896 MB |
Highmem | > 896 MB |
A structure of type zone_t is associated with each zone. These are the important fields of the zone_t structure:
typedef struct zone_struct {
    unsigned long        free_pages;
    unsigned long        pages_min, pages_low, pages_high;

    /* free areas of different sizes */
    free_area_t          free_area[MAX_ORDER];

    /* Discontig memory support fields */
    struct pglist_data   *zone_pgdat;
    struct page          *zone_mem_map;
    unsigned long        zone_start_paddr;
    unsigned long        zone_start_mapnr;
    ...
} zone_t;
This structure contains the zone size, number of free pages and dirty pages, pointer to free list, etc. It also includes a pointer to the contiguous memory area containing the zone, and a pointer to the first page of the zone within a memory map of the area.
Contiguous page data:
A set of contiguous physical memory pages is represented by a structure of type pg_data_t. These are some of the important fields in the pg_data_t structure:
typedef struct pglist_data {
    zone_t               node_zones[MAX_NR_ZONES];
    zonelist_t           node_zonelists[GFP_ZONEMASK+1];
    int                  nr_zones;
    struct page          *node_mem_map;
    unsigned long        node_start_paddr;
    unsigned long        node_start_mapnr;
    unsigned long        node_size;
    int                  node_id;
    struct pglist_data   *node_next;
} pg_data_t;
Structures describing the three zones are part of pg_data_t. It also contains the starting physical address of the area, its size, a pointer to the memory map of the area, and a pointer to the next entry in the linked list. A global variable, contig_page_data, points to the first entry in the list.
In the current Linux VM implementation (i386 architecture), physical memory from 0 to the maximum physical address is represented by a single pg_data_t structure. Even if the memory is physically discontiguous or if some of the addresses are not usable (for example, used by the BIOS), multiple structures are not used.
The hot-add memory design will use multiple pg_data_t structures placed in a linked list using the node_next field (design details in section 2.2).
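For illustration, walking such a list could look like the following minimal sketch, following this document's convention that contig_page_data heads the list (the printk format is illustrative):

/* Sketch: visit every memory region through the node_next linkage. */
pg_data_t *pgdat;

for (pgdat = &contig_page_data; pgdat != NULL; pgdat = pgdat->node_next)
        printk("region at 0x%08lx, %lu pages\n",
               pgdat->node_start_paddr, pgdat->node_size);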
Memory map:
Each physical memory page is represented by a structure of type mem_map_t.
typedef struct page {
    struct list_head     list;      /* ->mapping has some page lists. */
    struct address_space *mapping;  /* The inode (or ...) we belong to. */
    unsigned long        index;     /* Our offset within mapping. */
    struct buffer_head   *buffers;  /* Buffer maps us to a disk block. */
    void                 *virtual;  /* Kernel virtual address (NULL if not kmapped). */
    struct zone_struct   *zone;     /* Memory zone we are in. */
    ...
} mem_map_t;
Data tracked by this structure includes the address space mapping of the page, reference count, page aging, flags (clean/dirty/reserved, etc.) and a pointer to the zone to which the page belongs. For each contiguous memory area initialized during boot, a corresponding array of mem_map_t structures is allocated and initialized. All the unusable pages are marked as reserved, so that they can never be allocated. The global variable mem_map points to the first entry of the memory map corresponding to contig_page_data.
Since the i386 architecture uses a single pg_data_t structure, a single array of mem_map_t structures is used.
External Interface
The HAM module will provide the following external interfaces.
/proc/ham/status
This file can be used to check the status of the installed memory ranges in the system. On a read, the output has the following format:
<address-range-1> <attribute> <status: ENABLED/FAILED>
<address-range-2> <attribute> <status: ENABLED/FAILED>
ENABLED - The memory range is present and has been integrated into the system VM.
FAILED - The memory range is present, but a failure was reported by the VM during integration.
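For example, a read might produce output of the following form (the addresses and attributes shown are purely illustrative):

0x0000000100000000-0x000000013fffffff RW ENABLED
0x0000000140000000-0x000000017fffffff RW FAILED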
/dev/ham
The HAM module provides ioctl support through this device. The following ioctls are supported:
HAM_INTEGRATE_MEMORY
This command can be used to re-enable failed memory ranges. Integration is re-attempted for the failed memory ranges.
HAM_GET_NUM_REGIONS
This command returns the number of memory regions dynamically added to the system. Note that this number reflects both the enabled and the failed ranges.
HAM_GET_REGIONS
This command returns the attributes of memory regions as an array.
HAM_ADD_MEMORY
This command will not be supported in a production system; it is intended for testing purposes only.
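A minimal user-space sketch of this interface follows. The ioctl command numbers and the region layout are not specified in this document; the declarations below are assumptions that would normally come from a HAM header file.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

/* Hypothetical region layout, mirroring the HAM_RANGE attributes. */
struct ham_region {
        unsigned long long start_address;
        unsigned long long size;
        unsigned char attribute;
        unsigned char status;
};

int main(void)
{
        struct ham_region *regions;
        int fd, n;

        fd = open("/dev/ham", O_RDONLY);
        if (fd < 0)
                return 1;

        /* Assumes the count is returned through the argument pointer. */
        if (ioctl(fd, HAM_GET_NUM_REGIONS, &n) == 0 && n > 0) {
                regions = malloc(n * sizeof(*regions));
                if (regions && ioctl(fd, HAM_GET_REGIONS, regions) == 0)
                        printf("%d hot-added region(s)\n", n);
                free(regions);
        }

        /* Retry integration of any FAILED ranges. */
        ioctl(fd, HAM_INTEGRATE_MEMORY);

        close(fd);
        return 0;
}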
Interface to VM subsystem
The HAM module will make use of an interface provided by the VM subsystem to indicate the addition of new memory ranges. It is the responsibility of the VM subsystem to check whether all or part of the memory has already been added (during E820 initialization). The following parameters will be passed to the interface routine:
Start address (64 bits)
Size (64 bits)
Read/Write flag (8 bits)
The interface routine will be called once for each memory range and must return a success or failure code for each invocation. Trying to add a memory range that is already integrated into the system must return a success code.
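Put together, the interface routine would have a signature along the following lines. The function name hotadd_mem_init() appears later in this document; the parameter names are illustrative:

/*
 * VM interface: integrate a hot-added memory range.
 * Returns a success code (0) or a failure code (non-zero). A range
 * that is already integrated must produce a success return code.
 */
int hotadd_mem_init(u64 start_address, u64 size, u8 rw_flag);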
Interface to memory driver module (MHP)
The MHP driver module notifies the HAM module through an interface implemented by the HAM module. This also means that the HAM module must be loaded before the MHP driver module (to resolve symbols). The parameters passed to the HAM module by the MHP module are:
Start address (64 bits)
Size (64 bits)
Read/Write flag (8 bits)
This interface will return 0 on success and a non-zero value on failure. The MHP driver can check the return value to determine whether the hot-added memory was successfully integrated into the system.
The following section describes the method of loading the HAM and MHP modules. It also details the means by which the MHP driver can dynamically obtain the interface exported by the HAM module for hot-adding memory.
Loading the MHP and HAM modules
The MHP driver needs to call the HAM interface function to add the memory. The problem is that the HAM module may not be loaded when the MHP driver is loaded. Since the MHP driver refers to a function in the HAM module, insmod will fail if HAM is not loaded. We also need to prevent the HAM module from being unloaded while the MHP driver is loaded.
This is solved as follows:
The HAM module will use the inter_module_register() functionality provided by the Linux kernel to export the function. The HAM module will use the following code to register the interface function:
#define HAM_HOT_ADD "ham_hot_add"
inter_module_register(HAM_HOT_ADD, THIS_MODULE, (void *)ham_hot_add);
The MHP driver needs to use the following code segment in init_module() to load the HAM module and obtain the interface:
#define HAM_HOT_ADD "ham_hot_add"

typedef int (*ham_interface_t)(unsigned long long, unsigned long long,
                               unsigned char);

ham_interface_t ham_hot_add_func = NULL;

ham_hot_add_func = (ham_interface_t)inter_module_get_request(HAM_HOT_ADD, "ham");
if (ham_hot_add_func == NULL) {
        printk("Error! Could not load HAM module!\n");
        return -1;
}
inter_module_get_request() will load the HAM module (if not already loaded) and return the interface function. Once this is done, the HAM module is locked until the token is released, so the HAM module cannot be unloaded until the MHP driver is unloaded first.
The cleanup_module() will contain:
inter_module_put(HAM_HOT_ADD);
The interface function can then be used by calling:
(*ham_hot_add_func)(start, size, attributes);
The function will return 0 on success and a non-zero value otherwise. Since this function can sleep, it must not be called from interrupt context.
Data Structures & Functions
A memory range will be represented by the following structure:
typedef struct ham_range {
    struct list_head list;    /* linked list of memory ranges */
    u64 start_address;        /* start of memory range */
    u64 size;                 /* size of memory range */
    u8  attribute;            /* read/write */
    u8  status;               /* present/enabled/failed */
} HAM_RANGE;
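The encodings of the attribute and status fields are not given in this document; a hypothetical set of values might look like this:

/* Hypothetical encodings for the HAM_RANGE fields. */
#define HAM_ATTR_RO         0x00    /* read-only range */
#define HAM_ATTR_RW         0x01    /* read/write range */

#define HAM_STATUS_PRESENT  0x00    /* reported, not yet integrated */
#define HAM_STATUS_ENABLED  0x01    /* integrated into the VM */
#define HAM_STATUS_FAILED   0x02    /* integration failed */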
Global Variables
static struct proc_dir_entry *ham_proc_root;
This represents the proc directory entry for /proc/ham.
static struct proc_dir_entry *ham_proc_status;
This variable represents the proc file entry for /proc/ham/status.
static struct list_head ham_list;
This represents the linked list of all the hot-added memory ranges in the system.
static rwlock_t ham_list_lock;
This is a reader/writer spinlock used to serialize access to the memory range list, ham_list, between memory notifications and /proc/ham/status reads.
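A sketch of the intended locking discipline, with the reader side in the /proc handler and the writer side in the memory-notification path (the helper names here are hypothetical):

/* Reader side, e.g. ham_proc_read_status(): */
static void ham_show_ranges(void)
{
        struct list_head *entry;

        read_lock(&ham_list_lock);
        list_for_each(entry, &ham_list) {
                HAM_RANGE *r = list_entry(entry, HAM_RANGE, list);
                /* format one /proc/ham/status line from r */
        }
        read_unlock(&ham_list_lock);
}

/* Writer side, e.g. ham_add_memory(): */
static void ham_insert_range(HAM_RANGE *range)
{
        write_lock(&ham_list_lock);
        list_add_tail(&range->list, &ham_list);
        write_unlock(&ham_list_lock);
}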
Functions
ham_init():
This function performs the main initialization for the HAM module. It creates the /proc/ham/status file entry, initializes ham_list, and allocates other data structures. It also registers the HAM module as a character device driver. Once the module is loaded, the /dev/ham file entry can be created by looking up the major number for the device in /proc/devices.
ham_exit():
This function is called when the HAM module is unloaded. Module unload is currently not supported, so this function will return an error. It can be used to clean up and de-allocate the HAM driver resources, delete the /proc file entries, and un-register the character device driver.
ham_proc_read_status():
This function is called when a read is done on /proc/ham/status to display the status of memory ranges. The list of memory ranges is traversed and the status of each range is displayed. Mutual exclusion for ham_list is provided by the spinlock ham_list_lock.
ham_ioctl():
This function provides the ioctl() support for /dev/ham. It supports the ioctl() calls described above.
ham_integrate_memory():
This function calls the VM interface with the memory range parameters. The VM interface function returns a success or failure code. On failure, the status of the memory range is set to FAILED.
ham_add_memory():
This function is called by ham_hot_add() to implement the addition of new memory ranges. The input parameters are the start address, size and read/write flags. It creates a new HAM_RANGE structure to hold the memory range and adds it to the list. It then calls ham_integrate_memory() to integrate the new memory range.
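A minimal sketch of ham_add_memory() as described above (error handling abbreviated; HAM_STATUS_PRESENT is the hypothetical status encoding from the previous section):

static int ham_add_memory(u64 start, u64 size, u8 attribute)
{
        HAM_RANGE *range;

        range = kmalloc(sizeof(*range), GFP_KERNEL);
        if (range == NULL)
                return -ENOMEM;

        range->start_address = start;
        range->size = size;
        range->attribute = attribute;
        range->status = HAM_STATUS_PRESENT;

        write_lock(&ham_list_lock);
        list_add_tail(&range->list, &ham_list);
        write_unlock(&ham_list_lock);

        /* Hands the range to the VM; marks it FAILED on error. */
        return ham_integrate_memory(range);
}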
These functions will be added to the VM subsystem:

hotadd_mem_init():
This is the interface function to be called to integrate hot-added memory into the system. It performs the following steps (a sketch follows the list):

Note: The page structures needed by the newly added memory take up a significant amount of memory. On a running system, it is not possible to obtain large amounts of contiguous memory (either physical memory or a virtual address range). Several alternatives are considered in section [4.1]. Our current approach is to reserve sufficient virtual addresses during startup and map parts of the newly added memory into this range. The bootstrap step below makes sure that the reserved virtual address range is not exhausted.
1. Calls hotadd_mem_bootstrap() to initialize part of the added memory.
2. Calls hotadd_init_pgdat() to initialize data structures and add the new memory to the free list.
3. Calls hotadd_init_done() to update global variables and complete the initialization.
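A minimal sketch of this sequence, assuming each helper returns 0 (or a valid pointer) on success; the exact signatures are not specified in this document:

int hotadd_mem_init(u64 start, u64 size, u8 rw_flag)
{
        pg_data_t *pgdat;

        /* Step 1: place bootstrap data structures in the new memory
         * and map them into the reserved virtual address range. */
        if (hotadd_mem_bootstrap(start, size) != 0)
                return -ENOMEM;

        /* Step 2: build the pg_data_t and page structures and put
         * the new pages on the free list. */
        pgdat = hotadd_init_pgdat(start, size, rw_flag);
        if (pgdat == NULL)
                return -ENOMEM;

        /* Step 3: update num_physpages, numnodes, totalram_pages, etc. */
        hotadd_init_done(pgdat);
        return 0;
}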
hotadd_mem_bootstrap():
This function initializes part of the newly added memory. The data structures needed to represent the hot-added memory are stored in this part. The space required is:

(size of struct page) × (number of pages)
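For example, assuming (for illustration) a 64-byte struct page and 4 KB pages, hot-adding 4 GB of memory (1,048,576 pages) requires 64 MB for page structures. The same ratio underlies alternative 4 in section [4.1], where 512 MB of reserved virtual space covers 32 GB of hot-added memory.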
hotadd_init_pgdat():
This function creates and initializes the data structures that represent the hot-added memory and adds the new memory pages to the free list. It performs the following steps:

1. Calls init_pgdat() with appropriate parameters. This function initializes the pg_data_t structure and the page structures for all pages. It also builds the zone lists and the free list bitmap.
2. Adds the new pg_data_t structure to the end of a global linked list.
3. Adds the new pages to the free list by calling __free_page() for each page structure.

hotadd_init_done():
This function updates global variables and completes the initialization. After this step, the hot-added memory is fully represented by its own pg_data_t, which represents a contiguous area of memory.

Allocating memory for these data structures from the existing free pool would limit the size of added memory. Instead, only the pg_data_t will be allocated using kmalloc(), and the array of page structures will come from the newly added physical memory.
The following global variables will be modified to reflect the newly added physical memory:

Variable | Description |
num_physpages | Total physical memory in pages, including reserved pages |
numnodes | Number of pg_data_t structures in the global linked list (on a NUMA system, this represents the number of CPU nodes) |
totalram_pages | Total physical memory, except pages reserved at boot time |
highend_pfn | Highest page frame number in the HIGHMEM zone |
max_mapnr | Maximum page frame number |
totalhigh_pages | Total memory in the HIGHMEM zone |
The following existing VM functions and macros will be modified:
Function: free_area_init_core()
File: mm/page_alloc.c
Called at boot time to initialize memory management data structures. Parts of this function will be moved to init_pgdat(), so that it can be used to initialize the data structures needed for hot-added memory.
Function: setup_arch()
File: arch/i386/kernel/setup.c
Does architecture specific initialization, including setting memory related parameters. Code will be added to reserve virtual address space needed for hot-add operation.
Function: alloc_pages()
File: include/linux/mm.h
Allocates the requested number of pages. This function assumes that a single memory region exists in the system and tries to allocate memory from that region. It will be modified to check free pages in the required zone in all memory regions.
Macro: VMALLOC_START
File: include/asm-i386/pgtable.h
Defines the starting virtual address to be used by vmalloc(). It will be changed to account for the reserved virtual address range.
The following macros assume that only a single memory region exists. They will be modified to handle multiple memory regions.
Macro: page_to_phys()
File: include/asm-i386/io.h
Returns the physical address corresponding to a page structure. Assumes that only a single contiguous memory region exists.
Macro: mk_pte()
File: include/asm-i386/pgtable.h
Creates a PTE entry corresponding to a page structure.
Macro: VALID_PAGE()
File: include/asm-i386/page.h
Verifies that a given pointer refers to a valid page structure.
Macro: pte_page()
File: include/asm-i386/pgtable-2level.h (non-PAE mode), include/asm-i386/pgtable-3level.h (PAE mode)
Returns the page structure corresponding to a PTE entry.
Macro: BAD_RANGE()
File: mm/page_alloc.c
Verifies that a given page structure belongs to a valid zone.
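For illustration, the single-region assumption is explicit in the 2.4-era definitions, which compute everything relative to the single mem_map array. Below is a sketch of the kind of change required; page_to_pgdat() is a hypothetical helper that finds the pg_data_t whose node_mem_map contains the page:

/*
 * Existing 2.4 definitions assume a single mem_map covering all memory:
 *   #define page_to_phys(page)  (((page) - mem_map) << PAGE_SHIFT)
 *   #define VALID_PAGE(page)    ((page) - mem_map < max_mapnr)
 */

/* Possible multi-region replacements (page_to_pgdat() is hypothetical): */
#define page_to_phys(page)                                              \
        (page_to_pgdat(page)->node_start_paddr +                        \
         (((page) - page_to_pgdat(page)->node_mem_map) << PAGE_SHIFT))
#define VALID_PAGE(page)        (page_to_pgdat(page) != NULL)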
The following files will be added or modified:

Documentation/Configure.help
arch/i386/config.in
arch/i386/kernel/setup.c
arch/i386/mm/Makefile
arch/i386/mm/fault.c
arch/i386/mm/mem_hotadd.c [New file]
arch/i386/mm/init.c
include/asm-i386/mem_hotadd.h [New file]
include/asm-i386/io.h
include/asm-i386/page.h
include/asm-i386/pgtable-2level.h
include/asm-i386/pgtable-3level.h
include/asm-i386/pgtable.h
include/asm-i386/pci.h
include/linux/mm.h
include/linux/mmzone.h
mm/page_alloc.c
The /proc interface uses global variables such as totalram_pages and totalhigh_pages, and functions such as nr_free_pages(), to report memory usage to the user. As mentioned in section [Data Structures and Global Variables], the VM subsystem will update all the relevant global variables, functions and macros once memory is hot-added. The /proc interface will work correctly after this is done.
The page structures should be in a contiguous physical or virtual address range within the low memory region (< 1 GB). On a running system, it is highly unlikely that 16 MB or more of free memory is available in this region. The alternatives considered are:
1. Put the data structures in the newly added memory itself. Even though this solves the issue of physical memory availability, a contiguous virtual address range may still not be available.
Cons:
This approach also requires increasing the number of permanent kmap entries, thus reducing the address space available for vmalloc and/or kmalloc.
2. Go through the complete list of low memory pages to obtain the maximum available memory (i.e., do not use kmalloc/vmalloc).
Cons:
Availability of free memory is not guaranteed.
Huge performance hit.
Even when memory is available, a contiguous virtual address range may still not be available.
3. Use a 4 MB page size, thus reducing the required number of page structures.
Cons:
Requires maintaining a separate set of data structures and a separate allocator.
Limited usability.
Complex to implement.
4. Reserve virtual space in the low memory range at boot time. With this approach, we can reserve the virtual address range for the hot-added page structures at boot time. Only the kernel virtual address range will be reserved, and it will be used to store the page structures once memory is hot-added. Reserving 512 MB of kernel virtual space makes it possible to add 32 GB of physical memory.
Cons:
The virtual memory range available to the kernel will be reduced if we reserve the range at boot time.
5. Decrease the value of the PAGE_OFFSET macro in the kernel. This will increase the virtual memory space available to the kernel.
Cons:
This will reduce the user virtual address space available to applications.
After considering all the options, we have chosen to implement option 4. Sufficient kernel virtual address space will be reserved when the system is booted after the hot-add package is installed. For details, see section [Data Structures and Global Variables].
The benchmark suite chosen to highlight this performance impact is UnixBench, available at http://www.tux.org/pub/tux/benchmarks/System/unixbench/.
The table illustrates the difference in performance between a non-PAE kernel and a PAE-enabled kernel. The specifications of the test system are as follows:
Processor: 2-way SMP Pentium III, 1 GHz
Memory: 1 GB RAM
Model: Compaq ProLiant DL380
Distribution: RedHat 7.1
Kernel: 2.4.13
Benchmark | non-PAE kernel | PAE kernel |
Dhrystone 2 | 180.6 | 180.5 |
Double-precision Whetstone | 97.7 | 97.6 |
Execl Throughput | 274.8 | 250.1 |
File Copy (1024 bufsize) | 328.3 | 318.9 |
File Copy (256 bufsize) | 373.4 | 366.8 |
File Copy (4096 bufsize) | 295.2 | 292.1 |
Pipe Throughput | 397.8 | 387.4 |
Process Creation | 391.5 | 300.7 |
Shell Scripts (8 concurrent) | 610.2 | 576.8 |
System Call Overhead | 311.5 | 292.6 |
FINAL SCORE | 296.2 | 280.0 |
From the final scores, there is a performance degradation of about 5.5% (the usual range is 3-6%). It can also be observed that the main difference stems from "Process Creation", which is significantly worse with PAE because the density of PAE page tables is half that of non-PAE page tables (i.e., twice as much data has to be copied).
Module | Installation mode | Comments |
HAM module | As a source RPM | This will be a loadable module. |
VM modifications | Installed as source patch | Installation of this patch will require a kernel rebuild and system reboot. |
Different flavors of Linux can be supported as long as the following requirements are met: