Table of contents
- 1. __alloc_pages_direct_compact
- 1.1 compaction_suitable
- 1.2 compact_finished
- 1.3 isolate_migratepages
- 1.4 migrate_pages
- 1.4.1 __unmap_and_move
- 1.4.1.1 move_to_new_page
- 1.4.1.1.1 migrate_page
- 1.4.1.1.2 fallback_migrate_page
- 1.4.1.2 remove_migration_ptes
- 1.4.2 compaction_alloc
- 1.4.3 compaction_free
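Memory compaction (a form of memory defragmentation) rests on a simple idea: scan for allocated movable pages from the bottom of a zone, scan for free pages from the top of the zone, move the movable pages from the bottom into the free pages at the top, and so form contiguous free pages at the bottom.
The compaction algorithm works as follows:
- First, scan from the bottom of the zone toward the top, one pageblock at a time. Within each pageblock, scan from the first page to the last and collect the pageblock's movable pages into a list.
- Then, scan from the top of the zone toward the bottom, again one pageblock at a time and from the first page to the last inside each pageblock, and collect the free pages into a list.
- Finally, copy the data of the movable pages at the bottom into the free pages at the top, and update the process page tables so that the virtual pages map to the new physical pages.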
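The two-scanner idea can be sketched in user space. This is a toy model, not kernel code: `toy_page` and `toy_compact` are hypothetical names, the "zone" is a plain array, and pages are handled one at a time rather than per pageblock or in batches.

```c
#include <stddef.h>

/* Toy page states; hypothetical names, not kernel definitions. */
enum toy_page { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/*
 * Compact a toy "zone" in place.
 * lo plays the migrate scanner and walks up from the bottom looking for
 * movable pages; hi plays the free scanner and walks down from the top
 * looking for free pages. Returns the number of pages moved.
 */
size_t toy_compact(enum toy_page *zone, size_t n)
{
    size_t moved = 0;
    size_t lo = 0;   /* migrate scanner position */
    size_t hi = n;   /* free scanner position, one past the candidate */

    while (lo < hi) {
        if (zone[lo] != TOY_MOVABLE) {   /* nothing to migrate here */
            lo++;
            continue;
        }
        if (zone[hi - 1] != TOY_FREE) {  /* not a usable target */
            hi--;
            continue;
        }
        /* "Migrate": the movable page now lives in the top free slot. */
        zone[hi - 1] = TOY_MOVABLE;
        zone[lo] = TOY_FREE;
        moved++;
        lo++;
        hi--;
    }
    return moved;   /* the scanners have met */
}
```

Unmovable pages stay where they are, which is why a single unmovable page can still split the free space at the bottom of the zone.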
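Memory compaction has three priorities, listed here from highest to lowest:
- COMPACT_PRIO_SYNC_FULL: fully synchronous mode. Blocking is allowed, dirty file pages may be written back to storage, and the caller waits for the writeback to finish.
- COMPACT_PRIO_SYNC_LIGHT: light synchronous mode. Most operations may block, but dirty file pages are not written back to storage.
- COMPACT_PRIO_ASYNC: asynchronous mode. Blocking is not allowed.
Fully synchronous mode is the most expensive, light synchronous mode comes next, and asynchronous mode is the cheapest. In the slow path of page allocation, if the conditions are met, the first call to __alloc_pages_direct_compact uses the cheapest asynchronous mode. If that attempt does not satisfy the allocation, later calls can move to a synchronous mode. The compact_priority variable in the slow path selects the priority; the slow-path code has the details.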
1. __alloc_pages_direct_compact
Let's start with how __alloc_pages_direct_compact performs compaction:
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, enum compact_result *compact_result)
{
struct page *page = NULL;
unsigned long pflags;
unsigned int noreclaim_flag;
if (!order)
return NULL;
psi_memstall_enter(&pflags);// a no-op unless CONFIG_PSI is enabled
// set PF_MEMALLOC in current->flags so that allocations made while compacting do not recurse into reclaim
noreclaim_flag = memalloc_noreclaim_save();
// try to compact pages
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio, &page);
memalloc_noreclaim_restore(noreclaim_flag);// restore current->flags, compaction is over
psi_memstall_leave(&pflags);// a no-op unless CONFIG_PSI is enabled
count_vm_event(COMPACTSTALL);// count one more direct-compaction stall
if (page)// compaction captured a page
// prepare the captured page for use (reset stale state and flags)
prep_new_page(page, order, gfp_mask, alloc_flags);
if (!page)// no page was captured
// try to allocate from the buddy free lists
page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page) {// a page was obtained
struct zone *zone = page_zone(page);// zone the page belongs to
zone->compact_blockskip_flush = false;
compaction_defer_reset(zone, order, true);// reset the compaction deferral state
count_vm_event(COMPACTSUCCESS);// count one more compaction success
return page;
}
/*
* It's bad if compaction run occurs and fails. The most likely reason
* is that pages exist, but not enough to satisfy watermarks.
*/
count_vm_event(COMPACTFAIL);// count one more compaction failure
cond_resched();// yield the CPU if a reschedule is pending; the slow path carries on afterwards
return NULL;
}
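__alloc_pages_direct_compact runs these steps:
- Call try_to_compact_pages to attempt compaction.
- If compaction captured a page, call prep_new_page to prepare it for use.
- If no page was captured, call get_page_from_freelist to try to allocate from the buddy system.
- If either of the two previous steps produced a page, call compaction_defer_reset to reset the zone's compaction deferral fields and return the page.
- If no page was obtained, call cond_resched to let other tasks run if needed and return NULL; the slow path then continues.
Only try_to_compact_pages deserves a closer look; the rest is minor: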
enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, struct page **capture)
{
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
enum compact_result rc = COMPACT_SKIPPED;
if (!may_perform_io)// the caller does not allow I/O (__GFP_IO is not set)
return COMPACT_SKIPPED;// do not compact
trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
// walk every zone in the zonelist
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
ac->highest_zoneidx, ac->nodemask) {
enum compact_result status;
if (prio > MIN_COMPACT_PRIORITY// the priority is not fully synchronous
&& compaction_deferred(zone, order)) {// and compaction of this zone is currently deferred
rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;// do not compact this zone, move to the next one
}
// compact this zone
status = compact_zone_order(zone, order, gfp_mask, prio,
alloc_flags, ac->highest_zoneidx, capture);
rc = max(status, rc);
/* The allocation should succeed, stop compacting */
if (status == COMPACT_SUCCESS) {// compaction succeeded
compaction_defer_reset(zone, order, false);// reset the deferral state
break;// leave the loop and stop compacting
}
// the priority is not async and the scan ran to the end (whole zone or part of it) without success
if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
status == COMPACT_PARTIAL_SKIPPED))
defer_compaction(zone, order);// defer future compaction of this zone
// the priority is async and a reschedule is needed, or the task has a fatal signal pending
if ((prio == COMPACT_PRIO_ASYNC && need_resched())
|| fatal_signal_pending(current))
break;// stop compacting
}
return rc;
}
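try_to_compact_pages works as follows:
- Use the macro for_each_zone_zonelist_nodemask to walk every zone in the zonelist and, for each zone, do the following:
- If the priority is not fully synchronous and compaction of the zone is deferred, skip the zone and move to the next one.
- Call compact_zone_order to compact the zone.
- If compaction succeeded, call compaction_defer_reset to reset the zone's deferral state and leave the loop.
- If compaction did not succeed, the priority is not async, and the scan ran to the end (COMPACT_COMPLETE or COMPACT_PARTIAL_SKIPPED), call defer_compaction(zone, order) to update the deferral state.
- If the priority is async and a reschedule is needed, or the task has a fatal signal pending, leave the loop.
compaction_deferred, compaction_defer_reset and defer_compaction read or update the zone's compact_considered and compact_defer_shift fields to decide, or to record, whether compaction should be deferred.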
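The deferral logic is an exponential back-off, which a small user-space model can illustrate. This is a simplified sketch under stated assumptions: `model_zone` and the `model_*` functions are hypothetical stand-ins, the cap of 6 is assumed to match the kernel's COMPACT_MAX_DEFER_SHIFT, and the per-order bookkeeping (compact_order_failed) is left out.

```c
#include <stdbool.h>

#define MODEL_MAX_DEFER_SHIFT 6   /* assumed cap on the back-off exponent */

/* Hypothetical stand-in for the two zone fields named in the text. */
struct model_zone {
    unsigned int considered;    /* attempts seen since the last failure */
    unsigned int defer_shift;   /* skip until 1 << defer_shift attempts were seen */
};

/* Record a failed compaction: restart the counter, double the window. */
void model_defer(struct model_zone *z)
{
    z->considered = 0;
    if (z->defer_shift < MODEL_MAX_DEFER_SHIFT)
        z->defer_shift++;
}

/* Return true while this attempt should be skipped. */
bool model_deferred(struct model_zone *z)
{
    unsigned int limit = 1u << z->defer_shift;

    if (++z->considered >= limit) {
        z->considered = limit;   /* clamp so the counter cannot overflow */
        return false;
    }
    return true;
}

/* Record a successful allocation: drop the back-off entirely. */
void model_defer_reset(struct model_zone *z)
{
    z->considered = 0;
    z->defer_shift = 0;
}
```

After one failure the next attempt is skipped, after two consecutive failures the next three are skipped, and so on, until a success resets the window.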
Next, compact_zone_order:
static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum compact_priority prio,
unsigned int alloc_flags, int highest_zoneidx,
struct page **capture)
{
enum compact_result ret;
struct compact_control cc = {
.order = order, // order of the allocation being served
.search_order = order, // order at which the search for free pages starts
.gfp_mask = gfp_mask,
.zone = zone, // zone to compact
.mode = (prio == COMPACT_PRIO_ASYNC) ?
MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT,
.alloc_flags = alloc_flags,
.highest_zoneidx = highest_zoneidx,
.direct_compaction = true, // this is direct compaction
.whole_zone = (prio == MIN_COMPACT_PRIORITY), // scan the whole zone only at the highest priority
.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
};
struct capture_control capc = {
.cc = &cc,
.page = NULL,
};
/*
* Make sure the structs are really initialized before we expose the
* capture control, in case we are interrupted and the interrupt handler
* frees a page.
*/
barrier();// compiler barrier
WRITE_ONCE(current->capture_control, &capc);
ret = compact_zone(&cc, &capc);// compact using this compact_control
VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
/*
* Make sure we hide capture control first before we read the captured
* page pointer, otherwise an interrupt could free and capture a page
* and we would leak it.
*/
WRITE_ONCE(current->capture_control, NULL);
*capture = READ_ONCE(capc.page);
return ret;
}
compact_zone_order works as follows:
- Declare and initialise a compact_control and a capture_control.
- Call compact_zone to perform the compaction.
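The way the compact_control initialiser turns a priority into scan settings can be restated as a small table-like function. The sketch below is a user-space model: the `model_*` names are hypothetical, and only the three fields derived from `prio` in the listing above are covered.

```c
#include <stdbool.h>

/* Hypothetical mirror of the three priorities; a lower value means a higher priority. */
enum model_prio { PRIO_SYNC_FULL = 0, PRIO_SYNC_LIGHT = 1, PRIO_ASYNC = 2 };
enum model_mode { MODE_ASYNC, MODE_SYNC_LIGHT };

struct model_settings {
    enum model_mode mode;     /* migration mode handed to the migration code */
    bool whole_zone;          /* scan the whole zone instead of resuming */
    bool ignore_skip_hint;    /* rescan pageblocks that were marked skip */
};

/* Derive the settings the same way the initialiser above does. */
struct model_settings model_settings_for(enum model_prio prio)
{
    struct model_settings s;

    s.mode = (prio == PRIO_ASYNC) ? MODE_ASYNC : MODE_SYNC_LIGHT;
    s.whole_zone = (prio == PRIO_SYNC_FULL);
    s.ignore_skip_hint = (prio == PRIO_SYNC_FULL);
    return s;
}
```

Note that, in this listing, both synchronous priorities map to the light synchronous migration mode; the fully synchronous priority differs by scanning the whole zone and ignoring the skip hints.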
compact_zone:
static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
enum compact_result ret;
unsigned long start_pfn = cc->zone->zone_start_pfn;// first page frame of the zone
unsigned long end_pfn = zone_end_pfn(cc->zone);// end page frame of the zone (exclusive)
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
/*
* These counters track activities during zone compaction. Initialize
* them before compacting a new zone.
*/
cc->total_migrate_scanned = 0;
cc->total_free_scanned = 0;
cc->nr_migratepages = 0;
cc->nr_freepages = 0;
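INIT_LIST_HEAD(&cc->freepages);// list of free pages that serve as migration targets
INIT_LIST_HEAD(&cc->migratepages);// list of pages waiting to be migrated
cc->migratetype = gfp_migratetype(cc->gfp_mask);
// check the zone's watermarks to decide whether compaction is worthwhile; COMPACT_CONTINUE means go ahead
ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
cc->highest_zoneidx);
// COMPACT_SUCCESS (the allocation can already succeed) and COMPACT_SKIPPED (too little free memory) both mean there is nothing to compact
if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
return ret;// return early
/* huh, compaction_suitable is returning something unexpected */
VM_BUG_ON(ret != COMPACT_CONTINUE);
if (compaction_restarting(cc->zone, cc->order))// compaction restarts after repeated failures
__reset_isolation_suitable(cc->zone);// clear the skip hints left by earlier runs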
/*
* Setup to move all movable pages to the end of the zone. Used cached
* information on where the scanners should start (unless we explicitly
* want to compact the whole zone), but check that it is initialised
* by ensuring the values are within zone boundaries.
*/
cc->fast_start_pfn = 0;
if (cc->whole_zone) {// the whole zone is to be compacted
cc->migrate_pfn = start_pfn;
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
} else {
// resume both scanners from their cached positions
cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
cc->free_pfn = cc->zone->compact_cached_free_pfn;
// clamp free_pfn to the zone
if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
cc->zone->compact_cached_free_pfn = cc->free_pfn;
}
// clamp migrate_pfn to the zone
if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
cc->migrate_pfn = start_pfn;
cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
cc->whole_zone = true;
}
last_migrated_pfn = 0;
/*
* Migrate has separate cached PFNs for ASYNC and SYNC* migration on
* the basis that some migrations will fail in ASYNC mode. However,
* if the cached PFNs match and pageblocks are skipped due to having
* no isolation candidates, then the sync state does not matter.
* Until a pageblock with isolation candidates is found, keep the
* cached PFNs in sync to avoid revisiting the same blocks.
*/
update_cached = !sync &&
cc->zone->compact_cached_migrate_pfn[0] == cc->zone->compact_cached_migrate_pfn[1];
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync);
migrate_prep_local();
// compact_finished checks the termination conditions:
// the migrate and free scanners have met, or a large enough free block exists
while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
int err;
unsigned long start_pfn = cc->migrate_pfn;
/*
* Avoid multiple rescans which can happen if a page cannot be
* isolated (dirty/writeback in async mode) or if the migrated
* pages are being allocated before the pageblock is cleared.
* The first rescan will capture the entire pageblock for
* migration. If it fails, it'll be marked skip and scanning
* will proceed as normal.
*/
cc->rescan = false;
if (pageblock_start_pfn(last_migrated_pfn) ==
pageblock_start_pfn(start_pfn)) {
cc->rescan = true;
}
// scan for pages that are suitable for migration and isolate them
switch (isolate_migratepages(cc)) {
case ISOLATE_ABORT:// isolation was aborted, stop compacting
ret = COMPACT_CONTENDED;
putback_movable_pages(&cc->migratepages);// put the pages back where they came from
cc->nr_migratepages = 0;
goto out;
case ISOLATE_NONE:// nothing could be isolated
if (update_cached) {
cc->zone->compact_cached_migrate_pfn[1] =
cc->zone->compact_cached_migrate_pfn[0];
}
goto check_drain;
case ISOLATE_SUCCESS:// isolation succeeded, go on to migrate
update_cached = false;
last_migrated_pfn = start_pfn;
;
}
// core of page migration: take the pages from cc->migratepages and try to migrate them
err = migrate_pages(&cc->migratepages, compaction_alloc,
compaction_free, (unsigned long)cc, cc->mode,
MR_COMPACTION);
trace_mm_compaction_migratepages(cc->nr_migratepages, err,
&cc->migratepages);
/* All pages were either migrated or will be released */
cc->nr_migratepages = 0;
if (err) {// some pages could not be migrated
putback_movable_pages(&cc->migratepages);// put the pages back where they came from
// out of memory although the scanners have not met yet
if (err == -ENOMEM && !compact_scanners_met(cc)) {
ret = COMPACT_CONTENDED;
goto out;// return COMPACT_CONTENDED
}
/*
* We failed to migrate at least one page in the current
* order-aligned block, so skip the rest of it.
*/
if (cc->direct_compaction &&
(cc->mode == MIGRATE_ASYNC)) {
cc->migrate_pfn = block_end_pfn(
cc->migrate_pfn - 1, cc->order);
/* Draining pcplists is useless in this case */
last_migrated_pfn = 0;
}
}
check_drain:
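if (cc->order > 0 && last_migrated_pfn) {// last_migrated_pfn holds the pfn where the last successful isolation started
unsigned long current_block_start =
block_start_pfn(cc->migrate_pfn, cc->order);// start of the order-aligned block the migrate scanner is in now
if (last_migrated_pfn < current_block_start) {
lru_add_drain_cpu_zone(cc->zone);// drain this CPU's LRU caches and per-cpu page lists so freed pages return to the buddy system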
/* No more flushing until we migrate again */
last_migrated_pfn = 0;
}
}
/* Stop if a page has been captured */
if (capc && capc->page) {// a page has been captured
ret = COMPACT_SUCCESS;
break;// leave the scan-and-isolate loop
}
}
out:
/*
* Release free pages and update where the free scanner should restart,
* so we don't leave any returned pages behind in the next attempt.
*/
if (cc->nr_freepages > 0) {
unsigned long free_pfn = release_freepages(&cc->freepages);// give the unused isolated free pages back
cc->nr_freepages = 0;// reset the free page count
VM_BUG_ON(free_pfn == 0);
/* The cached pfn is always the first in a pageblock */
free_pfn = pageblock_start_pfn(free_pfn);
/*
* Only go back, not forward. The cached pfn might have been
* already reset to zone end in compact_finished()
*/
if (free_pfn > cc->zone->compact_cached_free_pfn)
cc->zone->compact_cached_free_pfn = free_pfn;
}
count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
cc->free_pfn, end_pfn, sync, ret);
return ret;
}
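compact_zone works as follows:
- Call compaction_suitable to decide whether the zone is suitable for compaction.
- Set the starting page frame numbers of the migrate scanner and the free scanner.
- Call compact_finished to check whether compaction is done. If it is, go to the final release_freepages step.
- Call isolate_migratepages to isolate movable pages and add them to the migrate scanner's list of pages to migrate.
- If isolate_migratepages reports an abort, call putback_movable_pages to put the isolated pages back and stop.
- If isolate_migratepages finds nothing to isolate, skip the migration and go straight to the drain check.
- If isolation succeeded, call migrate_pages to move the movable pages into free pages at the top of the zone.
- If migration failed, call putback_movable_pages to put the remaining pages back.
- Drain check: once the migrate scanner has left the block it last migrated from, flush the per-CPU caches. If a page has been captured, leave the loop; otherwise go back to the compact_finished check.
- Call release_freepages to give back the free pages that the free scanner isolated but did not use.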
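The second step, choosing where the two scanners start, is mostly bounds checking on cached positions. The sketch below models it in user space. All names are hypothetical, a pageblock is assumed to hold 512 pages, and the write-back of the corrected values into the zone's cache is omitted.

```c
/* Hypothetical model, not kernel code. Assumes 512-page pageblocks. */
#define MODEL_PAGEBLOCK_NR_PAGES 512UL

struct model_scanners {
    unsigned long migrate_pfn;   /* where the migrate scanner starts */
    unsigned long free_pfn;      /* where the free scanner starts */
};

/* First pfn of the pageblock that contains pfn. */
static unsigned long model_pageblock_start(unsigned long pfn)
{
    return pfn & ~(MODEL_PAGEBLOCK_NR_PAGES - 1);
}

/*
 * Pick scanner start positions for a zone covering [zone_start, zone_end).
 * With whole_zone set, both scanners start at the zone's edges; otherwise
 * the cached positions are used, falling back to the edges when a cached
 * value lies outside the zone.
 */
struct model_scanners model_scan_start(unsigned long zone_start,
                                       unsigned long zone_end,
                                       unsigned long cached_migrate,
                                       unsigned long cached_free,
                                       int whole_zone)
{
    struct model_scanners s;

    if (whole_zone) {
        s.migrate_pfn = zone_start;
        s.free_pfn = model_pageblock_start(zone_end - 1);
        return s;
    }
    s.migrate_pfn = cached_migrate;
    s.free_pfn = cached_free;
    if (s.free_pfn < zone_start || s.free_pfn >= zone_end)
        s.free_pfn = model_pageblock_start(zone_end - 1);
    if (s.migrate_pfn < zone_start || s.migrate_pfn >= zone_end)
        s.migrate_pfn = zone_start;
    return s;
}
```

The free scanner always starts at the beginning of the last pageblock, which is why the zone end is rounded down rather than used directly.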
The sections below look at compaction_suitable, compact_finished, isolate_migratepages and migrate_pages in detail.
1.1 compaction_suitable
enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int highest_zoneidx)
{
enum compact_result ret;
int fragindex;
// core check for whether compaction is suitable
ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
zone_page_state(zone, NR_FREE_PAGES));
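// suitable so far, but the request is above order 3 (PAGE_ALLOC_COSTLY_ORDER)
if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
fragindex = fragmentation_index(zone, order);// compute the fragmentation index
// an index in [0, sysctl_extfrag_threshold] points to a lack of memory rather than fragmentation
if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
ret = COMPACT_NOT_SUITABLE_ZONE;// compaction would not help in this zone
}
trace_mm_compaction_suitable(zone, order, ret);
// the zone is not suitable
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED;// report that compaction is skipped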
return ret;
}
compaction_suitable mostly relies on __compaction_suitable to decide whether compaction is suitable. __compaction_suitable:
static enum compact_result __compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags,
int highest_zoneidx,
unsigned long wmark_target)
{
unsigned long watermark;
// order == -1 means the administrator triggered compaction through /proc/sys/vm/compact_memory
if (is_via_compact_memory(order))// order is -1
return COMPACT_CONTINUE;// force compaction of the zone
// watermark selected by the caller's alloc_flags
watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
// the watermark check passes for the requested order
if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
alloc_flags))
return COMPACT_SUCCESS;// the allocation can already succeed, no compaction needed
// requests above order 3 use the low watermark, the others use the min watermark
watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
low_wmark_pages(zone) : min_wmark_pages(zone);
watermark += compact_gap(order);// add twice the request size as headroom for migration targets
// not enough order-0 free pages to reach that watermark
if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
ALLOC_CMA, wmark_target))
return COMPACT_SKIPPED;// skip compaction
return COMPACT_CONTINUE;// compaction should go ahead
}
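__compaction_suitable works as follows:
- If order is -1, compaction of the zone is forced; return COMPACT_CONTINUE.
- Look up the watermark the caller is allowed to use and call zone_watermark_ok for the requested order. If the check passes, a large enough free block already exists and no compaction is needed; return COMPACT_SUCCESS.
- Otherwise take the low watermark for requests above order 3 and the min watermark for smaller requests, add twice the request size (compact_gap), and call __zone_watermark_ok at order 0. If that check fails, free memory is too scarce for compaction to work; return COMPACT_SKIPPED.
- In all other cases return COMPACT_CONTINUE.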
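The decision can be modelled with plain numbers. The sketch below is a simplification under stated assumptions: the `model_*` names are hypothetical, the first watermark check is reduced to a boolean supplied by the caller, and lowmem reserves and CMA accounting are ignored.

```c
/* Hypothetical result codes standing in for the COMPACT_* values. */
enum model_result { MODEL_SKIPPED, MODEL_CONTINUE, MODEL_SUCCESS };

#define MODEL_COSTLY_ORDER 3   /* mirrors PAGE_ALLOC_COSTLY_ORDER */

/* Twice the size of the request, in pages. */
unsigned long model_compact_gap(unsigned int order)
{
    return 2UL << order;
}

/*
 * order       requested order, or -1 for administrator-triggered compaction
 * has_block   non-zero if the first watermark check already passes
 * free_pages  order-0 free pages in the zone
 */
enum model_result model_suitable(int order, int has_block,
                                 unsigned long free_pages,
                                 unsigned long min_wmark,
                                 unsigned long low_wmark)
{
    unsigned long wmark;

    if (order == -1)
        return MODEL_CONTINUE;   /* forced compaction */
    if (has_block)
        return MODEL_SUCCESS;    /* nothing to do */

    wmark = (order > MODEL_COSTLY_ORDER) ? low_wmark : min_wmark;
    wmark += model_compact_gap((unsigned int)order);
    if (free_pages <= wmark)
        return MODEL_SKIPPED;    /* too little memory to migrate into */
    return MODEL_CONTINUE;
}
```

For an order-2 request with a min watermark of 100 pages, the model needs more than 108 free pages before it lets compaction run.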
1.2 compact_finished
static enum compact_result compact_finished(struct compact_control *cc)
{
int ret;
ret = __compact_finished(cc);
trace_mm_compaction_finished(cc->zone, cc->order, ret);
if (ret == COMPACT_NO_SUITABLE_PAGE)
ret = COMPACT_CONTINUE;
return ret;
}
static enum compact_result __compact_finished(struct compact_control *cc)
{
unsigned int order;
const int migratetype = cc->migratetype;
int ret;
if (compact_scanners_met(cc)) {// the migrate and free scanners have met
reset_cached_positions(cc->zone);// the next compaction starts from the beginning
if (cc->direct_compaction)// this is direct compaction
cc->zone->compact_blockskip_flush = true;
if (cc->whole_zone)
return COMPACT_COMPLETE;// the whole zone was scanned
else
return COMPACT_PARTIAL_SKIPPED;// only part of the zone was scanned
}
if (cc->proactive_compaction) {// proactive compaction run by kcompactd
int score, wmark_low;
pg_data_t *pgdat;
pgdat = cc->zone->zone_pgdat;
if (kswapd_is_running(pgdat))// kswapd is running as well
return COMPACT_PARTIAL_SKIPPED;// stop and report a partial scan
score = fragmentation_score_zone(cc->zone);// fragmentation score of the zone
wmark_low = fragmentation_score_wmark(pgdat, true);// low-watermark fragmentation score
if (score > wmark_low)// the zone's score is still above the low-watermark score
ret = COMPACT_CONTINUE;// keep going
else
ret = COMPACT_SUCCESS;// fragmentation is low enough
goto out;
}
if (is_via_compact_memory(cc->order))// compaction was triggered by the administrator
return COMPACT_CONTINUE;// keep compacting
// migrate_pfn is not pageblock-aligned, so the current pageblock is not finished
if (!IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
return COMPACT_CONTINUE;// keep going
ret = COMPACT_NO_SUITABLE_PAGE;
// walk the zone's free_area lists and check for a large enough free block
for (order = cc->order; order < MAX_ORDER; order++) {
struct free_area *area = &cc->zone->free_area[order];
bool can_steal;
// the requested migratetype has a large enough free block
if (!free_area_empty(area, migratetype))
return COMPACT_SUCCESS;// compaction succeeded
#ifdef CONFIG_CMA
if (migratetype == MIGRATE_MOVABLE && // the request is for movable pages
!free_area_empty(area, MIGRATE_CMA))// and the CMA list has a large enough free block
return COMPACT_SUCCESS;// compaction succeeded
#endif
// a fallback migratetype has a large enough free block
if (find_suitable_fallback(area, order, migratetype,
true, &can_steal) != -1) {
if (migratetype == MIGRATE_MOVABLE)// the request is for movable pages
return COMPACT_SUCCESS;// compaction succeeded
if (cc->mode == MIGRATE_ASYNC || // compaction runs in async mode
IS_ALIGNED(cc->migrate_pfn, // or the current pageblock is fully processed
pageblock_nr_pages)) {
return COMPACT_SUCCESS;// compaction succeeded
}
ret = COMPACT_CONTINUE;// the current pageblock is not finished, keep compacting
break;
}
}
out:
// lock contention or a pending reschedule, or the task has a fatal signal pending
if (cc->contended || fatal_signal_pending(current))
ret = COMPACT_CONTENDED;// stop compaction early
return ret;
}
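compact_finished works as follows:
- Call compact_scanners_met to check whether the migrate and free scanners have met. If they have, call reset_cached_positions to set where the next compaction starts, and return COMPACT_COMPLETE or COMPACT_PARTIAL_SKIPPED.
- For proactive compaction by kcompactd: if kswapd is running, return COMPACT_PARTIAL_SKIPPED; otherwise compare the zone's fragmentation score with the low-watermark score and return success or continue accordingly.
- If compaction was triggered by the administrator, return COMPACT_CONTINUE.
- If the migrate scanner is in the middle of a pageblock, return COMPACT_CONTINUE.
- Walk the zone's free_area lists, checking for a large enough free block:
- If the requested migratetype has a large enough free block, compaction succeeded; return COMPACT_SUCCESS.
- If the request is for movable pages and the CMA list has a large enough free block, compaction succeeded; return COMPACT_SUCCESS.
- If a fallback migratetype has a large enough free block: return COMPACT_SUCCESS if the request is for movable pages, or if compaction runs in async mode or the current pageblock is fully processed. Otherwise the current pageblock is not finished, so return COMPACT_CONTINUE.
- On lock contention, a pending reschedule, or a pending fatal signal, stop compaction early and return COMPACT_CONTENDED.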
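The free_area walk boils down to "is there any free block of at least the requested order". A minimal sketch, assuming a hypothetical array `nr_free` that holds, for one migratetype, the number of free blocks per order, and assuming 11 orders as the limit:

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11   /* assumed number of orders tracked */

/*
 * nr_free[o] is the number of free blocks of order o for the wanted
 * migratetype. A block of any order >= order can satisfy the request,
 * because the buddy allocator can split a larger block.
 */
bool model_has_free_block(const unsigned long nr_free[MODEL_MAX_ORDER],
                          unsigned int order)
{
    for (unsigned int o = order; o < MODEL_MAX_ORDER; o++) {
        if (nr_free[o] > 0)
            return true;
    }
    return false;
}
```

The CMA and fallback checks in the listing repeat the same test against other free lists of the same free_area.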
1.3 isolate_migratepages
static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
unsigned long block_start_pfn;
unsigned long block_end_pfn;
unsigned long low_pfn;
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
bool fast_find_block;
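low_pfn = fast_find_migrateblock(cc);// pfn to resume scanning from (the cached position, or a block picked by the fast search)
block_start_pfn = pageblock_start_pfn(low_pfn);// first page of the pageblock that contains low_pfn
if (block_start_pfn < cc->zone->zone_start_pfn)
block_start_pfn = cc->zone->zone_start_pfn;
// true when the fast search picked this block; the skip-hint check below is then bypassed
fast_find_block = low_pfn != cc->migrate_pfn && !cc->fast_search_fail;
/* Only scan within a pageblock boundary */
block_end_pfn = pageblock_end_pfn(low_pfn);// first page of the pageblock after the one that contains low_pfn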
// walk pageblock by pageblock until reaching the free scanner
for (; block_end_pfn <= cc->free_pfn;
fast_find_block = false,
low_pfn = block_end_pfn,
block_start_pfn = block_end_pfn,
block_end_pfn += pageblock_nr_pages) {
// every 32 pageblocks
if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
cond_resched();// let the scheduler run another task if needed
// check that the first and last page of the pageblock both belong to this zone
page = pageblock_pfn_to_page(block_start_pfn,
block_end_pfn, cc->zone);
if (!page)// they do not
continue;// next pageblock
if (IS_ALIGNED(low_pfn, pageblock_nr_pages) && // at a pageblock boundary
!fast_find_block && !isolation_suitable(cc, page))// and the block was marked skip after an earlier failed isolation
continue;// skip this pageblock
if (!suitable_migration_source(cc, page)) {// the pageblock is not a suitable migration source
update_cached_migrate(cc, block_end_pfn);
continue;// next pageblock
}
// isolate the movable pages of this pageblock onto cc->migratepages
low_pfn = isolate_migratepages_block(cc, low_pfn,
block_end_pfn, isolate_mode);
if (!low_pfn)// isolation was aborted
return ISOLATE_ABORT;// return abort
/*
* Either we isolated something and proceed with migration. Or
* we failed and compact_zone should decide if we should
* continue or not.
*/
break;// leave the loop even on success; the loop in compact_zone will call us again
}
// remember where the migrate scanner stopped so the next call resumes from here
cc->migrate_pfn = low_pfn;
return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}
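isolate_migratepages works as follows:
- Call fast_find_migrateblock to get the page frame number to resume scanning from, and compute the start and end of the pageblock that contains low_pfn.
- Walk pageblock by pageblock until meeting the free scanner, and in the loop body:
- Every 32 pageblocks, call cond_resched to let the scheduler run another task if needed.
- Call pageblock_pfn_to_page to check that the first and last page of the pageblock both belong to this zone; if not, move to the next pageblock.
- Call suitable_migration_source to check whether the pageblock is a suitable migration source; if not, move to the next pageblock.
- Otherwise the pageblock is suitable: call isolate_migratepages_block to isolate its movable pages onto cc's migrate list. If isolation is aborted, return ISOLATE_ABORT.
- Leave the loop even when isolation succeeded, because the outer loop in compact_zone will call this function again.
- Record where the migrate scanner stopped so the next call resumes from there. Return ISOLATE_SUCCESS if any pages were isolated, otherwise ISOLATE_NONE.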
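The pageblock arithmetic behind the loop header is easy to check in isolation. The sketch below is a user-space model with hypothetical `model_*` names; it assumes a pageblock order of 9 (512 pages), which is common but configuration-dependent.

```c
/* Assumed pageblock size: order 9, i.e. 512 pages. */
#define MODEL_PAGEBLOCK_ORDER    9UL
#define MODEL_PAGEBLOCK_NR_PAGES (1UL << MODEL_PAGEBLOCK_ORDER)

/* First pfn of the pageblock that contains pfn. */
unsigned long model_block_start(unsigned long pfn)
{
    return pfn & ~(MODEL_PAGEBLOCK_NR_PAGES - 1);
}

/* First pfn of the pageblock after the one that contains pfn. */
unsigned long model_block_end(unsigned long pfn)
{
    return (pfn + MODEL_PAGEBLOCK_NR_PAGES) & ~(MODEL_PAGEBLOCK_NR_PAGES - 1);
}

/*
 * Count the pageblocks the migrate scanner may visit when it resumes at
 * low_pfn and the free scanner sits at free_pfn. The condition matches
 * the loop header above: a block is visited only if it ends at or
 * before free_pfn.
 */
unsigned long model_blocks_to_scan(unsigned long low_pfn, unsigned long free_pfn)
{
    unsigned long count = 0;
    unsigned long end;

    for (end = model_block_end(low_pfn); end <= free_pfn;
         end += MODEL_PAGEBLOCK_NR_PAGES)
        count++;
    return count;
}
```

Resuming at pfn 1000 with the free scanner at pfn 2048 leaves three pageblocks to visit: the rest of the block starting at 512, then the blocks starting at 1024 and 1536.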
Next, let's see how isolate_migratepages_block isolates the movable pages of one pageblock:
static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
{
pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
bool locked = false;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
unsigned long next_skip_pfn = 0;
bool skip_updated = false;
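while (unlikely(too_many_isolated(pgdat))) {// too many pages are isolated already
if (cc->nr_migratepages)// pages are already isolated and waiting to be migrated
return 0;// stop isolating
if (cc->mode == MIGRATE_ASYNC)// async migration
return 0;// stop isolating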
congestion_wait(BLK_RW_ASYNC, HZ/10);
if (fatal_signal_pending(current))
return 0;
}
cond_resched();
// direct compaction in async mode
if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
skip_on_failure = true;// on failure, skip the rest of the order-aligned block
next_skip_pfn = block_end_pfn(low_pfn, cc->order);// end of the current order-aligned block
}
// walk every page frame
for (; low_pfn < end_pfn; low_pfn++) {
// skip-on-failure is enabled and the end of the order-aligned block was reached
if (skip_on_failure && low_pfn >= next_skip_pfn) {
if (nr_isolated)// some pages have been isolated
break;// leave the loop
next_skip_pfn = block_end_pfn(low_pfn, cc->order);// move on to the end of the next order-aligned block
}
if (!(low_pfn % SWAP_CLUSTER_MAX)
&& compact_unlock_should_abort(&pgdat->lru_lock,
flags, &locked, cc)) {// drop the lock periodically; abort on a fatal signal
low_pfn = 0;
goto fatal_pending;// abort
}
if (!pfn_valid_within(low_pfn))// the page frame is not valid
goto isolate_fail;// treat it as an isolation failure
nr_scanned++;
page = pfn_to_page(low_pfn);// page for this pfn
if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {// no valid page recorded yet and the pfn is pageblock-aligned
if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {// the pageblock is marked skip
low_pfn = end_pfn;
goto isolate_abort;// abort this block
}
valid_page = page;
}
if (PageBuddy(page)) {// the page is in the buddy allocator, so it is free: skip it
unsigned long freepage_order = buddy_order_unsafe(page);
if (freepage_order > 0 && freepage_order < MAX_ORDER)
low_pfn += (1UL << freepage_order) - 1;
continue;
}
// compound page, for example a transparent huge page or a hugetlbfs huge page
if (PageCompound(page) && !cc->alloc_contig) {
const unsigned int order = compound_order(page);
if (likely(order < MAX_ORDER))
low_pfn += (1UL << order) - 1;
goto isolate_fail;// isolation failed
}
if (!PageLRU(page)) {// not on an LRU list
if (unlikely(__PageMovable(page)) && // a non-LRU movable page
!PageIsolated(page)) { // that has not been isolated yet
if (locked) {
spin_unlock_irqrestore(&pgdat->lru_lock,
flags);
locked = false;
}
// isolate the page with isolate_movable_page
if (!isolate_movable_page(page, isolate_mode))
goto isolate_success;// isolated
}
goto isolate_fail;// isolation failed
}
if (!page_mapping(page) && // anonymous page
page_count(page) > page_mapcount(page)) // with more references than mappings
goto isolate_fail;// something in the kernel holds an extra reference, so migration would fail
// file page, but the caller does not allow calls into the filesystem
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;// cannot isolate
if (!locked) {// the lock is not held
locked = compact_lock_irqsave(&pgdat->lru_lock,
&flags, cc);// take the lock
/* Try get exclusive access under lock */
if (!skip_updated) {
skip_updated = true;
// set the pageblock skip bit; abort if it was already set
if (test_and_set_skip(cc, page, low_pfn))
goto isolate_abort;// abort
}
if (!PageLRU(page))// recheck under the lock that the page is on an LRU list
goto isolate_fail;
// recheck under the lock whether the page is a compound page
if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
low_pfn += compound_nr(page) - 1;// skip the whole compound page
goto isolate_fail;
}
}
lruvec = mem_cgroup_page_lruvec(page, pgdat);
// isolate the page with __isolate_lru_page
if (__isolate_lru_page(page, isolate_mode) != 0)
goto isolate_fail;
/* The whole page is taken off the LRU; skip the tail pages. */
if (PageCompound(page))// compound page
low_pfn += compound_nr(page) - 1;// skip the whole compound page
// isolation succeeded, remove the page from its LRU list
del_page_from_lru_list(page, lruvec, page_lru(page));
mod_node_page_state(page_pgdat(page),
NR_ISOLATED_ANON + page_is_file_lru(page),
thp_nr_pages(page));
isolate_success:
// add the page to the migrate scanner's list of pages to migrate
list_add(&page->lru, &cc->migratepages);
cc->nr_migratepages += compound_nr(page);
nr_isolated += compound_nr(page);
// 32 pages have been isolated
if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
!cc->rescan && !cc->contended) {
++low_pfn;
break;// stop isolating
}
continue;
isolate_fail:
if (!skip_on_failure)
continue;
/*
* We have isolated some pages, but then failed. Release them
* instead of migrating, as we cannot form the cc->order buddy
* page anyway.
*/
if (nr_isolated) {
if (locked) {
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
locked = false;
}
// put the pages on the migrate list back where they came from
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
nr_isolated = 0;
}
if (low_pfn < next_skip_pfn) {
low_pfn = next_skip_pfn - 1;
/*
* The check near the loop beginning would have updated
* next_skip_pfn too, but this is a bit simpler.
*/
next_skip_pfn += 1UL << cc->order;
}
}
/*
* The PageBuddy() check could have potentially brought us outside
* the range to be scanned.
*/
if (unlikely(low_pfn > end_pfn))
low_pfn = end_pfn;
isolate_abort:
if (locked)
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
/*
* Updated the cached scanner pfn once the pageblock has been scanned
* Pages will either be migrated in which case there is no point
* scanning in the near future or migration failed in which case the
* failure reason may persist. The block is marked for skipping if
* there were no pages isolated in the block or if the block is
* rescanned twice in a row.
*/
if (low_pfn == end_pfn && (!nr_isolated || cc->rescan)) {
if (valid_page && !skip_updated)
set_pageblock_skip(valid_page);
update_cached_migrate(cc, low_pfn);
}
trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
nr_scanned, nr_isolated);
fatal_pending:
cc->total_migrate_scanned += nr_scanned;
if (nr_isolated)
count_compact_events(COMPACTISOLATED, nr_isolated);
return low_pfn;
}
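isolate_migratepages_block works as follows:
- If too many pages are isolated: return 0 when pages are already waiting to be migrated, when migration is async, or when the task has a fatal signal pending; otherwise wait briefly and check again.
- If this is direct compaction in async mode, enable skip-on-failure and record where the current order-aligned block ends.
- Walk every page frame of the block in a for loop and, for each one:
- If skip-on-failure is enabled, the end of the order-aligned block has been reached, and some pages have been isolated, leave the loop.
- If the page frame is not valid, treat it as an isolation failure.
- If the page belongs to the buddy allocator, it is free, so skip it.
- If the page is a compound page, for example a transparent huge page or a hugetlbfs huge page, treat it as an isolation failure.
- If the page is a non-LRU movable page that has not been isolated yet, call isolate_movable_page to isolate it.
- If the page is an anonymous page whose reference count exceeds its map count, something in the kernel is using it, so isolation fails.
- If the page is a file page but the caller does not allow calls into the filesystem, it cannot be isolated.
- Call __isolate_lru_page to isolate the page. On success, call del_page_from_lru_list to remove the page from its LRU list; on failure, take the failure path.
- Add the page to the migrate scanner's list of pages to migrate.
- If 32 pages have been isolated, leave the loop; otherwise continue with the next page frame.
- Failure path with skip-on-failure enabled: call putback_movable_pages to put back the pages isolated so far, then skip to the end of the current order-aligned block.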
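The core of the loop (jump over free buddy blocks, ignore pages that cannot be isolated, collect LRU pages up to a batch limit) can be shown on a toy array. This is a sketch under stated assumptions: `model_page` and `model_isolate_block` are hypothetical, locking, skip hints, compound pages and the skip-on-failure path are all left out, and the batch limit of 32 is taken from the listing.

```c
#include <stddef.h>

/* Hypothetical page kinds for the model. */
enum model_kind { MODEL_BUDDY, MODEL_LRU, MODEL_PINNED };

struct model_page {
    enum model_kind kind;
    unsigned int order;   /* meaningful only on the first page of a free block */
};

#define MODEL_CLUSTER_MAX 32   /* batch limit, as in the listing */

/*
 * Scan pages[low..end) and record the indices of isolated pages in out,
 * which must have room for MODEL_CLUSTER_MAX entries. Stores the number
 * of isolated pages in *nr_out and returns the index where scanning stopped.
 */
size_t model_isolate_block(const struct model_page *pages, size_t low,
                           size_t end, size_t *out, size_t *nr_out)
{
    size_t nr = 0;

    for (; low < end; low++) {
        const struct model_page *p = &pages[low];

        if (p->kind == MODEL_BUDDY) {
            /* Free block: jump over all of its pages at once. */
            low += ((size_t)1 << p->order) - 1;
            continue;
        }
        if (p->kind != MODEL_LRU)
            continue;            /* cannot be isolated, try the next page */

        out[nr++] = low;         /* "isolate" the page */
        if (nr >= MODEL_CLUSTER_MAX) {
            low++;               /* resume after this page next time */
            break;
        }
    }
    if (low > end)               /* a free block may reach past the range */
        low = end;
    *nr_out = nr;
    return low;
}
```

The final clamp corresponds to the check after the loop in the listing: skipping a free block can carry the scanner past the end of the range.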
1.4 migrate_pages
Example call:
migrate_pages(&cc->migratepages,compaction_alloc,compaction_free,(unsigned long)cc, cc->mode,MR_COMPACTION);
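Purpose: migrate the pages on the given list into free pages supplied as migration targets.
The first argument is the list of pages to migrate.
The second argument is the function that allocates a free page to serve as a migration target.
The third argument is the function that releases a target page that was allocated but not used.
The fourth argument is private data passed to the functions given as the second and third arguments.
The fifth argument is the migration mode, which sets the constraints on migration.
The sixth argument is the reason for the migration.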
int migrate_pages(struct list_head *from, new_page_t get_new_page,
free_page_t put_new_page, unsigned long private,
enum migrate_mode mode, int reason)
{
int retry = 1;
int thp_retry = 1;
int nr_failed = 0;
int nr_succeeded = 0;
int nr_thp_succeeded = 0;
int nr_thp_failed = 0;
int nr_thp_split = 0;
int pass = 0;
bool is_thp = false;
struct page *page;
struct page *page2;
int swapwrite = current->flags & PF_SWAPWRITE;
int rc, nr_subpages;
if (!swapwrite)
current->flags |= PF_SWAPWRITE;
// give up after at most 10 passes
for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
retry = 0;
thp_retry = 0;
// walk every page on the list
list_for_each_entry_safe(page, page2, from, lru) {
retry:
/*
* THP statistics is based on the source huge page.
* Capture required information that might get lost
* during migration.
*/
is_thp = PageTransHuge(page) && !PageHuge(page);
nr_subpages = thp_nr_pages(page);
cond_resched();
if (PageHuge(page))// hugetlbfs huge page
// unmap it and move it to a free huge page
rc = unmap_and_move_huge_page(get_new_page,
put_new_page, private, page,
pass > 2, mode, reason);
else// normal page or transparent huge page
// unmap it and move it to a free page
rc = unmap_and_move(get_new_page, put_new_page,
private, page, pass > 2, mode,
reason);
switch(rc) {
case -ENOMEM:
/*
* THP migration might be unsupported or the
* allocation could've failed so we should
* retry on the same page with the THP split
* to base pages.
*
* Head page is retried immediately and tail
* pages are added to the tail of the list so
* we encounter them after the rest of the list
* is processed.
*/
if (is_thp) {
lock_page(page);
rc = split_huge_page_to_list(page, from);
unlock_page(page);
if (!rc) {
list_safe_reset_next(page, page2, lru);
nr_thp_split++;
goto retry;
}
nr_thp_failed++;
nr_failed += nr_subpages;
goto out;
}
nr_failed++;
goto out;
case -EAGAIN:
if (is_thp) {
thp_retry++;
break;
}
retry++;
break;
case MIGRATEPAGE_SUCCESS:
if (is_thp) {
nr_thp_succeeded++;
nr_succeeded += nr_subpages;
break;
}
nr_succeeded++;
break;
default:
/*
* Permanent failure (-EBUSY, -ENOSYS, etc.):
* unlike -EAGAIN case, the failed page is
* removed from migration page list and not
* retried in the next outer loop.
*/
if (is_thp) {
nr_thp_failed++;
nr_failed += nr_subpages;
break;
}
nr_failed++;
break;
}
}
}
nr_failed += retry + thp_retry;
nr_thp_failed += thp_retry;
rc = nr_failed;
out:
count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
count_vm_events(PGMIGRATE_FAIL, nr_failed);
count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
nr_thp_failed, nr_thp_split, mode, reason);
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;
return rc;
}
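migrate_pages works as follows:
- Loop for at most 10 passes, and in each pass:
- Use list_for_each_entry_safe to walk every page on the list given as the first argument, and for each page:
- If it is a hugetlbfs huge page, call unmap_and_move_huge_page to move it to a free page; if it is a normal page or a transparent huge page, call unmap_and_move.
- Handle the result: -ENOMEM ends the whole migration (after trying to split a transparent huge page), -EAGAIN marks the page for another try in the next pass, and any other error counts as a permanent failure.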
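The retry structure is worth seeing without the huge-page bookkeeping. The sketch below models it in user space: `model_migrate`, `model_move_fn` and the result codes are hypothetical, the list is an array of "still pending" flags, and the -ENOMEM early exit is not modelled. `model_move_example` exists only to exercise the loop.

```c
#include <stddef.h>

/* Hypothetical result codes for one migration attempt. */
enum { MODEL_OK = 0, MODEL_EAGAIN = 1, MODEL_EBUSY = 2 };

/* Callback that tries to move one page during a given pass. */
typedef int (*model_move_fn)(int page_id, int pass);

/*
 * pending[i] != 0 means page i is still on the list. Each pass retries
 * only pages that answered MODEL_EAGAIN; other errors are permanent.
 * Returns the number of pages that could not be migrated.
 */
int model_migrate(int *pending, size_t n, model_move_fn move)
{
    int nr_failed = 0;
    int retry = 1;

    for (int pass = 0; pass < 10 && retry; pass++) {
        retry = 0;
        for (size_t i = 0; i < n; i++) {
            if (!pending[i])
                continue;
            switch (move((int)i, pass)) {
            case MODEL_OK:
                pending[i] = 0;      /* migrated, off the list */
                break;
            case MODEL_EAGAIN:
                retry++;             /* stays on the list for the next pass */
                break;
            default:
                pending[i] = 0;      /* permanent failure, off the list */
                nr_failed++;
                break;
            }
        }
    }
    /* Pages still asking for a retry after the last pass count as failed. */
    return nr_failed + retry;
}

/* Example behaviour: page 0 moves at once, page 1 from pass 2 on,
 * page 2 fails permanently, every other page always asks for a retry. */
int model_move_example(int page_id, int pass)
{
    switch (page_id) {
    case 0:  return MODEL_OK;
    case 1:  return pass < 2 ? MODEL_EAGAIN : MODEL_OK;
    case 2:  return MODEL_EBUSY;
    default: return MODEL_EAGAIN;
    }
}
```

Pages that are still pending when the passes run out remain on the list, which is why the caller in compact_zone puts the leftover pages back after a failed migration.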
unmap_and_move_huge_page and unmap_and_move are much alike, so we only look at unmap_and_move, the more commonly used of the two:
static int unmap_and_move(new_page_t get_new_page,
free_page_t put_new_page,
unsigned long private, struct page *page,
int force, enum migrate_mode mode,
enum migrate_reason reason)
{
int rc = MIGRATEPAGE_SUCCESS;
struct page *newpage = NULL;
if (!thp_migration_supported() && PageTransHuge(page))
return -ENOMEM;
if (page_count(page) == 1) {
/* page was freed from under us. So we are done. */
ClearPageActive(page);
ClearPageUnevictable(page);
if (unlikely(__PageMovable(page))) {
lock_page(page);
if (!PageMovable(page))
__ClearPageIsolated(page);
unlock_page(page);
}
goto out;
}
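// take a free page from the free scanner's list of free pages;
// if that list is empty, the free scanner first scans for free pages and adds them to the list
newpage = get_new_page(page, private);
if (!newpage)
return -ENOMEM;
rc = __unmap_and_move(page, newpage, force, mode);// move the page to the free page
if (rc == MIGRATEPAGE_SUCCESS)// migration succeeded
set_page_owner_migrate_reason(newpage, reason);// record the reason for the migration
out:
if (rc != -EAGAIN) {// the page will not be retried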
/*
* A page that has been migrated has all references
* removed and will be freed. A page that has not been
* migrated will have kept its references and be restored.
*/
list_del(&page->lru);
/*
* Compaction can migrate also non-LRU pages which are
* not accounted to NR_ISOLATED_*. They can be recognized
* as __PageMovable
*/
if (likely(!__PageMovable(page)))// ordinary LRU page, not a non-LRU movable page
mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
page_is_file_lru(page), -thp_nr_pages(page));
}
/*
* If migration is successful, releases reference grabbed during
* isolation. Otherwise, restore the page to right list unless
* we want to retry.
*/
if (rc == MIGRATEPAGE_SUCCESS) {// migration succeeded
if (reason != MR_MEMORY_FAILURE)
/*
* We release the page in page_handle_poison.
*/
put_page(page);// drop the reference taken at isolation
} else {
if (rc != -EAGAIN) {// permanent failure, the page will not be retried
if (likely(!__PageMovable(page))) {// ordinary LRU page
putback_lru_page(page);
goto put_new;
}
lock_page(page);
if (PageMovable(page))
putback_movable_page(page);
else
__ClearPageIsolated(page);
unlock_page(page);
put_page(page);
}
put_new:
if (put_new_page)
put_new_page(newpage, private);
else
put_page(newpage);
}
return rc;
}
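unmap_and_move works as follows:
- Call get_new_page. This is a function pointer argument; for compaction it is compaction_alloc, which takes a free page from the free scanner's list of free pages. If that list is empty, the free scanner first scans for free pages and adds them to the list.
- Call __unmap_and_move to move the page to the free page.
- If migration succeeded, drop the reference on the old page.
- If migration failed, call put_new_page. This is also a function pointer argument; for compaction it is compaction_free, which puts the new page obtained in the first step back on the free scanner's list of free pages.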
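The callback arrangement (an allocate function, a free function, and an opaque private value that both receive) can be sketched as follows. Everything here is a hypothetical user-space model: pages are plain integers, `model_cc` stands in for compact_control, and `uintptr_t` replaces the `unsigned long private` argument of the listing to keep the pointer cast portable.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_FREE 8

/* Hypothetical stand-in for compact_control: a small stack of free pages. */
struct model_cc {
    int freepages[MODEL_MAX_FREE];   /* ids of isolated free target pages */
    size_t nr_freepages;
};

typedef int (*model_alloc_fn)(uintptr_t private_data);
typedef void (*model_free_fn)(int page, uintptr_t private_data);

/* Take one target page from the list; -1 means the list is empty. */
int model_compaction_alloc(uintptr_t private_data)
{
    struct model_cc *cc = (struct model_cc *)private_data;

    if (cc->nr_freepages == 0)
        return -1;
    return cc->freepages[--cc->nr_freepages];
}

/* Return an unused target page to the list. */
void model_compaction_free(int page, uintptr_t private_data)
{
    struct model_cc *cc = (struct model_cc *)private_data;

    if (cc->nr_freepages < MODEL_MAX_FREE)
        cc->freepages[cc->nr_freepages++] = page;
}

/*
 * Model of the flow described above: get a target, try the move, and
 * give the target back if the move failed. move_ok fakes the outcome
 * of the move. Returns 0 on success, -1 if no target page was
 * available, -2 if the move failed.
 */
int model_unmap_and_move(model_alloc_fn get_new, model_free_fn put_new,
                         uintptr_t private_data, int move_ok)
{
    int newpage = get_new(private_data);

    if (newpage < 0)
        return -1;
    if (move_ok)
        return 0;
    put_new(newpage, private_data);
    return -2;
}
```

A call looks like `model_unmap_and_move(model_compaction_alloc, model_compaction_free, (uintptr_t)&cc, 1)`. Passing the state through an opaque value is what lets the generic migration code stay unaware of compaction's data structures.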
Next we need to look at these functions: __unmap_and_move, compaction_alloc and compaction_free.
1.4.1 __unmap_and_move
static int __unmap_and_move(struct page *page, struct page *newpage,
int force, enum migrate_mode mode)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
struct anon_vma *anon_vma = NULL;
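bool is_lru = !__PageMovable(page);//__PageMovable() returning true means a non-LRU movable page
if (!trylock_page(page)) {//try to lock the old page; if that fails
if (!force || mode == MIGRATE_ASYNC)//not forced, or async mode: must not block on the lock
goto out; //give up; rc is still -EAGAIN
if (current->flags & PF_MEMALLOC)//the current task is already inside memory allocation
goto out;//bail out to avoid a deadlock
lock_page(page);//block until the lock is acquired
}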
if (PageWriteback(page)) {//the page is being written back to storage
switch (mode) {
case MIGRATE_SYNC:
case MIGRATE_SYNC_NO_COPY:
break;
default:
rc = -EBUSY;
goto out_unlock;
}
if (!force)//migration is not forced
goto out_unlock;//give up; rc is still -EAGAIN
wait_on_page_writeback(page);//wait for the writeback to finish
}
if (PageAnon(page) && !PageKsm(page))//anonymous page that is not a KSM page
anon_vma = page_get_anon_vma(page);//take a reference on the page's anon_vma
if (unlikely(!trylock_page(newpage)))//try to lock the new page
goto out_unlock;
if (unlikely(!is_lru)) {//non-LRU movable page
rc = move_to_new_page(newpage, page, mode);//move the page to the new page
goto out_unlock_both;
}
//from here on the page is an LRU page
if (!page->mapping) {//the page has no mapping
VM_BUG_ON_PAGE(PageAnon(page), page);
if (page_has_private(page)) {//private data present: the page still carries buffer heads
try_to_free_buffers(page);//free the buffer heads
goto out_unlock_both;
}
} else if (page_mapped(page)) {//the page is mapped in page tables
VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
page);
//replace every page-table entry that maps this page with a migration entry
try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
page_was_mapped = 1;//remember that the page was mapped
}
if (!page_mapped(page))//no mappings remain
rc = move_to_new_page(newpage, page, mode);//move the page to the new page
if (page_was_mapped)//the page was mapped before migration
remove_migration_ptes(page, //turn the migration entries back into normal page-table entries
rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
out_unlock_both:
unlock_page(newpage);//unlock the new page
out_unlock:
/* Drop an anon_vma reference if we took one */
if (anon_vma)
put_anon_vma(anon_vma);
unlock_page(page);//unlock the old page
out:
if (rc == MIGRATEPAGE_SUCCESS) {//migration succeeded
if (unlikely(!is_lru))//non-LRU movable page
put_page(newpage);//drop one reference on the new page
else//LRU page
putback_lru_page(newpage);//put the new page on the LRU list
}
}
return rc;
}
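__unmap_and_move works as follows:
- Lock the old page. If trylock_page fails, give up with -EAGAIN unless force is set and the mode is not MIGRATE_ASYNC. A task that already has PF_MEMALLOC set also gives up, to avoid a deadlock.
- If the page is being written back to storage: in asynchronous or light synchronous mode, return -EBUSY. Only in MIGRATE_SYNC or MIGRATE_SYNC_NO_COPY mode, and only when force is set, call wait_on_page_writeback to wait for the writeback to finish.
- If the page is an anonymous page but not a KSM page, take a reference on its anon_vma.
- If it is a non-LRU movable page, call move_to_new_page to move it to the new page.
- If it is an LRU page with no mapping: when it has private data (buffer heads), call try_to_free_buffers to free them and return.
- If it is an LRU page that is mapped: call try_to_unmap, which replaces the page-table entries that map this physical page with migration entries. If no mapping remains, call move_to_new_page to move the page to the new page.
- If the page was mapped before, call remove_migration_ptes to turn the migration entries back into normal page-table entries. They point to the new page if migration succeeded and to the old page otherwise.
- If migration succeeds: for a non-LRU movable page, drop one reference on the new page; for an LRU page, call putback_lru_page to put the new page on the LRU list.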
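The writeback branch is the part of this flow that is easiest to misread, so here is a small decision function that mirrors the switch in the listing. It is a sketch under assumptions: the wb_* names are hypothetical and only model the migrate modes, and the function returns a label for what __unmap_and_move would do rather than doing it.

```c
/* Hypothetical stand-ins for the four migrate modes. */
enum wb_mode { WB_ASYNC, WB_SYNC_LIGHT, WB_SYNC, WB_SYNC_NO_COPY };

/* What the writeback check in __unmap_and_move() leads to. */
enum wb_action {
    WB_PROCEED,       /* page not under writeback, carry on */
    WB_WAIT,          /* wait for writeback, then carry on */
    WB_FAIL_EBUSY,    /* mode does not allow waiting */
    WB_FAIL_EAGAIN    /* mode allows waiting but force is not set */
};

static enum wb_action wb_decide(int under_writeback, enum wb_mode mode, int force)
{
    if (!under_writeback)
        return WB_PROCEED;
    /* Only the two fully synchronous modes get past the switch. */
    if (mode != WB_SYNC && mode != WB_SYNC_NO_COPY)
        return WB_FAIL_EBUSY;
    /* rc still holds its initial -EAGAIN when force is not set. */
    if (!force)
        return WB_FAIL_EAGAIN;
    return WB_WAIT;
}
```

For example, wb_decide(1, WB_SYNC_LIGHT, 1) yields WB_FAIL_EBUSY: light synchronous mode never waits for writeback, even when force is set.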
1.4.1.1 move_to_new_page
static int move_to_new_page(struct page *newpage, struct page *page,
enum migrate_mode mode)
{
struct address_space *mapping;
int rc = -EAGAIN;
bool is_lru = !__PageMovable(page);//true for an LRU page
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
mapping = page_mapping(page);//get the page's address_space
if (likely(is_lru)) {//LRU page
if (!mapping)//no address_space, e.g. an anonymous page
rc = migrate_page(mapping, newpage, page, mode);//migrate a single LRU page
else if (mapping->a_ops->migratepage)//the mapping provides a migratepage method
/*
* Most pages have a mapping and most filesystems
* provide a migratepage callback. Anonymous pages
* are part of swap space which also has its own
* migratepage callback. This is the most common path
* for page migration.
*/
rc = mapping->a_ops->migratepage(mapping, newpage,
page, mode);//call the mapping's migratepage method
else
rc = fallback_migrate_page(mapping, newpage,
page, mode);//default page migration
} else {//non-LRU movable page
VM_BUG_ON_PAGE(!PageIsolated(page), page);
if (!PageMovable(page)) {//the page is no longer movable (its driver has released it)
rc = MIGRATEPAGE_SUCCESS;
__ClearPageIsolated(page);//clear the isolated flag
goto out;//nothing left to migrate
}
//the page is still movable: call the driver's migratepage callback
rc = mapping->a_ops->migratepage(mapping, newpage,
page, mode);
WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
!PageIsolated(page));
}
/*
* When successful, old pagecache page->mapping must be cleared before
* page is freed; but stats require that PageAnon be left as PageAnon.
*/
if (rc == MIGRATEPAGE_SUCCESS) {
if (__PageMovable(page)) {
VM_BUG_ON_PAGE(!PageIsolated(page), page);
/*
* We clear PG_movable under page_lock so any compactor
* cannot try to migrate this page.
*/
__ClearPageIsolated(page);
}
/*
* Anonymous and movable page->mapping will be cleared by
* free_pages_prepare so don't reset it here for keeping
* the type to work PageAnon, for example.
*/
if (!PageMappingFlags(page))
page->mapping = NULL;//clear the old page's mapping
if (likely(!is_zone_device_page(newpage))) {//the new page is not device memory
int i, nr = compound_nr(newpage);
for (i = 0; i < nr; i++)
flush_dcache_page(newpage + i);//flush the data cache for the new page
}
}
out:
return rc;
}
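move_to_new_page works as follows:
- LRU page with no address_space (an anonymous page): call migrate_page to migrate the single LRU page.
- LRU page with an address_space that provides a migratepage callback: call that callback.
- LRU page with an address_space that provides no migratepage callback: call fallback_migrate_page.
- Non-LRU page that is no longer movable (PageMovable returns false, meaning its driver has released it): clear the isolated flag and return success.
- Non-LRU page that is still movable: call the migratepage callback of its mapping.
- If migration succeeds, clear the old page's mapping (unless page->mapping carries mapping flags, as it does for anonymous and movable pages). If the new page is not device memory, also flush the data cache for the new page.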
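The dispatch above reduces to a small decision tree. The following sketch encodes it as a pure function so the five paths are easy to compare. All mp_* names are hypothetical, and the struct fields are simplified stand-ins for the checks in the listing (__PageMovable, page_mapping, a_ops->migratepage, PageMovable).

```c
/* Which routine move_to_new_page() would end up using. */
enum mp_path {
    MP_MIGRATE_PAGE,      /* LRU page without an address_space */
    MP_FS_CALLBACK,       /* LRU page, mapping has migratepage */
    MP_FALLBACK,          /* LRU page, mapping lacks migratepage */
    MP_ALREADY_RELEASED,  /* non-LRU page the driver gave up */
    MP_DRIVER_CALLBACK    /* non-LRU page that is still movable */
};

/* Simplified, hypothetical view of the properties that are tested. */
struct mp_desc {
    int is_lru;           /* !__PageMovable(page) */
    int has_mapping;      /* page_mapping(page) != NULL */
    int has_migratepage;  /* mapping->a_ops->migratepage != NULL */
    int still_movable;    /* PageMovable(page) */
};

static enum mp_path mp_pick_path(const struct mp_desc *d)
{
    if (d->is_lru) {
        if (!d->has_mapping)
            return MP_MIGRATE_PAGE;
        if (d->has_migratepage)
            return MP_FS_CALLBACK;
        return MP_FALLBACK;
    }
    if (!d->still_movable)
        return MP_ALREADY_RELEASED;
    return MP_DRIVER_CALLBACK;
}
```

Note that the non-LRU branch treats "no longer movable" as success: there is nothing left to copy, so the caller only has to clear the isolated flag.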
1.4.1.1.1 migrate_page
int migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page,
enum migrate_mode mode)
{
int rc;
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
//replace the page in its mapping; this mainly updates struct page fields
rc = migrate_page_move_mapping(mapping, newpage, page, 0);
if (rc != MIGRATEPAGE_SUCCESS)
return rc;
if (mode != MIGRATE_SYNC_NO_COPY)
migrate_page_copy(newpage, page);//copy the contents and state of page to newpage
else
migrate_page_states(newpage, page);//copy only the state of page to newpage
return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);
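migrate_page works as follows:
- Call migrate_page_move_mapping to replace the page in its mapping; this mainly updates struct page fields.
- If the migration mode is not MIGRATE_SYNC_NO_COPY, call migrate_page_copy to copy the contents of page to newpage.
- If the migration mode is MIGRATE_SYNC_NO_COPY, call migrate_page_states to copy only the state of page to newpage.
migrate_page_move_mapping mainly carries struct page members over to the new page, including index, mapping and some flags such as dirty. migrate_page_copy is worth a look: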
void migrate_page_copy(struct page *newpage, struct page *page)
{
if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);//copy the contents of the huge page to newpage
else
copy_highpage(newpage, page);//copy the contents of page to newpage
migrate_page_states(newpage, page);//copy the state of page to newpage
}
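migrate_page_copy checks the page type and calls copy_huge_page or copy_highpage to copy the contents of page to newpage, then calls migrate_page_states to carry over the state. copy_huge_page also ends up calling copy_highpage, and copy_highpage simply maps the two pages and copies one page of data, which is essentially a plain memory copy.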
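The difference between the two branches of migrate_page can be shown with a toy frame that has a data area and a flags word. This is a user-space sketch, not the kernel routine: toy_frame, TOY_PAGE_SIZE and both functions are hypothetical, and memcpy stands in for the per-page copy.

```c
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical model of a page: payload plus a state word. */
struct toy_frame {
    unsigned char data[TOY_PAGE_SIZE];
    unsigned int flags;
};

/* Counterpart of migrate_page_states(): state only, no payload. */
static void toy_copy_states(struct toy_frame *dst, const struct toy_frame *src)
{
    dst->flags = src->flags;
}

/*
 * Counterpart of the branch in migrate_page(): with no_copy set (the
 * MIGRATE_SYNC_NO_COPY case) only the state is carried over, because
 * something else is expected to move the payload.
 */
static void toy_migrate_copy(struct toy_frame *dst, const struct toy_frame *src,
                             int no_copy)
{
    if (!no_copy)
        memcpy(dst->data, src->data, TOY_PAGE_SIZE);
    toy_copy_states(dst, src);
}
```

Either way the state is transferred; the mode only decides whether the CPU copies the payload as well.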
1.4.1.1.2 fallback_migrate_page
static int fallback_migrate_page(struct address_space *mapping,
struct page *newpage, struct page *page, enum migrate_mode mode)
{
if (PageDirty(page)) {//the page is dirty
switch (mode) {
case MIGRATE_SYNC:
case MIGRATE_SYNC_NO_COPY:
break;
default:
return -EBUSY;
}
return writeout(mapping, page);//write the page back to clear its dirty state
}
if (page_has_private(page) && //the page has private data
!try_to_release_page(page, GFP_KERNEL))//and the private data cannot be released
return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;//return an error
return migrate_page(mapping, newpage, page, mode);//migrate the single page
}
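fallback_migrate_page works as follows:
- If the page is dirty and the mode is MIGRATE_SYNC or MIGRATE_SYNC_NO_COPY, call writeout to write the page back and clear its dirty state, then return the result of writeout.
- If the page is dirty and the mode is anything else, return -EBUSY.
- If the page has private data that cannot be released, return an error: -EAGAIN in MIGRATE_SYNC mode, -EBUSY otherwise.
- Otherwise call migrate_page to migrate the single page.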
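The ordering of these checks matters, since the dirty test comes before the private-data test. A compact way to see that is to encode the same branches as a pure function. The fb_* names below are hypothetical and the function only reports which branch would be taken.

```c
/* Hypothetical stand-ins for the four migrate modes. */
enum fb_mode { FB_ASYNC, FB_SYNC_LIGHT, FB_SYNC, FB_SYNC_NO_COPY };

/* Branch taken by fallback_migrate_page(). */
enum fb_result { FB_DO_MIGRATE, FB_DO_WRITEOUT, FB_EBUSY, FB_EAGAIN };

/* Simplified view of the page properties that are tested. */
struct fb_page {
    int dirty;               /* PageDirty(page) */
    int has_private;         /* page_has_private(page) */
    int private_releasable;  /* try_to_release_page() would succeed */
};

static enum fb_result fb_decide(const struct fb_page *p, enum fb_mode mode)
{
    if (p->dirty) {
        /* Only the fully synchronous modes may start writeback. */
        if (mode != FB_SYNC && mode != FB_SYNC_NO_COPY)
            return FB_EBUSY;
        return FB_DO_WRITEOUT;
    }
    if (p->has_private && !p->private_releasable)
        return mode == FB_SYNC ? FB_EAGAIN : FB_EBUSY;
    return FB_DO_MIGRATE;
}
```

A dirty page therefore never reaches migrate_page in the same call: it is either written out or rejected with -EBUSY.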
1.4.1.2 remove_migration_ptes
void remove_migration_ptes(struct page *old, struct page *new, bool locked)
{
struct rmap_walk_control rwc = {
//rmap_one is called once for every vma found by the walk
.rmap_one = remove_migration_pte,//restore a potential migration pte to a working pte entry
.arg = old,//passed as the last argument of remove_migration_pte
};
if (locked)
rmap_walk_locked(new, &rwc);//walk the vmas through the reverse mapping (lock already held)
else
rmap_walk(new, &rwc);//walk the vmas through the reverse mapping
}
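remove_migration_ptes mainly initializes an rmap_walk_control structure and then calls rmap_walk (or rmap_walk_locked) to walk, through the reverse mapping, every vma that maps the page. rmap_walk: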
void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
{
if (unlikely(PageKsm(page)))
rmap_walk_ksm(page, rwc);
else if (PageAnon(page))
rmap_walk_anon(page, rwc, false);
else
rmap_walk_file(page, rwc, false);
}
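For a KSM page it calls rmap_walk_ksm; for an anonymous page it calls rmap_walk_anon; anything else is a file page, so it calls rmap_walk_file.
These functions are much alike. The main difference is that file pages and anonymous pages keep track of their vmas differently, so the vmas have to be walked in different ways. All of them end up calling the rmap_one member of rwc, which here is remove_migration_pte. I do not fully understand that function yet; I only know its effect: the vma used to map the old page, and after this function runs it maps the new page.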
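The effect described above can be modelled with a toy page table. The sketch assumes a flat array of entries and integer page ids, and every toy_* name is hypothetical. The first function plays the role of try_to_unmap with TTU_MIGRATION, the second the role of remove_migration_ptes.

```c
#include <stddef.h>

enum toy_pte_kind { TOY_PTE_NONE, TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

/* Hypothetical page-table entry: a kind plus the page it refers to. */
struct toy_pte {
    enum toy_pte_kind kind;
    int page_id;
};

/* Replace every present entry for old_id with a migration entry. */
static int toy_unmap_for_migration(struct toy_pte *ptes, size_t n, int old_id)
{
    int replaced = 0;

    for (size_t i = 0; i < n; i++) {
        if (ptes[i].kind == TOY_PTE_PRESENT && ptes[i].page_id == old_id) {
            ptes[i].kind = TOY_PTE_MIGRATION;
            replaced++;
        }
    }
    return replaced;
}

/*
 * Turn the migration entries for old_id back into present entries that
 * point at target_id: the new page on success, the old page on failure.
 */
static int toy_remove_migration_ptes(struct toy_pte *ptes, size_t n,
                                     int old_id, int target_id)
{
    int restored = 0;

    for (size_t i = 0; i < n; i++) {
        if (ptes[i].kind == TOY_PTE_MIGRATION && ptes[i].page_id == old_id) {
            ptes[i].kind = TOY_PTE_PRESENT;
            ptes[i].page_id = target_id;
            restored++;
        }
    }
    return restored;
}
```

The real code finds the entries through the reverse mapping rather than scanning an array, but the two-phase shape is the same: entries are parked as migration entries while the page moves and are pointed at the final page afterwards.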
1.4.2 compaction_alloc
static struct page *compaction_alloc(struct page *migratepage,
unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
struct page *freepage;
if (list_empty(&cc->freepages)) {
isolate_freepages(cc);
if (list_empty(&cc->freepages))
return NULL;
}
freepage = list_entry(cc->freepages.next, struct page, lru);
list_del(&freepage->lru);
cc->nr_freepages--;
return freepage;
}
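compaction_alloc works as follows:
- If the scanner's free list is empty, call isolate_freepages to scan and move free pages from the buddy system onto the scanner's free list. If the list is still empty after the scan, return NULL.
- Take one free page off the scanner's free list and return it.
isolate_freepages is much like isolate_migratepages seen earlier, with only two differences: it isolates free pages rather than migratable pages, and it scans in the opposite direction. That is all, so I will not describe it further here.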
1.4.3 compaction_free
static void compaction_free(struct page *page, unsigned long data)
{
struct compact_control *cc = (struct compact_control *)data;
list_add(&page->lru, &cc->freepages);
cc->nr_freepages++;
}
compaction_free is simple: it adds the page back to the cc->freepages list through page->lru and increments cc->nr_freepages.
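Taken together, compaction_alloc and compaction_free behave like a free list that refills itself on demand. The sketch below models that with a singly linked list. Everything in it is hypothetical: toy_cc only borrows the field names freepages and nr_freepages, and toy_isolate_freepages is a stand-in that moves at most two pages per call from a reserve list instead of scanning a zone.

```c
#include <stddef.h>

struct toy_free_page {
    struct toy_free_page *next;
};

/* Hypothetical, much reduced counterpart of struct compact_control. */
struct toy_cc {
    struct toy_free_page *freepages;  /* scanner's free list */
    unsigned long nr_freepages;       /* length of that list */
    struct toy_free_page *reserve;    /* pages a scan can still find */
};

/* Stand-in for isolate_freepages(): refill the list from the reserve. */
static void toy_isolate_freepages(struct toy_cc *cc)
{
    for (int i = 0; i < 2 && cc->reserve; i++) {
        struct toy_free_page *p = cc->reserve;

        cc->reserve = p->next;
        p->next = cc->freepages;
        cc->freepages = p;
        cc->nr_freepages++;
    }
}

static struct toy_free_page *toy_compaction_alloc(struct toy_cc *cc)
{
    struct toy_free_page *p;

    if (!cc->freepages) {
        toy_isolate_freepages(cc);
        if (!cc->freepages)
            return NULL;              /* the scan found nothing */
    }
    p = cc->freepages;
    cc->freepages = p->next;
    p->next = NULL;
    cc->nr_freepages--;
    return p;
}

static void toy_compaction_free(struct toy_free_page *p, struct toy_cc *cc)
{
    p->next = cc->freepages;          /* push back onto the free list */
    cc->freepages = p;
    cc->nr_freepages++;
}
```

Keeping the counter in step with the list on both paths is the one invariant to preserve; the kernel functions do the same with cc->nr_freepages.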