Contents

  • 1. __alloc_pages_direct_compact
  • 1.1 compaction_suitable
  • 1.2 compact_finished
  • 1.3 isolate_migratepages
  • 1.4 migrate_pages
  • 1.4.1 __unmap_and_move
  • 1.4.1.1 move_to_new_page
  • 1.4.1.1.1 migrate_page
  • 1.4.1.1.2 fallback_migrate_page
  • 1.4.1.2 remove_migration_ptes
  • 1.4.2 compaction_alloc
  • 1.4.3 compaction_free



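The basic idea of memory compaction (the term is sometimes rendered literally as "memory compression", but "memory defragmentation" is closer to what it does) is this: scan for allocated movable pages from the bottom of a zone, scan for free pages from the top of the zone, and move the movable pages at the bottom into the free pages at the top, so that contiguous free pages form at the bottom.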


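The memory compaction algorithm works as follows:

  1. First, scan from the bottom of the zone towards the top, one pageblock at a time; inside each pageblock, scan from the first page to the last and collect the movable pages of that pageblock on a list.
  2. Then, scan from the top of the zone towards the bottom, again one pageblock at a time; inside each pageblock, also scan from the first page to the last and collect the free pages on a list.
  3. Finally, copy the data of the movable pages at the bottom into the free pages at the top and update the process page tables so that the virtual pages map to the new physical pages.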
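
The three steps above can be sketched as a toy user-space model. This is only an illustration under simplifying assumptions: memory is a flat array, each element is one page, and the names `toy_page` and `toy_compact` are made up for this sketch and are not kernel symbols. Real compaction works per pageblock and batches its work through lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for this toy model (not kernel types). */
enum toy_page { TOY_FREE, TOY_MOVABLE, TOY_UNMOVABLE };

/*
 * Move movable pages found by the bottom-up "migration scanner" into free
 * slots found by the top-down "free scanner". Returns the number of pages
 * moved. Scanning stops when the two scanners meet.
 */
static size_t toy_compact(enum toy_page *mem, size_t n)
{
	size_t migrate = 0;	/* migration scanner: starts at the bottom */
	size_t free_pos = n;	/* free scanner: starts just past the top */
	size_t moved = 0;

	assert(mem != NULL);

	while (migrate < free_pos) {
		if (mem[migrate] != TOY_MOVABLE) {
			migrate++;
			continue;
		}
		/* Look downwards from the top for a free slot. */
		while (free_pos > migrate + 1 && mem[free_pos - 1] != TOY_FREE)
			free_pos--;
		if (free_pos <= migrate + 1)
			break;	/* scanners met: nothing left to do */

		mem[free_pos - 1] = TOY_MOVABLE;	/* "copy" the page upwards */
		mem[migrate] = TOY_FREE;		/* the low slot becomes free */
		free_pos--;
		migrate++;
		moved++;
	}
	return moved;
}
```

Unmovable pages stay where they are, which is why a single unmovable page in the middle of a range can still prevent a large contiguous block from forming.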

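Memory compaction has 3 priorities, listed here from highest to lowest:

  1. COMPACT_PRIO_SYNC_FULL: fully synchronous mode. Blocking is allowed, and dirty file pages may be written back to the storage device and waited on until writeback completes.
  2. COMPACT_PRIO_SYNC_LIGHT: light synchronous mode. Most operations may block, but dirty file pages may not be written back to the storage device.
  3. COMPACT_PRIO_ASYNC: asynchronous mode. Blocking is not allowed.

Fully synchronous mode is the most expensive, light synchronous mode comes next, and asynchronous mode is the cheapest. In the slow path of memory allocation, if the conditions are met, the first call to __alloc_pages_direct_compact uses the cheapest asynchronous mode. If that first attempt does not satisfy the request, later calls can run compaction in a synchronous mode. The priority is carried in the compact_priority variable; see the slow-path code for the details.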
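
As a rough sketch of how a priority turns into a migration mode, the model below copies the selection made by the `.mode` initializer in compact_zone_order() and adds a hypothetical `toy_escalate` helper that steps to the next higher priority. The `toy_` names are invented for this example; only the ordering (a smaller value is a higher priority) is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the ordering described above: a smaller value is a higher priority. */
enum toy_prio { TOY_PRIO_SYNC_FULL, TOY_PRIO_SYNC_LIGHT, TOY_PRIO_ASYNC };
enum toy_mode { TOY_MIGRATE_ASYNC, TOY_MIGRATE_SYNC_LIGHT };

/* Same selection as the .mode initializer in compact_zone_order(). */
static enum toy_mode toy_mode_for(enum toy_prio prio)
{
	return prio == TOY_PRIO_ASYNC ? TOY_MIGRATE_ASYNC : TOY_MIGRATE_SYNC_LIGHT;
}

/* Hypothetical helper: step to the next higher priority; false when at the top. */
static bool toy_escalate(enum toy_prio *prio)
{
	assert(prio != NULL);
	if (*prio == TOY_PRIO_SYNC_FULL)
		return false;
	*prio = (enum toy_prio)(*prio - 1);
	return true;
}
```

In this version of the code both synchronous priorities map to MIGRATE_SYNC_LIGHT; what the highest priority adds is a whole-zone scan that ignores the skip hints.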

1. __alloc_pages_direct_compact

Let us now look at how __alloc_pages_direct_compact carries out memory compaction:

static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, enum compact_result *compact_result)
{
	struct page *page = NULL;
	unsigned long pflags;
	unsigned int noreclaim_flag;

	if (!order)
		return NULL;

	psi_memstall_enter(&pflags);//no-op unless CONFIG_PSI is enabled
	//set PF_MEMALLOC in current->flags while compaction is in progress
	noreclaim_flag = memalloc_noreclaim_save();

	//try to compact pages
	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
								prio, &page);

	memalloc_noreclaim_restore(noreclaim_flag);//restore current->flags, compaction is over
	psi_memstall_leave(&pflags);//no-op unless CONFIG_PSI is enabled

	count_vm_event(COMPACTSTALL);//count one direct-compaction stall

	if (page)//compaction captured a page
		//prepare the page for use (clear its flags and so on)
		prep_new_page(page, order, gfp_mask, alloc_flags);

	if (!page)//compaction did not capture a page
		//try to allocate from the buddy system's free lists
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (page) {//a page was obtained
		struct zone *zone = page_zone(page);//zone the page belongs to

		zone->compact_blockskip_flush = false;
		compaction_defer_reset(zone, order, true);//reset the deferral state
		count_vm_event(COMPACTSUCCESS);//count one successful compaction
		return page;
	}

	/*
	 * It's bad if compaction run occurs and fails. The most likely reason
	 * is that pages exist, but not enough to satisfy watermarks.
	 */
	count_vm_event(COMPACTFAIL);//count one failed compaction

	cond_resched();//yield the CPU if a reschedule is pending

	return NULL;
}

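__alloc_pages_direct_compact performs the following steps:

  1. Call try_to_compact_pages directly to try to compact pages.
  2. If compaction captured a page, call prep_new_page to prepare it (clear its flags and so on), then go to step 4.
  3. If compaction did not capture a page, call get_page_from_freelist to try to allocate from the buddy system.
  4. If step 2 or step 3 obtained a page, call compaction_defer_reset to reset the zone's compaction deferral state and return the page.
  5. If no page was obtained, call cond_resched, which yields the CPU if a reschedule is pending, and return NULL so that the slow path can continue.

Only try_to_compact_pages deserves a closer look here; the rest is minor: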

enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
		unsigned int alloc_flags, const struct alloc_context *ac,
		enum compact_priority prio, struct page **capture)
{
	int may_perform_io = gfp_mask & __GFP_IO;
	struct zoneref *z;
	struct zone *zone;
	enum compact_result rc = COMPACT_SKIPPED;

	if (!may_perform_io)//the caller does not allow I/O to storage
		return COMPACT_SKIPPED;//do not compact

	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);

	//walk every zone in the zonelist
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->highest_zoneidx, ac->nodemask) {
		enum compact_result status;

		if (prio > MIN_COMPACT_PRIORITY//priority is not fully synchronous
					&& compaction_deferred(zone, order)) {//and compaction of this zone should be deferred
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;//skip this zone and move on to the next one
		}

		//compact this zone
		status = compact_zone_order(zone, order, gfp_mask, prio,
				alloc_flags, ac->highest_zoneidx, capture);
		rc = max(status, rc);

		/* The allocation should succeed, stop compacting */
		if (status == COMPACT_SUCCESS) {//compaction succeeded
			compaction_defer_reset(zone, order, false);//update the deferral state

			break;//leave the loop and stop compacting
		}

		//not async, and the scan finished (fully or partially) without success
		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
					status == COMPACT_PARTIAL_SKIPPED))
			defer_compaction(zone, order);//record that compaction should be deferred

		//async and a reschedule is needed, or a fatal signal is pending
		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
					|| fatal_signal_pending(current))
			break;//stop compacting
	}

	return rc;
}

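try_to_compact_pages works as follows:

  1. Use the macro for_each_zone_zonelist_nodemask to walk every zone in the zonelist and do the following for each:
  2. If the priority is not fully synchronous and compaction of this zone should be deferred, skip the zone and go on to the next one.
  3. Call compact_zone_order to compact this zone.
  4. If compaction succeeded, call compaction_defer_reset to update the zone's deferral state, leave the loop, and return success.
  5. If compaction did not succeed: when the priority is not asynchronous and the scan finished with COMPACT_COMPLETE or COMPACT_PARTIAL_SKIPPED, call defer_compaction(zone, order) to record that compaction should be deferred.
  6. If the priority is asynchronous and the process needs to be rescheduled, or the process has a fatal signal pending, leave the loop and return the best result seen so far.

compaction_deferred, compaction_defer_reset and defer_compaction read or update the zone members compact_considered and compact_defer_shift to decide, or to record, whether compaction should be deferred.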
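
To make the role of those two members concrete, here is a simplified user-space model of the back-off. It is a sketch, not the kernel source: the `toy_` names are invented, the cap of 6 on the shift is an assumption of this example, and the per-order bookkeeping the real functions also do is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEFER_SHIFT 6	/* assumed cap: at most 1 << 6 attempts per window */

struct toy_zone {
	unsigned int compact_considered;	/* attempts seen since the last failure */
	unsigned int compact_defer_shift;	/* window size is 1 << shift attempts */
};

/* After a failed compaction: restart the count and double the back-off window. */
static void toy_defer_compaction(struct toy_zone *z)
{
	assert(z != NULL);
	z->compact_considered = 0;
	if (z->compact_defer_shift < TOY_MAX_DEFER_SHIFT)
		z->compact_defer_shift++;
}

/* Returns true while the zone should still be skipped. */
static bool toy_compaction_deferred(struct toy_zone *z)
{
	unsigned int limit;

	assert(z != NULL);
	limit = 1U << z->compact_defer_shift;
	if (++z->compact_considered >= limit) {
		z->compact_considered = limit;	/* clamp so the counter cannot overflow */
		return false;
	}
	return true;
}

/* After a successful allocation: forget the back-off state. */
static void toy_defer_reset(struct toy_zone *z)
{
	assert(z != NULL);
	z->compact_considered = 0;
	z->compact_defer_shift = 0;
}
```

Each failure doubles the number of attempts in the window, so a zone that keeps failing is tried less and less often until an allocation succeeds and the state is reset.
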
Next, compact_zone_order:

static enum compact_result compact_zone_order(struct zone *zone, int order,
		gfp_t gfp_mask, enum compact_priority prio,
		unsigned int alloc_flags, int highest_zoneidx,
		struct page **capture)
{
	enum compact_result ret;
	struct compact_control cc = {
		.order = order,	//order of the requested allocation
		.search_order = order,	//order at which the search for a suitable block starts
		.gfp_mask = gfp_mask,
		.zone = zone,	//zone to compact
		.mode = (prio == COMPACT_PRIO_ASYNC) ?
					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
		.alloc_flags = alloc_flags,
		.highest_zoneidx = highest_zoneidx,
		.direct_compaction = true,	//this is direct compaction
		.whole_zone = (prio == MIN_COMPACT_PRIORITY),	//whether to scan the whole zone
		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
	};
	struct capture_control capc = {
		.cc = &cc,
		.page = NULL,
	};

	/*
	 * Make sure the structs are really initialized before we expose the
	 * capture control, in case we are interrupted and the interrupt handler
	 * frees a page.
	 */
	barrier();//compiler barrier
	WRITE_ONCE(current->capture_control, &capc);

	ret = compact_zone(&cc, &capc);//compact as described by this compact_control

	VM_BUG_ON(!list_empty(&cc.freepages));
	VM_BUG_ON(!list_empty(&cc.migratepages));

	/*
	 * Make sure we hide capture control first before we read the captured
	 * page pointer, otherwise an interrupt could free and capture a page
	 * and we would leak it.
	 */
	WRITE_ONCE(current->capture_control, NULL);
	*capture = READ_ONCE(capc.page);

	return ret;
}

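compact_zone_order works as follows:

  1. Declare and initialize a compact_control and a capture_control, then publish the latter through current->capture_control.
  2. Call compact_zone to do the compaction.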

compact_zone:

static enum compact_result
compact_zone(struct compact_control *cc, struct capture_control *capc)
{
	enum compact_result ret;
	unsigned long start_pfn = cc->zone->zone_start_pfn;//first pfn of the zone
	unsigned long end_pfn = zone_end_pfn(cc->zone);//end pfn of the zone
	unsigned long last_migrated_pfn;
	const bool sync = cc->mode != MIGRATE_ASYNC;
	bool update_cached;

	/*
	 * These counters track activities during zone compaction.  Initialize
	 * them before compacting a new zone.
	 */
	cc->total_migrate_scanned = 0;
	cc->total_free_scanned = 0;
	cc->nr_migratepages = 0;
	cc->nr_freepages = 0;
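	INIT_LIST_HEAD(&cc->freepages);//list of free pages used as migration targets
	INIT_LIST_HEAD(&cc->migratepages);//list of pages waiting to be migrated

	cc->migratetype = gfp_migratetype(cc->gfp_mask);
	//check the zone's watermarks to decide whether compaction is worthwhile; COMPACT_CONTINUE means go ahead
	ret = compaction_suitable(cc->zone, cc->order, cc->alloc_flags,
							cc->highest_zoneidx);
	//COMPACT_SUCCESS (not needed) and COMPACT_SKIPPED (not possible) both mean: do not compact
	if (ret == COMPACT_SUCCESS || ret == COMPACT_SKIPPED)
		return ret;//return that result unchanged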

	/* huh, compaction_suitable is returning something unexpected */
	VM_BUG_ON(ret != COMPACT_CONTINUE);

	if (compaction_restarting(cc->zone, cc->order))//restarting after repeated failures
		__reset_isolation_suitable(cc->zone);//clear the hints left behind by earlier runs

	/*
	 * Setup to move all movable pages to the end of the zone. Used cached
	 * information on where the scanners should start (unless we explicitly
	 * want to compact the whole zone), but check that it is initialised
	 * by ensuring the values are within zone boundaries.
	 */
	cc->fast_start_pfn = 0;
	if (cc->whole_zone) {//compact the whole zone
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
	} else {
		//resume both scanners from the positions cached in the zone
		cc->migrate_pfn = cc->zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = cc->zone->compact_cached_free_pfn;
		//reset free_pfn if it lies outside the zone
		if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
			cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
			cc->zone->compact_cached_free_pfn = cc->free_pfn;
		}
		//reset migrate_pfn if it lies outside the zone
		if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
			cc->migrate_pfn = start_pfn;
			cc->zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
			cc->zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
		}

		if (cc->migrate_pfn <= cc->zone->compact_init_migrate_pfn)
			cc->whole_zone = true;
	}

	last_migrated_pfn = 0;

	/*
	 * Migrate has separate cached PFNs for ASYNC and SYNC* migration on
	 * the basis that some migrations will fail in ASYNC mode. However,
	 * if the cached PFNs match and pageblocks are skipped due to having
	 * no isolation candidates, then the sync state does not matter.
	 * Until a pageblock with isolation candidates is found, keep the
	 * cached PFNs in sync to avoid revisiting the same blocks.
	 */
	update_cached = !sync &&
		cc->zone->compact_cached_migrate_pfn[0] == cc->zone->compact_cached_migrate_pfn[1];

	trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync);

	migrate_prep_local();

	//compact_finished checks the termination conditions:
	//the migration and free scanners have met, or a large enough free block exists
	while ((ret = compact_finished(cc)) == COMPACT_CONTINUE) {
		int err;
		unsigned long start_pfn = cc->migrate_pfn;

		/*
		 * Avoid multiple rescans which can happen if a page cannot be
		 * isolated (dirty/writeback in async mode) or if the migrated
		 * pages are being allocated before the pageblock is cleared.
		 * The first rescan will capture the entire pageblock for
		 * migration. If it fails, it'll be marked skip and scanning
		 * will proceed as normal.
		 */
		cc->rescan = false;
		if (pageblock_start_pfn(last_migrated_pfn) ==
		    pageblock_start_pfn(start_pfn)) {
			cc->rescan = true;
		}

		//scan for pages suitable for migration and isolate them
		switch (isolate_migratepages(cc)) {
		case ISOLATE_ABORT://isolation aborted, stop compacting
			ret = COMPACT_CONTENDED;
			putback_movable_pages(&cc->migratepages);//put the pages back where they came from (e.g. the LRU lists)
			cc->nr_migratepages = 0;
			goto out;
		case ISOLATE_NONE://nothing could be isolated
			if (update_cached) {
				cc->zone->compact_cached_migrate_pfn[1] =
					cc->zone->compact_cached_migrate_pfn[0];
			}
			goto check_drain;
		case ISOLATE_SUCCESS://isolation succeeded, go on to migrate
			update_cached = false;
			last_migrated_pfn = start_pfn;
			;
		}

		//core of page migration: take pages off cc->migratepages and try to migrate them
		err = migrate_pages(&cc->migratepages, compaction_alloc,
				compaction_free, (unsigned long)cc, cc->mode,
				MR_COMPACTION);

		trace_mm_compaction_migratepages(cc->nr_migratepages, err,
							&cc->migratepages);

		/* All pages were either migrated or will be released */
		cc->nr_migratepages = 0;
		if (err) {//some pages were not migrated
			putback_movable_pages(&cc->migratepages);//put them back where they came from
			//out of target pages although the scanners have not met yet
			if (err == -ENOMEM && !compact_scanners_met(cc)) {
				ret = COMPACT_CONTENDED;
				goto out;//return COMPACT_CONTENDED
			}
			/*
			 * We failed to migrate at least one page in the current
			 * order-aligned block, so skip the rest of it.
			 */
			if (cc->direct_compaction &&
						(cc->mode == MIGRATE_ASYNC)) {
				cc->migrate_pfn = block_end_pfn(
						cc->migrate_pfn - 1, cc->order);
				/* Draining pcplists is useless in this case */
				last_migrated_pfn = 0;
			}
		}

check_drain:
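		if (cc->order > 0 && last_migrated_pfn) {//last_migrated_pfn was set to the scan start pfn when isolation succeeded
			unsigned long current_block_start =
				block_start_pfn(cc->migrate_pfn, cc->order);//start of the order-aligned block the scanner is in now

			if (last_migrated_pfn < current_block_start) {
				lru_add_drain_cpu_zone(cc->zone);//return the pages held on per-CPU lists to the buddy system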
				/* No more flushing until we migrate again */
				last_migrated_pfn = 0;
			}
		}

		/* Stop if a page has been captured */
		if (capc && capc->page) {//a page has been captured
			ret = COMPACT_SUCCESS;
			break;//leave the scan-and-migrate loop
		}
	}

out:
	/*
	 * Release free pages and update where the free scanner should restart,
	 * so we don't leave any returned pages behind in the next attempt.
	 */
	if (cc->nr_freepages > 0) {
		unsigned long free_pfn = release_freepages(&cc->freepages);//give the unused free pages back

		cc->nr_freepages = 0;//no isolated free pages remain
		VM_BUG_ON(free_pfn == 0);
		/* The cached pfn is always the first in a pageblock */
		free_pfn = pageblock_start_pfn(free_pfn);
		/*
		 * Only go back, not forward. The cached pfn might have been
		 * already reset to zone end in compact_finished()
		 */
		if (free_pfn > cc->zone->compact_cached_free_pfn)
			cc->zone->compact_cached_free_pfn = free_pfn;
	}

	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);

	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
				cc->free_pfn, end_pfn, sync, ret);

	return ret;
}

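compact_zone works as follows:

  1. Call compaction_suitable to decide whether the zone is suitable for compaction.
  2. Set the starting pfns of the migration scanner and the free scanner.
  3. Call compact_finished to decide whether compaction is finished. If it is, go to step 10.
  4. Call isolate_migratepages to isolate movable pages and add them to the migration scanner's list of movable pages.
  5. If step 4 returns ISOLATE_ABORT, call putback_movable_pages to put the movable pages back, set the result to COMPACT_CONTENDED, and go to step 10.
  6. If step 4 returns ISOLATE_NONE, go to step 9.
  7. If step 4 returns ISOLATE_SUCCESS, call migrate_pages to move the movable pages into free pages at the top of the zone.
  8. If migration failed, call putback_movable_pages to put the remaining movable pages back.
  9. Once the scanner has moved past the block that was last migrated, drain the per-CPU lists. If a page has been captured, leave the loop; otherwise go back to step 3.
  10. Call release_freepages to release the pages left over on the free scanner's free list.

The following sections look at these functions in detail: compaction_suitable, compact_finished, isolate_migratepages and migrate_pages.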

1.1 compaction_suitable

enum compact_result compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx)
{
	enum compact_result ret;
	int fragindex;

	//core check for whether compaction is suitable
	ret = __compaction_suitable(zone, order, alloc_flags, highest_zoneidx,
				    zone_page_state(zone, NR_FREE_PAGES));

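	//suitable so far, but the request is above order 3 (PAGE_ALLOC_COSTLY_ORDER)
	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
		fragindex = fragmentation_index(zone, order);//compute the fragmentation index
		//the index lies within [0, external-fragmentation threshold]
		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
			ret = COMPACT_NOT_SUITABLE_ZONE;//the failure is due to low memory rather than fragmentation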
	}

	trace_mm_compaction_suitable(zone, order, ret);
	//the zone is not suitable
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;//report that compaction is skipped

	return ret;
}

compaction_suitable mostly relies on __compaction_suitable to decide whether compaction is suitable. __compaction_suitable:

static enum compact_result __compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int highest_zoneidx,
					unsigned long wmark_target)
{
	unsigned long watermark;

	//order == -1 means compaction was triggered by the administrator through /proc/sys/vm/compact_memory
	if (is_via_compact_memory(order))//order is -1
		return COMPACT_CONTINUE;//force compaction of this zone

	//watermark selected by the caller's alloc_flags
	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	//the watermark check passes for the requested order
	if (zone_watermark_ok(zone, order, watermark, highest_zoneidx,
								alloc_flags))
		return COMPACT_SUCCESS;//a large enough block is already free, no compaction needed

	
	//requests above order 3 use the low watermark, others the min watermark
	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	watermark += compact_gap(order);//add twice the request size as headroom
	//not enough free base pages to migrate into
	if (!__zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
						ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;//skip compaction

	return COMPACT_CONTINUE;//compaction is needed and possible
}

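__compaction_suitable works as follows:

  1. If the order argument is -1, compaction of the zone is forced: return COMPACT_CONTINUE.
  2. Take the watermark selected by the caller and call zone_watermark_ok for the requested order. If the check passes, a large enough block is already free and no compaction is needed: return COMPACT_SUCCESS.
  3. For requests above order 3 use the low watermark, otherwise the min watermark; add twice the request size to it and call __zone_watermark_ok with order 0. If the check fails, memory is too short even to hold the migration targets: return COMPACT_SKIPPED.
  4. Otherwise return COMPACT_CONTINUE.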
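
The threshold used in step 3 can be worked through with a small model. This is a sketch under stated assumptions: `toy_suitable` and its parameters are invented for the example, the comparison is reduced to "free pages above the watermark", and the extra adjustments that the real __zone_watermark_ok applies are ignored.

```c
#include <assert.h>

#define TOY_COSTLY_ORDER 3	/* stands in for PAGE_ALLOC_COSTLY_ORDER */

enum toy_result { TOY_SKIPPED, TOY_CONTINUE };

/* Twice the request size in pages, matching the source comment on compact_gap(). */
static unsigned long toy_compact_gap(unsigned int order)
{
	return 2UL << order;
}

static enum toy_result toy_suitable(unsigned long free_pages, unsigned int order,
				    unsigned long min_wmark, unsigned long low_wmark)
{
	unsigned long watermark;

	assert(low_wmark >= min_wmark);
	watermark = (order > TOY_COSTLY_ORDER) ? low_wmark : min_wmark;
	watermark += toy_compact_gap(order);

	/* Order-0 check only: are there enough free base pages to migrate into? */
	return free_pages > watermark ? TOY_CONTINUE : TOY_SKIPPED;
}
```

For example, with a min watermark of 100 pages, a low watermark of 125 pages and an order-4 request, the threshold is 125 + 32 = 157 pages, so compaction is attempted only when more than 157 pages are free.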

1.2 compact_finished

static enum compact_result compact_finished(struct compact_control *cc)
{
	int ret;

	ret = __compact_finished(cc);
	trace_mm_compaction_finished(cc->zone, cc->order, ret);
	if (ret == COMPACT_NO_SUITABLE_PAGE)
		ret = COMPACT_CONTINUE;

	return ret;
}

static enum compact_result __compact_finished(struct compact_control *cc)
{
	unsigned int order;
	const int migratetype = cc->migratetype;
	int ret;

	if (compact_scanners_met(cc)) {//the migration scanner and the free scanner have met
		reset_cached_positions(cc->zone);//the next compaction run starts from the beginning

		if (cc->direct_compaction)//this is direct compaction
			cc->zone->compact_blockskip_flush = true;

		if (cc->whole_zone)
			return COMPACT_COMPLETE;//the whole zone was scanned
		else
			return COMPACT_PARTIAL_SKIPPED;//only part of the zone was scanned
	}

	if (cc->proactive_compaction) {//proactive compaction driven by kcompactd
		int score, wmark_low;
		pg_data_t *pgdat;

		pgdat = cc->zone->zone_pgdat;
		if (kswapd_is_running(pgdat))//kswapd is running as well
			return COMPACT_PARTIAL_SKIPPED;//stop after a partial scan

		score = fragmentation_score_zone(cc->zone);//fragmentation score of the zone
		wmark_low = fragmentation_score_wmark(pgdat, true);//low fragmentation-score threshold

		if (score > wmark_low)//the zone's score is still above the threshold
			ret = COMPACT_CONTINUE;//keep going
		else
			ret = COMPACT_SUCCESS;//fragmentation is low enough

		goto out;//go to the common exit checks
	}

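	if (is_via_compact_memory(cc->order))//compaction was triggered by the administrator
		return COMPACT_CONTINUE;//keep compacting

	//migrate_pfn is not pageblock-aligned, so the current pageblock is not finished
	if (!IS_ALIGNED(cc->migrate_pfn, pageblock_nr_pages))
		return COMPACT_CONTINUE;//keep compacting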

	ret = COMPACT_NO_SUITABLE_PAGE;
	//walk the zone's free_area entries from the requested order upwards, looking for a large enough free block
	for (order = cc->order; order < MAX_ORDER; order++) {
		struct free_area *area = &cc->zone->free_area[order];
		bool can_steal;

		//the requested migrate type has a large enough free block
		if (!free_area_empty(area, migratetype))
			return COMPACT_SUCCESS;//compaction succeeded

#ifdef CONFIG_CMA
		if (migratetype == MIGRATE_MOVABLE &&	//the request is for movable pages
			!free_area_empty(area, MIGRATE_CMA))//and the CMA type has a large enough free block
			return COMPACT_SUCCESS;//compaction succeeded
#endif
		//a fallback migrate type has a large enough free block
		if (find_suitable_fallback(area, order, migratetype,
						true, &can_steal) != -1) {

			if (migratetype == MIGRATE_MOVABLE)//the request is for movable pages
				return COMPACT_SUCCESS;//compaction succeeded

			if (cc->mode == MIGRATE_ASYNC ||	//async mode, or
					IS_ALIGNED(cc->migrate_pfn,	//the current pageblock has been fully processed
							pageblock_nr_pages)) {
				return COMPACT_SUCCESS;//compaction succeeded
			}

			ret = COMPACT_CONTINUE;//the current pageblock is not finished, keep compacting
			break;
		}
	}

out:
	//lock contention or a pending reschedule was noticed, or a fatal signal is pending
	if (cc->contended || fatal_signal_pending(current))
		ret = COMPACT_CONTENDED;//end compaction early

	return ret;
}

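compact_finished works as follows:

  1. Call compact_scanners_met to check whether the migration scanner and the free scanner have met. If so, call reset_cached_positions to reset where the next compaction run starts, and return COMPACT_COMPLETE (whole zone) or COMPACT_PARTIAL_SKIPPED.
  2. For proactive compaction by kcompactd: if kswapd is running, return COMPACT_PARTIAL_SKIPPED; otherwise compare the zone's fragmentation score with the low threshold and return COMPACT_CONTINUE or COMPACT_SUCCESS accordingly.
  3. If compaction was triggered by the administrator, or the migration scanner is still in the middle of a pageblock, return COMPACT_CONTINUE.
  4. Walk the zone's free_area entries from the requested order upwards, checking for a large enough free block:
  5. If the requested migrate type has a large enough free block, compaction succeeded: return COMPACT_SUCCESS.
  6. If the request is for movable pages and the CMA type has a large enough free block, compaction succeeded: return COMPACT_SUCCESS.
  7. If a fallback migrate type has a large enough free block: return COMPACT_SUCCESS when the request is for movable pages, or when the mode is asynchronous or the current pageblock has been fully processed. Otherwise the current pageblock is not finished: return COMPACT_CONTINUE.
  8. If lock contention or a pending reschedule was noticed, or the process has a fatal signal pending, end compaction early: return COMPACT_CONTENDED.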
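
The core of steps 4 and 5 is a search through the free lists from the requested order upwards. A minimal model of that search follows; `toy_has_free_block` and the fixed value of 11 for the number of orders are assumptions of this sketch, and the CMA and fallback cases are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11	/* assumed value; the real one depends on configuration */

/* nr_free[o] counts free blocks of order o for the requested migrate type. */
static bool toy_has_free_block(const unsigned long nr_free[TOY_MAX_ORDER],
			       unsigned int order)
{
	unsigned int o;

	assert(order < TOY_MAX_ORDER);
	for (o = order; o < TOY_MAX_ORDER; o++)
		if (nr_free[o] > 0)
			return true;	/* a block of at least the requested order exists */
	return false;
}
```

Plenty of free order-0 pages do not count here: only a block at least as large as the request ends compaction with success.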

1.3 isolate_migratepages

static isolate_migrate_t isolate_migratepages(struct compact_control *cc)
{
	unsigned long block_start_pfn;
	unsigned long block_end_pfn;
	unsigned long low_pfn;
	struct page *page;
	const isolate_mode_t isolate_mode =
		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
	bool fast_find_block;

	low_pfn = fast_find_migrateblock(cc);//pfn at which to resume scanning (where the last scan stopped)
	block_start_pfn = pageblock_start_pfn(low_pfn);//first page of the pageblock containing low_pfn
	if (block_start_pfn < cc->zone->zone_start_pfn)
		block_start_pfn = cc->zone->zone_start_pfn;

	//true when the fast search picked a block; the skip-hint check below is then bypassed for it
	fast_find_block = low_pfn != cc->migrate_pfn && !cc->fast_search_fail;

	/* Only scan within a pageblock boundary */
	block_end_pfn = pageblock_end_pfn(low_pfn);//first page of the pageblock after the one containing low_pfn

	//walk the pageblocks until the free scanner is reached
	for (; block_end_pfn <= cc->free_pfn;
			fast_find_block = false,
			low_pfn = block_end_pfn,
			block_start_pfn = block_end_pfn,
			block_end_pfn += pageblock_nr_pages) {

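		//once every 32 pageblocks
		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages)))
			cond_resched();//check whether the process should be scheduled out

		//check that the first and last page of the pageblock both belong to this zone
		page = pageblock_pfn_to_page(block_start_pfn,
						block_end_pfn, cc->zone);
		if (!page)//they do not
			continue;//next pageblock

		if (IS_ALIGNED(low_pfn, pageblock_nr_pages) &&	//at a pageblock boundary,
		    !fast_find_block && !isolation_suitable(cc, page))//and the block is marked skip after an earlier failed isolation
			continue;//skip this pageblock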

		if (!suitable_migration_source(cc, page)) {//the pageblock is not a suitable migration source
			update_cached_migrate(cc, block_end_pfn);
			continue;//next pageblock
		}

		//isolate the movable pages of this pageblock onto cc->migratepages
		low_pfn = isolate_migratepages_block(cc, low_pfn,
						block_end_pfn, isolate_mode);

		if (!low_pfn)//isolation was aborted
			return ISOLATE_ABORT;//report the abort

		/*
		 * Either we isolated something and proceed with migration. Or
		 * we failed and compact_zone should decide if we should
		 * continue or not.
		 */
		break;//leave the loop even on success; compact_zone's outer loop will call us again
	}

	//remember where the migration scanner stopped so the next scan resumes here
	cc->migrate_pfn = low_pfn;

	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

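isolate_migratepages works as follows:

  1. Call fast_find_migrateblock to get the pfn at which to resume scanning, and compute the first page of the pageblock containing low_pfn and the first page of the following pageblock.
  2. Walk the pageblocks until the free scanner is reached, doing the following in the loop body:
  3. Once every 32 pageblocks, call cond_resched to check whether the process should be scheduled out.
  4. Call pageblock_pfn_to_page to check that the first and last page of the pageblock both belong to this zone; if not, go on to the next pageblock.
  5. Pass over a pageblock that is marked skip, then call suitable_migration_source to check whether the pageblock is a suitable migration source; if not, go on to the next pageblock.
  6. At this point the pageblock is suitable: call isolate_migratepages_block to isolate its movable pages onto cc's migration list. If isolation is aborted, return ISOLATE_ABORT.
  7. Leave the loop even when isolation succeeded, because compact_zone has an outer loop that will call this function again.
  8. Record the pfn at which the migration scanner stopped so the next scan resumes there. Return ISOLATE_SUCCESS if any pages are waiting to be migrated, otherwise ISOLATE_NONE.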
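
The pageblock arithmetic behind steps 1 and 2 is easy to check by hand. The sketch below assumes 512-page pageblocks (a pageblock order of 9), which is a common value but depends on the architecture and configuration; the `toy_` helpers are invented stand-ins for pageblock_start_pfn and pageblock_end_pfn.

```c
#include <assert.h>

#define TOY_PAGEBLOCK_ORDER 9UL	/* assumption: 512-page pageblocks */
#define TOY_PAGEBLOCK_NR_PAGES (1UL << TOY_PAGEBLOCK_ORDER)

/* First pfn of the pageblock that contains pfn. */
static unsigned long toy_pageblock_start_pfn(unsigned long pfn)
{
	return pfn & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

/* First pfn of the following pageblock (an exclusive end). */
static unsigned long toy_pageblock_end_pfn(unsigned long pfn)
{
	return toy_pageblock_start_pfn(pfn) + TOY_PAGEBLOCK_NR_PAGES;
}

/* Count the pageblocks the migration scanner may visit before meeting free_pfn. */
static unsigned long toy_blocks_to_scan(unsigned long low_pfn, unsigned long free_pfn)
{
	unsigned long end, n = 0;

	assert(low_pfn <= free_pfn);
	for (end = toy_pageblock_end_pfn(low_pfn); end <= free_pfn;
	     end += TOY_PAGEBLOCK_NR_PAGES)
		n++;
	return n;
}
```

The loop condition mirrors `block_end_pfn <= cc->free_pfn` in the listing: a pageblock is visited only if it ends at or below the free scanner's position.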

Next, let us see how isolate_migratepages_block isolates the movable pages inside one pageblock:

static unsigned long
isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
			unsigned long end_pfn, isolate_mode_t isolate_mode)
{
	pg_data_t *pgdat = cc->zone->zone_pgdat;
	unsigned long nr_scanned = 0, nr_isolated = 0;
	struct lruvec *lruvec;
	unsigned long flags = 0;
	bool locked = false;
	struct page *page = NULL, *valid_page = NULL;
	unsigned long start_pfn = low_pfn;
	bool skip_on_failure = false;
	unsigned long next_skip_pfn = 0;
	bool skip_updated = false;

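	while (unlikely(too_many_isolated(pgdat))) {//too many pages are already isolated
		if (cc->nr_migratepages)//some isolated pages are still waiting to be migrated
			return 0;//stop isolating

		if (cc->mode == MIGRATE_ASYNC)//async migration must not wait
			return 0;//stop isolating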

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}

	cond_resched();
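	//direct compaction in async mode
	if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC)) {
		skip_on_failure = true;//on failure, skip the rest of the order-aligned block
		next_skip_pfn = block_end_pfn(low_pfn, cc->order);//end of the current order-aligned block
	}

	//walk every physical page in the range
	for (; low_pfn < end_pfn; low_pfn++) {
		//skip-on-failure is on and the end of the order-aligned block has been reached
		if (skip_on_failure && low_pfn >= next_skip_pfn) {
			if (nr_isolated)//some pages have been isolated already
				break;//leave the loop

			next_skip_pfn = block_end_pfn(low_pfn, cc->order);//move on to the next order-aligned block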
		}

		if (!(low_pfn % SWAP_CLUSTER_MAX)
		    && compact_unlock_should_abort(&pgdat->lru_lock,
					    flags, &locked, cc)) {//periodically drop the lock; abort on a fatal signal
			low_pfn = 0;
			goto fatal_pending;//abort
		}

		if (!pfn_valid_within(low_pfn))//the pfn is not valid
			goto isolate_fail;//treat it as an isolation failure
		nr_scanned++;

		page = pfn_to_page(low_pfn);//struct page for this pfn

		if (!valid_page && IS_ALIGNED(low_pfn, pageblock_nr_pages)) {//no valid page recorded yet and the pfn is pageblock-aligned
			if (!cc->ignore_skip_hint && get_pageblock_skip(page)) {//the pageblock is marked skip
				low_pfn = end_pfn;
				goto isolate_abort;//give up on this pageblock
			}
			valid_page = page;
		}

		if (PageBuddy(page)) {//the page is in the buddy allocator, i.e. free, so skip it
			unsigned long freepage_order = buddy_order_unsafe(page);
			if (freepage_order > 0 && freepage_order < MAX_ORDER)
				low_pfn += (1UL << freepage_order) - 1;
			continue;
		}

		//compound page, e.g. a transparent huge page or a hugetlbfs huge page
		if (PageCompound(page) && !cc->alloc_contig) {
			const unsigned int order = compound_order(page);

			if (likely(order < MAX_ORDER))
				low_pfn += (1UL << order) - 1;
			goto isolate_fail;//cannot isolate it
		}

		if (!PageLRU(page)) {//not an LRU page
			if (unlikely(__PageMovable(page)) &&	//a non-LRU movable page
					!PageIsolated(page)) {	//that has not been isolated yet
				if (locked) {
					spin_unlock_irqrestore(&pgdat->lru_lock,
									flags);
					locked = false;
				}
				//isolate it with isolate_movable_page
				if (!isolate_movable_page(page, isolate_mode))
					goto isolate_success;//isolated
			}

			goto isolate_fail;//cannot isolate it
		}

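		if (!page_mapping(page) &&	//an anonymous page
		    page_count(page) > page_mapcount(page))	//whose reference count exceeds its map count
			goto isolate_fail;//something in the kernel holds an extra reference to it, so it cannot be isolated

		//a file page, but the caller does not allow calls into the filesystem
		if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
			goto isolate_fail;//cannot isolate it

		if (!locked) {//the lock is not held yet
			locked = compact_lock_irqsave(&pgdat->lru_lock,
								&flags, cc);//take the lock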

			/* Try get exclusive access under lock */
			if (!skip_updated) {
				skip_updated = true;
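				//set the pageblock's skip bit; if it was already set, give up on this block
				if (test_and_set_skip(cc, page, low_pfn))
					goto isolate_abort;//abort
			}

			if (!PageLRU(page))//recheck under the lock that the page is on an LRU list
				goto isolate_fail;

			//recheck under the lock whether the page is compound
			if (unlikely(PageCompound(page) && !cc->alloc_contig)) {
				low_pfn += compound_nr(page) - 1;//skip the whole compound page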
				goto isolate_fail;
			}
		}

		lruvec = mem_cgroup_page_lruvec(page, pgdat);

		//isolate the page with __isolate_lru_page
		if (__isolate_lru_page(page, isolate_mode) != 0)
			goto isolate_fail;

		/* The whole page is taken off the LRU; skip the tail pages. */
		if (PageCompound(page))//compound page
			low_pfn += compound_nr(page) - 1;//skip the whole compound page

		//isolation succeeded: take the page off its LRU list
		del_page_from_lru_list(page, lruvec, page_lru(page));
		mod_node_page_state(page_pgdat(page),
				NR_ISOLATED_ANON + page_is_file_lru(page),
				thp_nr_pages(page));

isolate_success:
		//add the page to the migration scanner's list of movable pages
		list_add(&page->lru, &cc->migratepages);
		cc->nr_migratepages += compound_nr(page);
		nr_isolated += compound_nr(page);

		//32 pages (COMPACT_CLUSTER_MAX) have been isolated
		if (cc->nr_migratepages >= COMPACT_CLUSTER_MAX &&
		    !cc->rescan && !cc->contended) {
			++low_pfn;
			break;//stop isolating
		}

		continue;
isolate_fail:
		if (!skip_on_failure)
			continue;

		/*
		 * We have isolated some pages, but then failed. Release them
		 * instead of migrating, as we cannot form the cc->order buddy
		 * page anyway.
		 */
		if (nr_isolated) {
			if (locked) {
				spin_unlock_irqrestore(&pgdat->lru_lock, flags);
				locked = false;
			}
			//put the pages on the migration list back where they came from
			putback_movable_pages(&cc->migratepages);
			cc->nr_migratepages = 0;
			nr_isolated = 0;
		}

		if (low_pfn < next_skip_pfn) {
			low_pfn = next_skip_pfn - 1;
			/*
			 * The check near the loop beginning would have updated
			 * next_skip_pfn too, but this is a bit simpler.
			 */
			next_skip_pfn += 1UL << cc->order;
		}
	}

	/*
	 * The PageBuddy() check could have potentially brought us outside
	 * the range to be scanned.
	 */
	if (unlikely(low_pfn > end_pfn))
		low_pfn = end_pfn;

isolate_abort:
	if (locked)
		spin_unlock_irqrestore(&pgdat->lru_lock, flags);

	/*
	 * Updated the cached scanner pfn once the pageblock has been scanned
	 * Pages will either be migrated in which case there is no point
	 * scanning in the near future or migration failed in which case the
	 * failure reason may persist. The block is marked for skipping if
	 * there were no pages isolated in the block or if the block is
	 * rescanned twice in a row.
	 */
	if (low_pfn == end_pfn && (!nr_isolated || cc->rescan)) {
		if (valid_page && !skip_updated)
			set_pageblock_skip(valid_page);
		update_cached_migrate(cc, low_pfn);
	}

	trace_mm_compaction_isolate_migratepages(start_pfn, low_pfn,
						nr_scanned, nr_isolated);

fatal_pending:
	cc->total_migrate_scanned += nr_scanned;
	if (nr_isolated)
		count_compact_events(COMPACTISOLATED, nr_isolated);

	return low_pfn;
}

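isolate_migratepages_block works as follows:

  1. While too many pages are isolated: return 0 if isolated pages are already waiting to be migrated, if the mode is asynchronous, or if the process has a fatal signal pending; otherwise wait briefly and check again.
  2. For direct compaction in asynchronous mode, turn on skip-on-failure and record the end of the current order-aligned block.
  3. Loop over every physical page in the range, doing the following:
  4. If skip-on-failure is on, the end of the order-aligned block has been reached, and some pages have been isolated, leave the loop.
  5. If the pfn is not valid, treat it as an isolation failure and move on.
  6. If the page is in the buddy allocator it is free, so skip it.
  7. If the page is a compound page, such as a transparent huge page or a hugetlbfs huge page, skip it as an isolation failure.
  8. If the page is a non-LRU movable page that has not been isolated, call isolate_movable_page to isolate it.
  9. If the page is anonymous and its reference count exceeds its map count, something in the kernel holds an extra reference to it, so it cannot be isolated.
  10. If the page is a file page but the caller does not allow calls into the filesystem, it cannot be isolated.
  11. Call __isolate_lru_page to isolate the page. On success, call del_page_from_lru_list to take it off its LRU list; on failure, go to the failure handling.
  12. Add the page to the migration scanner's list of movable pages.
  13. If 32 pages have been isolated, leave the loop; otherwise go back to step 3.
  14. Failure handling: without skip-on-failure, simply continue with the next page; with it, call putback_movable_pages to release the pages already on the migration list and skip to the end of the order-aligned block.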
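
The per-page filtering in steps 6 to 13 can be modelled with a small scan over an array of page descriptors. Everything here is a simplification made for illustration: `toy_page`, `toy_kind` and `toy_isolate_block` are invented, locking and skip-on-failure are left out, and a "pinned" page stands in for every page that fails the checks.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CLUSTER_MAX 32	/* batch size mentioned in the steps above */

enum toy_kind { TOY_BUDDY, TOY_COMPOUND, TOY_LRU, TOY_PINNED };

struct toy_page {
	enum toy_kind kind;
	unsigned int order;	/* meaningful for TOY_BUDDY and TOY_COMPOUND heads */
};

/*
 * Walk pfns in [low, end) and count the pages that would be isolated.
 * Stores the pfn at which scanning stopped in *stop.
 */
static size_t toy_isolate_block(const struct toy_page *pages, size_t low,
				size_t end, size_t *stop)
{
	size_t isolated = 0;

	assert(pages != NULL && stop != NULL && low <= end);
	for (; low < end; low++) {
		const struct toy_page *p = &pages[low];

		if (p->kind == TOY_BUDDY || p->kind == TOY_COMPOUND) {
			low += ((size_t)1 << p->order) - 1;	/* skip the whole block */
			continue;
		}
		if (p->kind != TOY_LRU)
			continue;	/* e.g. pinned: cannot be isolated */
		if (++isolated >= TOY_CLUSTER_MAX) {
			low++;
			break;	/* batch is full: go and migrate */
		}
	}
	*stop = low > end ? end : low;	/* a skipped block may overshoot the range */
	return isolated;
}
```

The final clamp corresponds to the `low_pfn > end_pfn` check after the loop in the listing, which exists because skipping a free buddy block can carry the scanner past the end of the range.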

1.4 migrate_pages

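Example call:

migrate_pages(&cc->migratepages,compaction_alloc,compaction_free,(unsigned long)cc, cc->mode,MR_COMPACTION);

What it does: migrates the pages on the given list to free pages supplied as migration targets.
The first argument is the list of pages to migrate.
The second argument is the function that allocates a free page to serve as the migration target.
The third argument is the function that frees a target page again when a migration does not succeed.
The fourth argument is private data passed to the second and third arguments.
The fifth argument is the migration mode, which sets the constraints on page migration.
The sixth argument is the reason for the migration.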
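
The allocate/free callback pair with a shared private value is a common C pattern, and a small model may make the argument list easier to read. The sketch below is not the kernel API: `toy_pool`, `toy_pool_alloc`, `toy_pool_free` and `toy_migrate_one` are invented, the private value is called `priv`, and passing a pointer through `unsigned long` assumes a platform where the two have the same width, as on Linux.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page { int data; };

/* Shapes modelled on the two callbacks: both receive the same private value. */
typedef struct toy_page *(*toy_new_page_t)(struct toy_page *src, unsigned long priv);
typedef void (*toy_free_page_t)(struct toy_page *dst, unsigned long priv);

/* Hypothetical private state: a small pool of target pages. */
struct toy_pool {
	struct toy_page slots[4];
	size_t used;
};

static struct toy_page *toy_pool_alloc(struct toy_page *src, unsigned long priv)
{
	struct toy_pool *pool = (struct toy_pool *)priv;

	(void)src;
	return pool->used < 4 ? &pool->slots[pool->used++] : NULL;
}

static void toy_pool_free(struct toy_page *dst, unsigned long priv)
{
	struct toy_pool *pool = (struct toy_pool *)priv;

	(void)dst;
	assert(pool->used > 0);
	pool->used--;	/* simplistic: treats the pool as a stack */
}

/* Copy src into a freshly obtained target; hand the target back on failure. */
static int toy_migrate_one(struct toy_page *src, toy_new_page_t get_new_page,
			   toy_free_page_t put_new_page, unsigned long priv,
			   int simulate_failure)
{
	struct toy_page *dst = get_new_page(src, priv);

	if (!dst)
		return -1;	/* stands in for -ENOMEM */
	if (simulate_failure) {
		put_new_page(dst, priv);
		return -2;	/* stands in for a permanent failure */
	}
	dst->data = src->data;
	return 0;
}
```

In the compaction case the private value is the compact_control itself, which is how compaction_alloc and compaction_free reach the free scanner's list.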

int migrate_pages(struct list_head *from, new_page_t get_new_page,
		free_page_t put_new_page, unsigned long private,
		enum migrate_mode mode, int reason)
{
	int retry = 1;
	int thp_retry = 1;
	int nr_failed = 0;
	int nr_succeeded = 0;
	int nr_thp_succeeded = 0;
	int nr_thp_failed = 0;
	int nr_thp_split = 0;
	int pass = 0;
	bool is_thp = false;
	struct page *page;
	struct page *page2;
	int swapwrite = current->flags & PF_SWAPWRITE;
	int rc, nr_subpages;

	if (!swapwrite)
		current->flags |= PF_SWAPWRITE;
	//make at most 10 passes over the list
	for (pass = 0; pass < 10 && (retry || thp_retry); pass++) {
		retry = 0;
		thp_retry = 0;
		//walk every page on the list
		list_for_each_entry_safe(page, page2, from, lru) {
retry:
			/*
			 * THP statistics is based on the source huge page.
			 * Capture required information that might get lost
			 * during migration.
			 */
			is_thp = PageTransHuge(page) && !PageHuge(page);
			nr_subpages = thp_nr_pages(page);
			cond_resched();

			if (PageHuge(page))//a hugetlbfs huge page
				//move the page to a free page
				rc = unmap_and_move_huge_page(get_new_page,
						put_new_page, private, page,
						pass > 2, mode, reason);
			else//an ordinary page or a transparent huge page
				//move the page to a free page
				rc = unmap_and_move(get_new_page, put_new_page,
						private, page, pass > 2, mode,
						reason);

			switch(rc) {
			case -ENOMEM:
				/*
				 * THP migration might be unsupported or the
				 * allocation could've failed so we should
				 * retry on the same page with the THP split
				 * to base pages.
				 *
				 * Head page is retried immediately and tail
				 * pages are added to the tail of the list so
				 * we encounter them after the rest of the list
				 * is processed.
				 */
				if (is_thp) {
					lock_page(page);
					rc = split_huge_page_to_list(page, from);
					unlock_page(page);
					if (!rc) {
						list_safe_reset_next(page, page2, lru);
						nr_thp_split++;
						goto retry;
					}

					nr_thp_failed++;
					nr_failed += nr_subpages;
					goto out;
				}
				nr_failed++;
				goto out;
			case -EAGAIN:
				if (is_thp) {
					thp_retry++;
					break;
				}
				retry++;
				break;
			case MIGRATEPAGE_SUCCESS:
				if (is_thp) {
					nr_thp_succeeded++;
					nr_succeeded += nr_subpages;
					break;
				}
				nr_succeeded++;
				break;
			default:
				/*
				 * Permanent failure (-EBUSY, -ENOSYS, etc.):
				 * unlike -EAGAIN case, the failed page is
				 * removed from migration page list and not
				 * retried in the next outer loop.
				 */
				if (is_thp) {
					nr_thp_failed++;
					nr_failed += nr_subpages;
					break;
				}
				nr_failed++;
				break;
			}
		}
	}
	nr_failed += retry + thp_retry;
	nr_thp_failed += thp_retry;
	rc = nr_failed;
out:
	count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
	count_vm_events(PGMIGRATE_FAIL, nr_failed);
	count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded);
	count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed);
	count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split);
	trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded,
			       nr_thp_failed, nr_thp_split, mode, reason);

	if (!swapwrite)
		current->flags &= ~PF_SWAPWRITE;

	return rc;
}

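migrate_pages works as follows:

  1. Loop for at most 10 passes, as long as some page still needs a retry, doing the following in the loop body:
  2. Use list_for_each_entry_safe to walk every page on the list given as the first argument, doing the following in the loop body:
  3. For a hugetlbfs huge page, call unmap_and_move_huge_page to move it to a free page; for an ordinary page or a transparent huge page, call unmap_and_move.
  4. Handle the result: -ENOMEM splits a transparent huge page and retries it at once, or else stops; -EAGAIN marks the page for another pass; success and permanent failures are counted.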
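
The bounded retry loop in steps 1 and 4 can be modelled on its own. In this sketch each item fails transiently a fixed number of times before it succeeds; the `toy_` names and the error value are placeholders for illustration, and huge-page splitting and permanent failures are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EAGAIN 11	/* placeholder error value for this sketch */
#define TOY_MAX_PASSES 10

/* Hypothetical per-page state: how many more attempts will fail transiently. */
struct toy_item {
	int transient_failures;
	int done;
};

static int toy_try_move(struct toy_item *it)
{
	if (it->transient_failures > 0) {
		it->transient_failures--;
		return -TOY_EAGAIN;
	}
	it->done = 1;
	return 0;
}

/* Returns the number of items still not migrated after all passes. */
static int toy_migrate_all(struct toy_item *items, size_t n)
{
	int retry = 1, pass;
	size_t i;

	assert(items != NULL);
	for (pass = 0; pass < TOY_MAX_PASSES && retry; pass++) {
		retry = 0;
		for (i = 0; i < n; i++) {
			if (items[i].done)
				continue;	/* already migrated, no longer on the list */
			if (toy_try_move(&items[i]) == -TOY_EAGAIN)
				retry++;
		}
	}
	return retry;	/* leftovers from the last pass count as failures */
}
```

As in the listing, pages that still ask for a retry after the last pass are added to the failure count, so the caller sees them as not migrated.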

unmap_and_move_huge_page and unmap_and_move are much alike, so we only look at unmap_and_move, the one used more often:

static int unmap_and_move(new_page_t get_new_page,
				   free_page_t put_new_page,
				   unsigned long private, struct page *page,
				   int force, enum migrate_mode mode,
				   enum migrate_reason reason)
{
	int rc = MIGRATEPAGE_SUCCESS;
	struct page *newpage = NULL;

	if (!thp_migration_supported() && PageTransHuge(page))
		return -ENOMEM;

	if (page_count(page) == 1) {
		/* page was freed from under us. So we are done. */
		ClearPageActive(page);
		ClearPageUnevictable(page);
		if (unlikely(__PageMovable(page))) {
			lock_page(page);
			if (!PageMovable(page))
				__ClearPageIsolated(page);
			unlock_page(page);
		}
		goto out;
	}
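	//during compaction get_new_page is compaction_alloc: it takes a free page from the
	//free scanner's free-page list; if that list is empty, the free scanner first
	//isolates more free pages onto it
	newpage = get_new_page(page, private);
	if (!newpage)
		return -ENOMEM;

	rc = __unmap_and_move(page, newpage, force, mode);//move the page to the free page
	if (rc == MIGRATEPAGE_SUCCESS)//migration succeeded
		set_page_owner_migrate_reason(newpage, reason);//record why the page was migrated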

out:
	if (rc != -EAGAIN) {//not a retry: the page leaves the migration list
		/*
		 * A page that has been migrated has all references
		 * removed and will be freed. A page that has not been
		 * migrated will have kept its references and be restored.
		 */
		list_del(&page->lru);

		/*
		 * Compaction can migrate also non-LRU pages which are
		 * not accounted to NR_ISOLATED_*. They can be recognized
		 * as __PageMovable
		 */
		if (likely(!__PageMovable(page)))//an ordinary LRU page, not a non-LRU movable page
			mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
					page_is_file_lru(page), -thp_nr_pages(page));
	}

	/*
	 * If migration is successful, releases reference grabbed during
	 * isolation. Otherwise, restore the page to right list unless
	 * we want to retry.
	 */
	if (rc == MIGRATEPAGE_SUCCESS) {//migration succeeded
		if (reason != MR_MEMORY_FAILURE)
			/*
			 * We release the page in page_handle_poison.
			 */
			put_page(page);//drop the reference on the old page taken at isolation
	} else {
		if (rc != -EAGAIN) {//migration failed and will not be retried
			if (likely(!__PageMovable(page))) {//an ordinary LRU page
				putback_lru_page(page);
				goto put_new;
			}

			lock_page(page);
			if (PageMovable(page))
				putback_movable_page(page);
			else
				__ClearPageIsolated(page);
			unlock_page(page);
			put_page(page);
		}
put_new:
		if (put_new_page)
			put_new_page(newpage, private);
		else
			put_page(newpage);
	}

	return rc;
}

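unmap_and_move proceeds as follows:

  1. Call get_new_page. It is passed in as a parameter; during compaction the function actually called is compaction_alloc, which takes a free page from the free scanner's free-page list and, if that list is empty, first lets the free scanner isolate more free pages onto it.
  2. Call __unmap_and_move to move the page to the free page.
  3. If migration succeeds, drop the reference on the old page that was taken at isolation (unless the reason is MR_MEMORY_FAILURE).
  4. If migration fails, call put_new_page. It is also a parameter; during compaction the function actually called is compaction_free, which puts the new page obtained in step 1 back on the free scanner's free-page list. If no put_new_page was supplied, put_page releases the new page instead. Unless the result is -EAGAIN, the old page is also put back where it came from.

Next we need to look at these functions: __unmap_and_move, compaction_alloc and compaction_free.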
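
The callback pair with its opaque `private` cookie can be sketched in user space. This is only an illustration under stated assumptions: `move_one`, `struct fake_page`, `struct pool` and the `pool_*` and `mover_*` helpers are hypothetical names, reference counting and list handling are omitted, and the pointer-to-`unsigned long` cast assumes a platform where the two have the same width, as on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_page { int id; };

/* Same shape as new_page_t / free_page_t: the caller's context travels in 'private'. */
typedef struct fake_page *(*new_page_fn)(struct fake_page *old, unsigned long private);
typedef void (*free_page_fn)(struct fake_page *page, unsigned long private);
typedef int (*mover_fn)(struct fake_page *old, struct fake_page *newpage);

/* Get a target page, try the move, and hand the target back if the move fails. */
static int move_one(new_page_fn get_new, free_page_fn put_new,
		    unsigned long private, struct fake_page *old, mover_fn mover)
{
	struct fake_page *newpage;
	int rc;

	assert(get_new && mover);
	newpage = get_new(old, private);
	if (!newpage)
		return -ENOMEM;

	rc = mover(old, newpage);
	if (rc != 0 && put_new)
		put_new(newpage, private);	/* unused target goes back to its owner */
	return rc;
}

/* A one-slot pool plays the role of the free scanner's list. */
struct pool { struct fake_page *slot; int returned; };

static struct fake_page *pool_get(struct fake_page *old, unsigned long private)
{
	struct pool *p = (struct pool *)private;
	struct fake_page *pg = p->slot;

	(void)old;
	p->slot = NULL;
	return pg;
}

static void pool_put(struct fake_page *page, unsigned long private)
{
	struct pool *p = (struct pool *)private;

	p->slot = page;
	p->returned++;
}

static int mover_ok(struct fake_page *old, struct fake_page *newpage)
{
	newpage->id = old->id;	/* "copy" the page */
	return 0;
}

static int mover_busy(struct fake_page *old, struct fake_page *newpage)
{
	(void)old;
	(void)newpage;
	return -EBUSY;
}
```

With a pool holding one spare page, a `mover_busy` attempt returns the spare to the pool, a following `mover_ok` attempt consumes it, and a third attempt finds the pool empty and reports -ENOMEM, which is the same early return unmap_and_move takes when get_new_page yields nothing.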

1.4.1 __unmap_and_move

static int __unmap_and_move(struct page *page, struct page *newpage,
				int force, enum migrate_mode mode)
{
	int rc = -EAGAIN;
	int page_was_mapped = 0;
	struct anon_vma *anon_vma = NULL;
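	bool is_lru = !__PageMovable(page);//__PageMovable() is true for a non-LRU movable page

	if (!trylock_page(page)) {//try to lock the old page; if that fails:
		if (!force || mode == MIGRATE_ASYNC)//not forced, or async mode: must not block on the lock
			goto out;	//give up with -EAGAIN

		if (current->flags & PF_MEMALLOC)//the current task is itself in the allocation path
			goto out;//return rather than risk a deadlock

		lock_page(page);//block until the lock is acquired
	}

	if (PageWriteback(page)) {//the page is being written back to storage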
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			rc = -EBUSY;
			goto out_unlock;
		}
		if (!force)//migration is not forced
			goto out_unlock;//give up with -EAGAIN
		wait_on_page_writeback(page);//wait for writeback to finish
	}

	if (PageAnon(page) && !PageKsm(page))//anonymous page that is not a KSM page
		anon_vma = page_get_anon_vma(page);//take a reference on the page's anon_vma

	if (unlikely(!trylock_page(newpage)))//try to lock the new page
		goto out_unlock;

	if (unlikely(!is_lru)) {//non-LRU movable page
		rc = move_to_new_page(newpage, page, mode);//move it to the new page
		goto out_unlock_both;
	}

	//from here on the page is an LRU page
	if (!page->mapping) {//the page has no mapping (address_space)
		VM_BUG_ON_PAGE(PageAnon(page), page);
		if (page_has_private(page)) {//private data present: an orphaned page that still has buffers
			try_to_free_buffers(page);//free the buffers
			goto out_unlock_both;
		}
	} else if (page_mapped(page)) {//the page is mapped into page tables
		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
				page);
		//replace every PTE that maps this page with a migration entry
		try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK);
		page_was_mapped = 1;//remember that the page was mapped
	}

	if (!page_mapped(page))//no mappings remain
		rc = move_to_new_page(newpage, page, mode);//move the page to the new page

	if (page_was_mapped)//the page was mapped before
		remove_migration_ptes(page,	//turn the migration entries back into normal PTEs
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);

out_unlock_both:
	unlock_page(newpage);//unlock the new page
out_unlock:
	/* Drop an anon_vma reference if we took one */
	if (anon_vma)
		put_anon_vma(anon_vma);
	unlock_page(page);//unlock the old page
out:
	if (rc == MIGRATEPAGE_SUCCESS) {//migration succeeded
		if (unlikely(!is_lru))//non-LRU movable page
			put_page(newpage);//drop one reference on the new page
		else//LRU page
			putback_lru_page(newpage);//put the new page back on an LRU list
	}

	return rc;
}

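__unmap_and_move proceeds as follows:

  1. If the page is being written back to storage: in asynchronous or light synchronous mode, return -EBUSY; in MIGRATE_SYNC or MIGRATE_SYNC_NO_COPY mode, and only when the migration is forced, call wait_on_page_writeback to wait for writeback to finish.
  2. If the page is anonymous but not a KSM page, take a reference on the page's anon_vma.
  3. If the page is a non-LRU movable page, call move_to_new_page to move it to the new page.
  4. For an LRU page with no mapping: if the page has private data, it is an orphaned page that still has buffers, so call try_to_free_buffers to free them and stop there.
  5. For an LRU page that is mapped: call try_to_unmap, which replaces the PTEs that map this physical page with migration entries. Once the page is no longer mapped (or if it never was), call move_to_new_page to move it to the new page.
  6. If the page was mapped before, call remove_migration_ptes to turn the migration entries back into normal PTEs. They point at the new page if migration succeeded and at the old page otherwise.
  7. If migration succeeded: for a non-LRU movable page, drop one reference on the new page; otherwise it is an LRU page, and putback_lru_page puts the new page back on an LRU list.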
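
The writeback rule in step 1 is easy to get backwards, so here is a small user-space sketch of just that decision. It is an assumption-laden model, not kernel code: `writeback_decision`, `struct wb_decision`, `enum wb_action` and the `M_*` constants are hypothetical names that mirror the `PageWriteback` branch of the listing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the enum migrate_mode values. */
enum fake_mode { M_ASYNC, M_SYNC_LIGHT, M_SYNC, M_SYNC_NO_COPY };

enum wb_action {
	WB_PROCEED,	/* page is not under writeback, carry on */
	WB_WAIT,	/* wait for writeback, then carry on */
	WB_BAIL		/* give up on this page */
};

struct wb_decision {
	enum wb_action action;
	int rc;		/* value __unmap_and_move would return when bailing */
};

static struct wb_decision writeback_decision(bool under_writeback, bool force,
					     enum fake_mode mode)
{
	/* rc starts as -EAGAIN, exactly like the local variable in the listing */
	struct wb_decision d = { WB_PROCEED, -EAGAIN };

	if (!under_writeback)
		return d;

	if (mode != M_SYNC && mode != M_SYNC_NO_COPY) {
		/* async and light sync modes never wait for writeback */
		d.action = WB_BAIL;
		d.rc = -EBUSY;
		return d;
	}

	if (!force) {
		/* full sync but not forced: bail, rc stays -EAGAIN */
		d.action = WB_BAIL;
		return d;
	}

	d.action = WB_WAIT;
	return d;
}
```

Only the combination of a full synchronous mode and `force` reaches `WB_WAIT`; the two bail-out paths differ in their return code, which matters because migrate_pages retries -EAGAIN but treats -EBUSY as a permanent failure.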

1.4.1.1 move_to_new_page

static int move_to_new_page(struct page *newpage, struct page *page,
				enum migrate_mode mode)
{
	struct address_space *mapping;
	int rc = -EAGAIN;
	bool is_lru = !__PageMovable(page);//true for an LRU page

	VM_BUG_ON_PAGE(!PageLocked(page), page);
	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);

	mapping = page_mapping(page);//get the page's address_space

	if (likely(is_lru)) {//LRU page
		if (!mapping)//no address_space: an anonymous page outside the swap cache
			rc = migrate_page(mapping, newpage, page, mode);//migrate a single LRU page
		else if (mapping->a_ops->migratepage)//the mapping provides a migratepage method
			/*
			 * Most pages have a mapping and most filesystems
			 * provide a migratepage callback. Anonymous pages
			 * are part of swap space which also has its own
			 * migratepage callback. This is the most common path
			 * for page migration.
			 */
			rc = mapping->a_ops->migratepage(mapping, newpage,
							page, mode);//call the mapping's migratepage method
		else
			rc = fallback_migrate_page(mapping, newpage,
							page, mode);//default page migration path
	} else {//non-LRU movable page
		VM_BUG_ON_PAGE(!PageIsolated(page), page);
		if (!PageMovable(page)) {//the driver released the page after isolation: no longer movable
			rc = MIGRATEPAGE_SUCCESS;
			__ClearPageIsolated(page);//clear the isolated flag
			goto out;//nothing left to migrate
		}
		//still movable: call the driver's migratepage callback
		rc = mapping->a_ops->migratepage(mapping, newpage,
						page, mode);
		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
			!PageIsolated(page));
	}

	/*
	 * When successful, old pagecache page->mapping must be cleared before
	 * page is freed; but stats require that PageAnon be left as PageAnon.
	 */
	if (rc == MIGRATEPAGE_SUCCESS) {
		if (__PageMovable(page)) {
			VM_BUG_ON_PAGE(!PageIsolated(page), page);

			/*
			 * We clear PG_movable under page_lock so any compactor
			 * cannot try to migrate this page.
			 */
			__ClearPageIsolated(page);
		}

		/*
		 * Anonymous and movable page->mapping will be cleared by
		 * free_pages_prepare so don't reset it here for keeping
		 * the type to work PageAnon, for example.
		 */
		if (!PageMappingFlags(page))
			page->mapping = NULL;//clear the old page's mapping

		if (likely(!is_zone_device_page(newpage))) {//not device memory
			int i, nr = compound_nr(newpage);

			for (i = 0; i < nr; i++)
				flush_dcache_page(newpage + i);//flush the data cache for the new page
		}
	}
out:
	return rc;
}

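move_to_new_page proceeds as follows:

  1. LRU page with no address_space (an anonymous page): call migrate_page to migrate the single LRU page.
  2. LRU page with an address_space whose filesystem provides a migratepage callback: call that migratepage method.
  3. LRU page with an address_space whose filesystem provides no migratepage callback: call fallback_migrate_page.
  4. Non-LRU page that is no longer movable (PageMovable is false because the driver released it after isolation): clear the isolated flag and return success.
  5. Non-LRU page that is still movable: call the driver's migratepage callback.
  6. If migration succeeded, clear the old page's mapping (except for anonymous and movable pages, whose mapping field carries flag bits) and, unless the new page is device memory, flush the data cache for the new page.

1.4.1.1.1 migrate_page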
int migrate_page(struct address_space *mapping,
		struct page *newpage, struct page *page,
		enum migrate_mode mode)
{
	int rc;

	BUG_ON(PageWriteback(page));	/* Writeback must be complete */

	//replace the page in the mapping, mainly by updating struct page fields
	rc = migrate_page_move_mapping(mapping, newpage, page, 0);

	if (rc != MIGRATEPAGE_SUCCESS)
		return rc;

	if (mode != MIGRATE_SYNC_NO_COPY)
		migrate_page_copy(newpage, page);//copy the contents and state of page to newpage
	else
		migrate_page_states(newpage, page);//copy only the state of page to newpage
	return MIGRATEPAGE_SUCCESS;
}
EXPORT_SYMBOL(migrate_page);

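migrate_page proceeds as follows:

  1. Call migrate_page_move_mapping to replace the page in the mapping, which mainly means updating struct page fields.
  2. If the migration mode is not MIGRATE_SYNC_NO_COPY, call migrate_page_copy to copy the contents of page to newpage.
  3. If the migration mode is MIGRATE_SYNC_NO_COPY, call migrate_page_states to copy only the state of page to newpage.

migrate_page_move_mapping mainly carries struct page members over to the new page, including index, mapping and flag bits such as dirty. migrate_page_copy is worth a look: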

void migrate_page_copy(struct page *newpage, struct page *page)
{
	if (PageHuge(page) || PageTransHuge(page))
		copy_huge_page(newpage, page);//copy the contents of the huge page to newpage
	else
		copy_highpage(newpage, page);//copy the contents of page to newpage

	migrate_page_states(newpage, page);//copy the state of page to newpage
}

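migrate_page_copy checks the type of the page and then calls copy_huge_page or copy_highpage to copy the contents of page to newpage. copy_huge_page itself ends up calling copy_highpage for each subpage, and copy_highpage is in essence a plain memory copy of one page.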

1.4.1.1.2 fallback_migrate_page
static int fallback_migrate_page(struct address_space *mapping,
	struct page *newpage, struct page *page, enum migrate_mode mode)
{
	if (PageDirty(page)) {//dirty page
		switch (mode) {
		case MIGRATE_SYNC:
		case MIGRATE_SYNC_NO_COPY:
			break;
		default:
			return -EBUSY;
		}
		return writeout(mapping, page);//write the page back to clear its dirty state
	}

	if (page_has_private(page) &&	//the page has private data
	    !try_to_release_page(page, GFP_KERNEL))//and it cannot be released
		return mode == MIGRATE_SYNC ? -EAGAIN : -EBUSY;//report the failure

	return migrate_page(mapping, newpage, page, mode);//migrate the single page
}

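fallback_migrate_page proceeds as follows:

  1. If the page is dirty and the mode is MIGRATE_SYNC or MIGRATE_SYNC_NO_COPY, call writeout to write the page back and clear its dirty state, then return writeout's result.
  2. If the page is dirty in any other mode, return -EBUSY.
  3. If the page has private data that cannot be released, return -EAGAIN in MIGRATE_SYNC mode and -EBUSY otherwise.
  4. Otherwise call migrate_page to migrate the single page.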
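
The four cases above form a small decision table, sketched here as a pure user-space function. Everything in it is a hypothetical stand-in (`plan_fallback`, `struct fallback_plan`, `enum fallback_step`, the `FAKE_*` modes); it only mirrors the branch order of the listing and does no real writeback or migration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the enum migrate_mode values. */
enum fake_mode { FAKE_ASYNC, FAKE_SYNC_LIGHT, FAKE_SYNC, FAKE_SYNC_NO_COPY };

enum fallback_step {
	STEP_WRITEOUT,	/* write the dirty page back and return that result */
	STEP_ERROR,	/* return 'err' without migrating */
	STEP_MIGRATE	/* fall through to migrate_page */
};

struct fallback_plan {
	enum fallback_step step;
	int err;	/* only meaningful when step == STEP_ERROR */
};

static struct fallback_plan plan_fallback(bool dirty, bool has_private,
					  bool can_release, enum fake_mode mode)
{
	struct fallback_plan p = { STEP_MIGRATE, 0 };

	if (dirty) {
		/* the dirty check comes first and always ends the function */
		if (mode == FAKE_SYNC || mode == FAKE_SYNC_NO_COPY) {
			p.step = STEP_WRITEOUT;
		} else {
			p.step = STEP_ERROR;
			p.err = -EBUSY;
		}
		return p;
	}

	if (has_private && !can_release) {
		p.step = STEP_ERROR;
		/* only plain full-sync mode asks the caller to retry */
		p.err = (mode == FAKE_SYNC) ? -EAGAIN : -EBUSY;
	}
	return p;
}
```

Note the asymmetry the table makes visible: MIGRATE_SYNC_NO_COPY is treated like MIGRATE_SYNC for dirty pages, but gets -EBUSY rather than -EAGAIN when the private data cannot be released.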

1.4.1.2 remove_migration_ptes

void remove_migration_ptes(struct page *old, struct page *new, bool locked)
{
	struct rmap_walk_control rwc = {
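		//rmap_one is called once for every vma found by the walk
		.rmap_one = remove_migration_pte,//restore a potential migration pte to a working pte entry
		.arg = old,//passed to remove_migration_pte as its last argument
	};

	if (locked)
		rmap_walk_locked(new, &rwc);//reverse-mapping walk over the vmas, lock already held
	else
		rmap_walk(new, &rwc);//reverse-mapping walk over the vmas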
}

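remove_migration_ptes mainly initializes an rmap_walk_control structure and then calls rmap_walk (or rmap_walk_locked) to walk, through the reverse mapping, the vmas that map the page. rmap_walk: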

void rmap_walk(struct page *page, struct rmap_walk_control *rwc)
{
	if (unlikely(PageKsm(page)))
		rmap_walk_ksm(page, rwc);
	else if (PageAnon(page))
		rmap_walk_anon(page, rwc, false);
	else
		rmap_walk_file(page, rwc, false);
}

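If the page is a KSM page, rmap_walk_ksm is called; if it is an anonymous page, rmap_walk_anon is called; otherwise it is a file page and rmap_walk_file is called.
These functions are all much alike. The main difference is that file pages and anonymous pages find their vmas in different ways, so each needs its own method of walking them. In the end they all call the rmap_one member of rwc, which here is remove_migration_pte. I admit I cannot fully follow that function yet; I only know its effect: the vma used to map the old page, and this function makes the vma map the new page instead.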
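
The effect described above, a walk that calls one callback per vma and repoints each migration entry at the new page, can be modelled in user space. This is a deliberately simplified sketch with hypothetical names (`fake_vma`, `fake_walk_control`, `fake_rmap_walk` and friends): a vma is reduced to a single PTE-like slot, the reverse mapping is a plain array, and the callback signature is shorter than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_page { int id; };

/* One vma reduced to a single PTE-like slot. */
struct fake_vma {
	struct fake_page *target;	/* page the slot refers to */
	bool migration_entry;		/* true while the slot is a migration entry */
};

/* Same idea as rmap_walk_control: a per-vma callback plus an opaque argument. */
struct fake_walk_control {
	bool (*rmap_one)(struct fake_page *page, struct fake_vma *vma, void *arg);
	void *arg;
};

/* Turn a migration entry for 'old' (passed in arg) back into a mapping of 'page'. */
static bool fake_remove_migration_pte(struct fake_page *page,
				      struct fake_vma *vma, void *arg)
{
	struct fake_page *old = arg;

	if (vma->migration_entry && vma->target == old) {
		vma->target = page;
		vma->migration_entry = false;
	}
	return true;	/* keep walking */
}

/* Visit every vma and stop early if the callback asks for it. */
static void fake_rmap_walk(struct fake_page *page, struct fake_vma *vmas,
			   size_t n, struct fake_walk_control *rwc)
{
	assert(rwc && rwc->rmap_one);
	for (size_t i = 0; i < n; i++)
		if (!rwc->rmap_one(page, &vmas[i], rwc->arg))
			break;
}

static void fake_remove_migration_ptes(struct fake_page *old,
				       struct fake_page *new_page,
				       struct fake_vma *vmas, size_t n)
{
	struct fake_walk_control rwc = {
		.rmap_one = fake_remove_migration_pte,
		.arg = old,
	};

	fake_rmap_walk(new_page, vmas, n, &rwc);
}
```

In this model, passing the old page as both arguments restores the original mappings, which corresponds to the `rc == MIGRATEPAGE_SUCCESS ? newpage : page` choice made by the caller in __unmap_and_move.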

1.4.2 compaction_alloc

static struct page *compaction_alloc(struct page *migratepage,
					unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;
	struct page *freepage;

	if (list_empty(&cc->freepages)) {
		isolate_freepages(cc);

		if (list_empty(&cc->freepages))
			return NULL;
	}

	freepage = list_entry(cc->freepages.next, struct page, lru);
	list_del(&freepage->lru);
	cc->nr_freepages--;

	return freepage;
}

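compaction_alloc proceeds as follows:

  1. If the scanner's free list is empty, call isolate_freepages to scan and move free pages from the buddy system onto the scanner's free list. If the list is still empty after the scan, return NULL.
  2. Take one free page off the scanner's free list and return it.

isolate_freepages is much like the isolate_migratepages function seen earlier, with only two differences: it isolates free pages rather than migratable pages, and it scans in the opposite direction. That is all, so we will not go into it here.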

1.4.3 compaction_free

static void compaction_free(struct page *page, unsigned long data)
{
	struct compact_control *cc = (struct compact_control *)data;

	list_add(&page->lru, &cc->freepages);
	cc->nr_freepages++;
}

compaction_free is simple: it links the page back onto the cc->freepages list through page->lru and increments cc->nr_freepages.
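
To see the two callbacks work as a pair, here is a user-space sketch of the same free-list discipline. It is a model under stated assumptions, not the kernel implementation: `fake_cc`, `fake_page` and the `fake_*` functions are hypothetical, a singly linked list replaces `list_head`, and the stand-in for isolate_freepages simply drains a pre-filled pool instead of scanning a zone.

```c
#include <assert.h>
#include <stddef.h>

struct fake_page {
	struct fake_page *next;
	int pfn;
};

/* Reduced stand-in for struct compact_control. */
struct fake_cc {
	struct fake_page *freepages;	/* isolated free pages, ready to hand out */
	int nr_freepages;
	struct fake_page *pool;		/* pages the "free scanner" can still find */
};

/* Hypothetical stand-in for isolate_freepages(): move the whole pool onto freepages. */
static void fake_isolate_freepages(struct fake_cc *cc)
{
	while (cc->pool) {
		struct fake_page *p = cc->pool;

		cc->pool = p->next;
		p->next = cc->freepages;
		cc->freepages = p;
		cc->nr_freepages++;
	}
}

/* Refill the list if it is empty, then pop its first page (NULL if none can be found). */
static struct fake_page *fake_compaction_alloc(struct fake_cc *cc)
{
	struct fake_page *p;

	if (!cc->freepages) {
		fake_isolate_freepages(cc);
		if (!cc->freepages)
			return NULL;
	}

	p = cc->freepages;
	cc->freepages = p->next;
	p->next = NULL;
	cc->nr_freepages--;
	return p;
}

/* Push an unused target page back onto the head of the list. */
static void fake_compaction_free(struct fake_page *page, struct fake_cc *cc)
{
	assert(page != NULL);
	page->next = cc->freepages;
	cc->freepages = page;
	cc->nr_freepages++;
}
```

Because both functions work at the head of the list, a page handed back by `fake_compaction_free` is the first one returned by the next `fake_compaction_alloc`, matching the `list_add` and `cc->freepages.next` pairing in the kernel listings.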