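1. Background
Finally, this article covers KASAN_HW_TAGS, which on ARM64 is MTE (Memory Tagging Extension). The feature was introduced in ARMv8.5, although the chips on the market that actually support MTE are all ARMv9. Because the feature depends on hardware support, this article uses QEMU to study it.
2. Configuration for Enabling KASAN_HW_TAGS (MTE)
Kernel configuration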
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_SW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_HW_TAGS=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_KASAN_SW_TAGS=y
CONFIG_KASAN=y
# CONFIG_KASAN_GENERIC is not set
# CONFIG_KASAN_SW_TAGS is not set
CONFIG_KASAN_HW_TAGS=y // MTE-related
CONFIG_KASAN_VMALLOC=y
Check whether the MTE-related features are enabled
# ARMv8.5 architectural features
#
CONFIG_AS_HAS_ARMV8_5=y
......
CONFIG_ARM64_AS_HAS_MTE=y
CONFIG_ARM64_MTE=y
Confirming that MTE is actually enabled
geek@geek-virtual-machine:~/workspace/linux/qemu$ ./linux_boot.sh
qemu-system-aarch64: MTE requested, but not supported by the guest CPU
If MTE turns out not to be enabled while debugging, set a breakpoint on kasan_init_hw_tags:
void __init kasan_init_hw_tags(void)
{
/* If hardware doesn't support MTE, don't initialize KASAN. */
if (!system_supports_mte())
return;
......
/* KASAN is now initialized, enable it. */
static_branch_enable(&kasan_flag_enabled);
pr_info("KernelAddressSanitizer initialized (hw-tags, mode=%s, vmalloc=%s, stacktrace=%s)\n",
kasan_mode_info(),
kasan_vmalloc_enabled() ? "on" : "off",
kasan_stack_collection_enabled() ? "on" : "off");
}
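The error above turned out to be caused by the previously used CPU type, which does not support MTE. The modified QEMU launch script is shown below.
The key changes are adding mte=on to the -machine option and choosing a CPU that supports MTE, such as cortex-a710.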
qemu-system-aarch64 \
-machine virt,gic-version=3,mte=on \
-nographic \
-m size=2048M \
-cpu cortex-a710 \
-smp 8 \
-kernel Image \
-drive format=raw,file=rootfs.img \
-append "root=/dev/vda rw nokaslr kasan=on kasan.mode=sync kasan.stacktrace=on kasan.fault=report " \
-s
When MTE is enabled successfully, the kernel log (kmsg) prints:
kasan: KernelAddressSanitizer initialized (hw-tags, mode=sync, vmalloc=on, stacktrace=on)
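3. Basic Principles of KASAN_HW_TAGS (MTE)
The MTE lock-and-key model
In MTE, the key is stored in the top byte of the pointer, and the lock is the tag placed on the memory. Memory can be accessed only when the key matches the lock.
Instructions added by MTE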
Instruction | Name |
ADDG | Add with Tag |
CMPP | Compare with Tag |
GMI | Tag Mask Insert |
IRG | Insert Random Tag |
LDG | Load Allocation Tag |
LDGV | Load Tag Vector |
ST2G | Store Allocation Tags to two granules |
STG | Store Allocation Tag |
STGP | Store Allocation Tag and Pair |
STGV | Store Tag Vector |
STZ2G | Store Allocation Tags to two granules Zeroing |
STZG | Store Allocation Tag, Zeroing |
SUBG | Subtract with Tag |
SUBP | Subtract Pointer |
SUBPS | Subtract Pointer, setting Flags |
... |
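Using MTE basically takes three steps:
1. Create the memory tag (lock)
2. Tag the address (pointer key)
MTE relies on the ARM64 TBI (Top Byte Ignore) feature to store the tag in the top byte of the pointer. This is similar to the KASAN_SW_TAGS implementation described earlier, except that MTE needs only 4 bits.
3. Tag check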
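The three steps can be sketched as a small software model. This is only an illustration of the lock-and-key idea, not how the hardware or the kernel implements it: the names model_set_lock, model_set_key, model_check and the toy tag store are hypothetical, and the model assumes small, 16-byte-aligned addresses below 1 KiB.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE_SIZE   16u  /* MTE tags memory in 16-byte granules */
#define MODEL_GRANULES 64u  /* size of the toy tag store (hypothetical) */

/* Toy tag store: one 4-bit lock per granule, covering addresses 0..1023. */
static uint8_t model_locks[MODEL_GRANULES];

/* Step 1: set the lock on every granule in [addr, addr + size).
 * Assumes addr and size are multiples of GRANULE_SIZE. */
static void model_set_lock(uint64_t addr, size_t size, uint8_t tag)
{
    for (uint64_t a = addr; a < addr + size; a += GRANULE_SIZE)
        model_locks[a / GRANULE_SIZE] = tag & 0xf;
}

/* Step 2: place the 4-bit key in bits 59:56 of the pointer value. */
static uint64_t model_set_key(uint64_t addr, uint8_t tag)
{
    return (addr & ~(0xfULL << 56)) | ((uint64_t)(tag & 0xf) << 56);
}

/* Step 3: the access is allowed only when the key equals the lock. */
static bool model_check(uint64_t tagged_addr)
{
    uint8_t key = (tagged_addr >> 56) & 0xf;
    uint64_t addr = tagged_addr & ~(0xffULL << 56); /* strip the top byte */

    return model_locks[addr / GRANULE_SIZE] == key;
}
```

For example, after model_set_lock(0x100, 32, 0x9) and p = model_set_key(0x100, 0x9), model_check(p) and model_check(p + 16) succeed, while model_check(p + 32) fails because that granule still carries lock 0.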
4. Key Implementation of KASAN_HW_TAGS (MTE) in Linux
4.1 An example log first
/test # echo 0 > /dev/kasan_test
[ 156.628134] kmalloc_oob_right f9ff0000038b5000
[ 156.629125] ==================================================================
[ 156.633409] BUG: KASAN: invalid-access in kmalloc_oob_right.constprop.0+0x48/0x64 [kasan_driver]
[ 156.634892] Write at addr f9ff0000038b5081 by task sh/179
[ 156.635552] Pointer tag: [f9], memory tag: [fe]
[ 156.635990]
[ 156.636490] CPU: 4 PID: 179 Comm: sh Tainted: G N 6.6.1-gf1e080ccc5c5-dirty #19
[ 156.637310] Hardware name: linux,dummy-virt (DT)
[ 156.637771] Call trace:
[ 156.638111] dump_backtrace+0x90/0xe8
[ 156.638721] show_stack+0x18/0x24
[ 156.639046] dump_stack_lvl+0x48/0x60
[ 156.639391] print_report+0x100/0x600
[ 156.639703] kasan_report+0x84/0xac
[ 156.640034] __do_kernel_fault+0xa4/0x194
[ 156.640376] do_tag_check_fault+0x78/0x8c
[ 156.640724] do_mem_abort+0x44/0x94
[ 156.641052] el1_abort+0x40/0x60
[ 156.641367] el1h_64_sync_handler+0xa4/0xe4
[ 156.641719] el1h_64_sync+0x64/0x68
[ 156.642042] kmalloc_oob_right.constprop.0+0x48/0x64 [kasan_driver]
[ 156.642511] kasan_test_case+0x38/0xb0 [kasan_driver]
[ 156.642921] kasan_testcase_write+0x7c/0xf4 [kasan_driver]
[ 156.643350] vfs_write+0xc8/0x300
[ 156.643666] ksys_write+0x74/0x10c
[ 156.643986] __arm64_sys_write+0x1c/0x28
[ 156.644336] invoke_syscall+0x48/0x110
[ 156.644681] el0_svc_common.constprop.0+0x40/0xe0
[ 156.645082] do_el0_svc+0x1c/0x28
[ 156.645415] el0_svc+0x40/0x114
[ 156.645728] el0t_64_sync_handler+0x120/0x12c
[ 156.646092] el0t_64_sync+0x19c/0x1a0
[ 156.646528]
[ 156.646749] The buggy address belongs to the object at ffff0000038b5080
[ 156.646749] which belongs to the cache kmalloc-128 of size 128
[ 156.647547] The buggy address is located 1 bytes inside of
[ 156.647547] 128-byte region [ffff0000038b5080, ffff0000038b5100)
[ 156.648270]
[ 156.648533] The buggy address belongs to the physical page:
[ 156.649067] page:00000000ffd93f36 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x438b5
[ 156.650024] flags: 0x3fffc0000000800(slab|node=0|zone=0|lastcpupid=0xffff|kasantag=0x0)
[ 156.651089] page_type: 0xffffffff()
[ 156.651723] raw: 03fffc0000000800 f6ff000002c02600 dead000000000122 0000000000000000
[ 156.652262] raw: 0000000000000000 0000000080200020 00000001ffffffff 0000000000000000
[ 156.652786] page dumped because: kasan: bad access detected
[ 156.653183]
[ 156.653375] Memory state around the buggy address:
[ 156.653836] ffff0000038b4e00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
[ 156.654346] ffff0000038b4f00: f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7 f7
[ 156.654857] >ffff0000038b5000: f9 f9 f9 f9 f9 f9 f9 f9 fe fe fe fe fe fe fe fe
[ 156.655342] ^
[ 156.655870] ffff0000038b5100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 156.656351] ffff0000038b5200: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
[ 156.656842] ==================================================================
[ 156.657836] Disabling lock debugging due to kernel taint
[ 156.659261] kasan_test_case type 0
The example above triggers an out-of-bounds access. The key is f9, but the out-of-bounds memory carries memory tag (lock) fe, so the access faults.
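The numbers in the report can be reproduced by hand. The sketch below is a hypothetical helper pair (report_pointer_tag and report_granule_index are not kernel functions) that extracts the key from the faulting pointer and locates the matching entry in the "Memory state" row, assuming each entry in that row covers one 16-byte granule.

```c
#include <stdint.h>

#define GRANULE_SIZE 16u /* one dump entry per 16-byte granule (assumed) */

/* Key: the top byte of the pointer, as the report prints it. */
static uint8_t report_pointer_tag(uint64_t ptr)
{
    return (uint8_t)(ptr >> 56);
}

/* Index of the granule holding addr within a dump row starting at
 * row_base. Only the low 56 bits take part, so tags are ignored. */
static unsigned int report_granule_index(uint64_t addr, uint64_t row_base)
{
    uint64_t mask = (1ULL << 56) - 1;

    return (unsigned int)(((addr & mask) - (row_base & mask)) / GRANULE_SIZE);
}
```

With the faulting address f9ff0000038b5081 and the row base ffff0000038b5000, the pointer tag is 0xf9 and the granule index is 8. Entry 8 of that row (counting from 0) is the first fe, which matches the "Pointer tag: [f9], memory tag: [fe]" line.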
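4.2 Key code analysis
Consider the function kmalloc_oob_right from the test code. Its disassembly shows that with MTE, detecting an out-of-bounds access no longer needs the load-tag-and-compare code that generic KASAN and SW_TAGS KASAN insert. In MTE the hardware does all of this.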
(gdb) disassemble
Dump of assembler code for function kmalloc_oob_right:
0xffff80007a8801b0 <+0>: paciasp
=> 0xffff80007a8801b4 <+4>: adrp x0, 0xffff800081a2d000 <cpucap_ptrs+272>
0xffff80007a8801b8 <+8>: stp x29, x30, [sp, #-32]!
0xffff80007a8801bc <+12>: mov x2, #0x80 // #128
0xffff80007a8801c0 <+16>: mov w1, #0xcc0 // #3264
0xffff80007a8801c4 <+20>: mov x29, sp
0xffff80007a8801c8 <+24>: ldr x0, [x0, #1752]
0xffff80007a8801cc <+28>: str x19, [sp, #16]
0xffff80007a8801d0 <+32>: bl 0xffff80008022e498 <kmalloc_trace>
0xffff80007a8801d4 <+36>: mov x19, x0
0xffff80007a8801d8 <+40>: adrp x1, 0xffff80007a884000
0xffff80007a8801dc <+44>: add x1, x1, #0x110
0xffff80007a8801e0 <+48>: mov x2, x0
0xffff80007a8801e4 <+52>: add x1, x1, #0x30
0xffff80007a8801e8 <+56>: adrp x0, 0xffff80007a884000
0xffff80007a8801ec <+60>: add x0, x0, #0x50
0xffff80007a8801f0 <+64>: bl 0xffff8000800f45a0 <_printk>
0xffff80007a8801f4 <+68>: mov w1, #0x79 // #121
0xffff80007a8801f8 <+72>: strb w1, [x19, #129] // triggers the out-of-bounds write
0xffff80007a8801fc <+76>: mov x0, x19
0xffff80007a880200 <+80>: bl 0xffff80008022f5d0 <kfree>
0xffff80007a880204 <+84>: ldr x19, [sp, #16]
0xffff80007a880208 <+88>: ldp x29, x30, [sp], #32
0xffff80007a88020c <+92>: autiasp
0xffff80007a880210 <+96>: ret
Setting the memory tag, again using kmalloc as the example:
kmalloc
-->kmalloc_trace
-->__kmem_cache_alloc_node
-->slab_alloc_node
-->slab_post_alloc_hook
-->kasan_slab_alloc
void * __must_check __kasan_slab_alloc(struct kmem_cache *cache,
void *object, gfp_t flags, bool init)
{
....
/*
* Generate and assign random tag for tag-based modes.
* Tag is ignored in set_tag() for the generic mode.
*/
tag = assign_tag(cache, object, false); // 1. assign a random tag
tagged_object = set_tag(object, tag); // 2. set the tag in the pointer
/*
* Unpoison the whole object.
* For kmalloc() allocations, kasan_kmalloc() will do precise poisoning.
*/
kasan_unpoison(tagged_object, cache->object_size, init); // 3. set the memory tag
/* Save alloc info (if possible) for non-kmalloc() allocations. */
if (kasan_stack_collection_enabled() && !is_kmalloc_cache(cache))
kasan_save_alloc_info(cache, tagged_object, flags);
return tagged_object;
}
#if defined(CONFIG_KASAN_SW_TAGS) || defined(CONFIG_KASAN_HW_TAGS)
#define __tag_shifted(tag) ((u64)(tag) << 56)
#define __tag_reset(addr) __untagged_addr(addr)
#define __tag_get(addr) (__u8)((u64)(addr) >> 56)
1. Assigning the tag
static inline u8 assign_tag(struct kmem_cache *cache,
const void *object, bool init)
{
if (IS_ENABLED(CONFIG_KASAN_GENERIC))
return 0xff;
/*
* If the cache neither has a constructor nor has SLAB_TYPESAFE_BY_RCU
* set, assign a tag when the object is being allocated (init == false).
*/
if (!cache->ctor && !(cache->flags & SLAB_TYPESAFE_BY_RCU))
return init ? KASAN_TAG_KERNEL : kasan_random_tag();
/* For caches that either have a constructor or SLAB_TYPESAFE_BY_RCU: */
#ifdef CONFIG_SLAB
/* For SLAB assign tags based on the object index in the freelist. */
return (u8)obj_to_index(cache, virt_to_slab(object), (void *)object);
#else
/*
* For SLUB assign a random tag during slab creation, otherwise reuse
* the already assigned tag.
*/
return init ? kasan_random_tag() : get_tag(object);
#endif
}
static inline u8 kasan_random_tag(void) { return hw_get_random_tag(); }
#ifdef CONFIG_KASAN_HW_TAGS
...
#define hw_get_random_tag() arch_get_random_tag()
#define hw_get_mem_tag(addr) arch_get_mem_tag(addr)
#define hw_set_mem_tag_range(addr, size, tag, init) \
arch_set_mem_tag_range((addr), (size), (tag), (init))
#ifdef CONFIG_KASAN_HW_TAGS
...
#define arch_get_random_tag() mte_get_random_tag()
#define arch_get_mem_tag(addr) mte_get_mem_tag(addr)
#define arch_set_mem_tag_range(addr, size, tag, init) \
mte_set_mem_tag_range((addr), (size), (tag), (init))
#endif /* CONFIG_KASAN_HW_TAGS */
/* Generate a random tag. */
static inline u8 mte_get_random_tag(void)
{
void *addr;
asm(__MTE_PREAMBLE "irg %0, %0"
: "=r" (addr));
return mte_get_ptr_tag(addr);
}
Setting the memory tag
static inline void kasan_poison(const void *addr, size_t size, u8 value, bool init)
{
addr = kasan_reset_tag(addr);
/* Skip KFENCE memory if called explicitly outside of sl*b. */
if (is_kfence_address(addr))
return;
if (WARN_ON((unsigned long)addr & KASAN_GRANULE_MASK))
return;
if (WARN_ON(size & KASAN_GRANULE_MASK))
return;
hw_set_mem_tag_range((void *)addr, size, value, init);
}
Recall the earlier definitions:
#define hw_set_mem_tag_range(addr, size, tag, init) \
arch_set_mem_tag_range((addr), (size), (tag), (init))
#define arch_set_mem_tag_range(addr, size, tag, init) \
mte_set_mem_tag_range((addr), (size), (tag), (init))
static inline void mte_set_mem_tag_range(void *addr, size_t size, u8 tag,
bool init)
{
u64 curr, mask, dczid, dczid_bs, dczid_dzp, end1, end2, end3;
/* Read DC G(Z)VA block size from the system register. */
dczid = read_cpuid(DCZID_EL0);
dczid_bs = 4ul << (dczid & 0xf);
dczid_dzp = (dczid >> 4) & 1;
curr = (u64)__tag_set(addr, tag);
mask = dczid_bs - 1;
/* STG/STZG up to the end of the first block. */
end1 = curr | mask;
end3 = curr + size;
/* DC GVA / GZVA in [end1, end2) */
end2 = end3 & ~mask;
/*
* The following code uses STG on the first DC GVA block even if the
* start address is aligned - it appears to be faster than an alignment
* check + conditional branch. Also, if the range size is at least 2 DC
* GVA blocks, the first two loops can use post-condition to save one
* branch each.
*/
#define SET_MEMTAG_RANGE(stg_post, dc_gva) \
do { \
if (!dczid_dzp && size >= 2 * dczid_bs) {\
do { \
curr = stg_post(curr); \
} while (curr < end1); \
\
do { \
dc_gva(curr); \
curr += dczid_bs; \
} while (curr < end2); \
} \
\
while (curr < end3) \
curr = stg_post(curr); \
} while (0)
if (init)
SET_MEMTAG_RANGE(__stzg_post, __dc_gzva);
else
SET_MEMTAG_RANGE(__stg_post, __dc_gva);
#undef SET_MEMTAG_RANGE
}
static inline u64 __stg_post(u64 p)
{
asm volatile(__MTE_PREAMBLE "stg %0, [%0], #16"
: "+r"(p)
:
: "memory");
return p;
}
The core implementation above comes down to two instructions: IRG and STG, which fill in the key and the lock respectively.
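The loop structure of SET_MEMTAG_RANGE is easier to follow with concrete numbers. The sketch below is a user-space model that only counts operations instead of issuing STG or DC GVA; struct tag_ops and model_set_mem_tag_range are hypothetical names, block_size stands in for dczid_bs, and dzp stands in for dczid_dzp.

```c
#include <stdint.h>

#define GRANULE_SIZE 16u /* bytes tagged by one STG/STZG */

struct tag_ops {
    unsigned int stg;    /* number of 16-byte STG/STZG stores */
    unsigned int dc_gva; /* number of whole-block DC GVA/GZVA operations */
};

/* Count the operations that the three loops would issue for a range. */
static struct tag_ops model_set_mem_tag_range(uint64_t addr, uint64_t size,
                                              uint64_t block_size, int dzp)
{
    struct tag_ops ops = { 0, 0 };
    uint64_t curr = addr;
    uint64_t mask = block_size - 1;
    uint64_t end1 = curr | mask;  /* end of the first block */
    uint64_t end3 = curr + size;  /* end of the whole range */
    uint64_t end2 = end3 & ~mask; /* last block boundary inside the range */

    if (!dzp && size >= 2 * block_size) {
        do { /* STG up to the first block boundary */
            curr += GRANULE_SIZE;
            ops.stg++;
        } while (curr < end1);

        do { /* whole blocks via DC GVA */
            curr += block_size;
            ops.dc_gva++;
        } while (curr < end2);
    }

    while (curr < end3) { /* STG for whatever is left */
        curr += GRANULE_SIZE;
        ops.stg++;
    }
    return ops;
}
```

Assuming a 64-byte block, tagging 256 bytes at 0x1000 takes 4 STG stores for the first block and 3 DC GVA operations for the rest, whereas a 32-byte range skips the block path and takes 2 STG stores. With dzp set, the same 256 bytes take 16 STG stores.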
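4.3 Where are the tags stored?
MTE divides tags into two kinds:
Address Tag: the key, 4 bits stored in the top byte of the virtual address (using the ARM64 TBI feature).
Memory Tag: also called the lock. A memory tag is also 4 bits, and each 4-bit tag covers 16 bytes. Unlike generic KASAN and SW_TAGS KASAN, in MTE the storage of memory tags is implemented by hardware.
As the figure shows, MTE tags are in fact also stored in memory. At 4 bits of tag per 16 bytes, enabling MTE consumes 1/32 of physical memory. The address of this tag memory is not visible from the kernel, however, and nothing in the kernel configures it.
According to the Arm manual, as the figure shows, a Memory Tag Unit (MTU) manages and separates tag storage from data storage.
The CI-700 manual describes a register that sets the starting physical address of the MTE tag storage, and states that this register can only be programmed from the secure world (EL3). This is why no such setup can be found in the kernel. (Typically an MTE-enabled hardware platform adds a reserved-memory region in the device tree; that region is configured in TrustZone and used to store the tags.)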
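The 1/32 figure follows directly from 4 tag bits per 16-byte granule. A minimal sketch of the arithmetic, where tag_storage_bytes is a hypothetical helper and not a kernel function:

```c
#include <stdint.h>

#define GRANULE_SIZE 16u /* bytes covered by one tag */
#define TAG_BITS     4u  /* bits per tag */

/* Bytes of tag storage needed to cover mem_bytes of tagged memory. */
static uint64_t tag_storage_bytes(uint64_t mem_bytes)
{
    return mem_bytes / GRANULE_SIZE * TAG_BITS / 8;
}
```

For the 2048 MiB guest used in the QEMU script, this gives 64 MiB of tag storage, which is 1/32 of the data memory.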
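5. Using MTE in User Space
The sections above covered how MTE is implemented and used in the kernel. User space works similarly, and the Arm website provides a good example: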
/*
* Memory Tagging Extension (MTE) example for Linux
*
* Compile with gcc and use -march=armv8.5-a+memtag
* gcc mte-example.c -o mte-example -march=armv8.5-a+memtag
*
* Compilation should be done on a recent Arm Linux machine for the .h files to include MTE support.
*
*/
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/mman.h>
#include <sys/prctl.h>
/*
* Insert a random logical tag into the given pointer.
* IRG instruction.
*/
#define insert_random_tag(ptr) ({ \
uint64_t __val; \
asm("irg %0, %1" : "=r" (__val) : "r" (ptr)); \
__val; \
})
/*
* Set the allocation tag on the destination address.
* STG instruction.
*/
#define set_tag(tagged_addr) do { \
asm volatile("stg %0, [%0]" : : "r" (tagged_addr) : "memory"); \
} while (0)
int main(void)
{
unsigned char *ptr; // pointer to memory for MTE demonstration
/*
* Use the architecture dependent information about the processor
* from getauxval() to check if MTE is available.
*/
if (!((getauxval(AT_HWCAP2)) & HWCAP2_MTE))
{
printf("MTE is not supported\n");
return EXIT_FAILURE;
}
else
{
printf("MTE is supported\n");
}
/*
* Enable MTE with synchronous checking
*/
if (prctl(PR_SET_TAGGED_ADDR_CTRL,
PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_SYNC | (0xfffe << PR_MTE_TAG_SHIFT),
0, 0, 0))
{
perror("prctl() failed");
return EXIT_FAILURE;
}
/*
* Allocate 1 page of memory with MTE protection
*/
ptr = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE | PROT_MTE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (ptr == MAP_FAILED)
{
perror("mmap() failed");
return EXIT_FAILURE;
}
/*
* Print the pointer value with the default tag (expecting 0)
*/
printf("pointer is %p\n", ptr);
/*
* Write the first 2 bytes of the memory with the default tag
*/
ptr[0] = 0x41;
ptr[1] = 0x42;
/*
* Read back to confirm the writes
*/
printf("ptr[0] = 0x%hhx ptr[1] = 0x%hhx\n", ptr[0], ptr[1]);
/*
* Generate a random tag and store it for the address (IRG instruction)
*/
ptr = (unsigned char *) insert_random_tag(ptr);
/*
* Set the lock on the memory to match the key in the pointer (STG instruction)
*/
set_tag(ptr);
/*
* Print the pointer value with the new tag
*/
printf("pointer is now %p\n", ptr);
/*
* Write the first 2 bytes of the memory again, with the new tag
*/
ptr[0] = 0x43;
ptr[1] = 0x44;
/*
* Read back to confirm the writes
*/
printf("ptr[0] = 0x%hhx ptr[1] = 0x%hhx\n", ptr[0], ptr[1]);
/*
* Write to memory beyond the 16 byte granule (offset 0x10)
* MTE should generate an exception
* If the offset is less than 0x10 no SIGSEGV will occur.
*/
printf("Expecting SIGSEGV...\n");
ptr[0x10] = 0x55;
/*
* Program only reaches this if no SIGSEGV occurs
*/
printf("...no SIGSEGV was received\n");
return EXIT_FAILURE;
}
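The example above is simple: it uses the irg and stg instructions to generate a tag for the pointer (the key) and set the lock on the given memory, then performs an out-of-bounds access, which triggers an exception.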
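The prctl() argument in the example packs three things into one word: the tagged-address enable bit, the synchronous tag-check mode, and a 16-bit mask of the tags that IRG may generate. The sketch below decodes that word. The DEMO_ constants are local copies of the values I understand <linux/prctl.h> to define for PR_TAGGED_ADDR_ENABLE, PR_MTE_TCF_SYNC and PR_MTE_TAG_SHIFT, so treat them as assumptions; demo_mte_ctrl and demo_tag_included are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the uapi values; prefixed to avoid clashing with them. */
#define DEMO_PR_TAGGED_ADDR_ENABLE (1UL << 0)
#define DEMO_PR_MTE_TCF_SYNC       (1UL << 1)
#define DEMO_PR_MTE_TAG_SHIFT      3

/* Build the control word the example passes to prctl(). */
static unsigned long demo_mte_ctrl(unsigned long include_mask)
{
    return DEMO_PR_TAGGED_ADDR_ENABLE | DEMO_PR_MTE_TCF_SYNC |
           ((include_mask & 0xffffUL) << DEMO_PR_MTE_TAG_SHIFT);
}

/* True if IRG is allowed to generate the given 4-bit tag. */
static bool demo_tag_included(unsigned long ctrl, unsigned int tag)
{
    unsigned long include_mask = (ctrl >> DEMO_PR_MTE_TAG_SHIFT) & 0xffffUL;

    return (include_mask >> (tag & 0xf)) & 1UL;
}
```

With the include mask 0xfffe, tags 1 to 15 are allowed and tag 0 is excluded, so the random tag always differs from the default tag 0 of freshly mapped memory. That is why the write at offset 0x10, whose granule still has tag 0, is expected to fault.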
Result of running the Arm example in QEMU:
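6. Summary
Comparison of memory-corruption detection tools in the kernel
Type | Shadow memory overhead | CPU overhead | Pros and cons |
KASAN | 1/8 | Heavy: every memory access must compute and compare the shadow value | Precise: corruption within 8 bytes is also detected; works on both 32-bit and 64-bit |
KASAN_SW_TAGS | 1/16 | Every memory access must compute and compare the shadow value | Cannot distinguish corruption within a 16-byte granule; 64-bit only (depends on the arm64 TBI feature) |
KASAN_HW_TAGS (MTE) | 1/32 | About 5% overhead; tag generation and checking are done by hardware | Cannot distinguish corruption within a 16-byte granule; only usable on platforms that support MTE |
Compared with KASAN_SW_TAGS, MTE is mainly a performance improvement; its limitations and capabilities are close to those of KASAN_SW_TAGS. MTE was not really created for debugging: Google hopes to push MTE into production builds, with the fundamental goal of solving memory-safety problems. At the time of writing it still has a performance impact (no vendor has yet enabled it on end-user devices), but as MTE itself is optimized and CPU performance improves further, MTE may well reach commercial products before long.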
References:
Memory Tagging Extension (MTE) in AArch64 Linux
Learn about the Arm Memory Tagging Extension: Build and run an example application to learn about MTE
Arm Memory Tagging Extension (MTE) | Android NDK | Android Developers
https://www.qemu.org/docs/master/system/arm/virt.html
https://www.kernel.org/doc/html/v5.15/arm64/memory-tagging-extension.html
Documentation - Arm Developer