How Linux Decides What Runs Next: CFS, EEVDF, and the Scheduler That Powers Everything

Right now, your Linux box is probably running somewhere between 300 and 2000 processes. You have maybe 4 to 16 CPU cores. Something has to decide which of those processes gets a core, for how long, and what happens when something more important shows up. That something is the kernel's process scheduler.

A bad scheduler is immediately noticeable. Your UI stutters while a compile job runs. Audio crackles. A database query takes 3x longer because some batch job ate all the CPU. The scheduler is one of those pieces of infrastructure that only gets attention when it fails, but its design decisions affect everything from desktop responsiveness to datacenter throughput.

Linux has gone through three major scheduler implementations in the last two decades. Each one fixed real problems and introduced new tradeoffs. Let's walk through all of them.

The O(1) Scheduler (Linux 2.6, circa 2003)

Before CFS, Linux used the O(1) scheduler, written by Ingo Molnar. The name describes its key property: task selection happened in constant time regardless of how many runnable tasks existed.

The design was built around two priority arrays per CPU, each containing 140 priority levels (0-99 for real-time, 100-139 for normal tasks). Each priority level had a linked list of tasks. The two arrays were called active and expired.

struct prio_array {
    unsigned int     nr_active;
    unsigned long    bitmap[BITMAP_SIZE];  // 140 bits
    struct list_head queue[MAX_PRIO];      // 140 lists
};

The scheduler picked the highest-priority non-empty list from the active array (a single find_first_bit() operation, hence O(1)), ran the first task on that list, and when a task exhausted its timeslice it moved to the expired array. When the active array was empty, the two arrays were swapped by pointer exchange.
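The mechanism fits in a few lines of C. This is a toy single-CPU model under stated simplifications (no per-level task lists, no timeslice accounting); `toy_prio_array`, `set_prio_bit`, and `pick_next` are illustrative names, not the kernel's.

```c
#include <assert.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (8 * (int)sizeof(unsigned long))

/* Toy model of the O(1) arrays: one bit per priority level, set when
 * that level's run list is non-empty. */
struct toy_prio_array {
    unsigned long bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
    unsigned int  nr_active;
};

void set_prio_bit(struct toy_prio_array *p, int prio)
{
    p->bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

/* Stand-in for find_first_bit(): the real helper scans a word at a
 * time, so cost is bounded by bitmap size, not by task count. */
int first_set_bit(const unsigned long *bm, int nbits)
{
    for (int i = 0; i < nbits; i++)
        if (bm[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return -1;  /* nothing runnable */
}

/* Pick the highest-priority (lowest-numbered) non-empty level. If the
 * active array has drained, swap active and expired by pointer
 * exchange -- the O(1) trick that avoids moving any tasks. */
int pick_next(struct toy_prio_array **active, struct toy_prio_array **expired)
{
    if ((*active)->nr_active == 0) {
        struct toy_prio_array *tmp = *active;
        *active = *expired;
        *expired = tmp;
    }
    return first_set_bit((*active)->bitmap, MAX_PRIO);
}
```

The pointer swap is the whole point: requeueing thousands of expired tasks would be O(n), but exchanging two pointers is free.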

This was clever, and it worked well for servers. But it had problems.

Interactive responsiveness was handled through heuristics. The scheduler tried to identify "interactive" tasks (those that sleep frequently, like GUI apps) and boost their priority. These heuristics were fragile. A task that happened to sleep at the wrong intervals could be misclassified. Workloads that mixed CPU-bound and I/O-bound behavior in unusual patterns caused the heuristics to make bad decisions. Tuning the heuristics for one workload broke another.

The fundamental issue: the O(1) scheduler was trying to infer task behavior from indirect signals, then applying priority adjustments based on those inferences. It was a heuristic tower, and heuristic towers eventually collapse under edge cases.

Context Switching: The Hidden Cost

Before we move on to CFS, it's worth understanding what actually happens when the scheduler decides to run a different task. The context switch is the scheduler's core mechanical operation, and its cost shapes every scheduling decision.

When the kernel switches from task A to task B on a core, it executes a precise sequence of steps. First, it saves task A's register state: the general purpose registers (RAX through R15 on x86-64), the stack pointer (RSP), the instruction pointer (RIP), and the flags register. All of this goes into task A's thread_struct in the kernel.

Next comes the FPU and vector register state. Modern x86 processors have a large SIMD register file (the ZMM registers alone hold 2KB of state with AVX-512). Older kernels handled this lazily: the kernel marked the FPU as "not owned" and only saved/restored the state if the new task actually touched it, trapping the first FPU access with a #NM (Device Not Available) exception. Modern kernels save eagerly, performing the XSAVE/XRSTOR on every switch; lazy switching was removed entirely. Eager mode is simpler, avoids the exception overhead, and closes avenues for leaking FPU state between tasks.

Then the kernel switches the page tables by writing the new task's top-level page table address to the CR3 register. This is the most consequential step. On processors without PCID (Process Context ID) support, writing CR3 flushes the entire TLB (Translation Lookaside Buffer), meaning every virtual-to-physical address mapping cached in hardware is thrown away. The new task starts with a cold TLB and takes a page walk (potentially 4 levels of memory reads) on every memory access until the working set is re-cached. Processors with PCID support can tag TLB entries with a process identifier, allowing entries from multiple address spaces to coexist in the TLB. This avoids the flush, and it's a significant performance win on workloads with frequent context switches.

Finally, the kernel restores task B's register state and returns to userspace. The task resumes exactly where it left off.

The direct cost of this entire sequence is roughly 1-5 microseconds on modern hardware. That's not nothing, but it's not the real problem. The indirect cost is far worse.

When a task starts running on a core, its working set (the code and data it accesses frequently) lives in the L1 and L2 caches. A context switch doesn't flush these caches, but the new task's memory accesses gradually evict the old task's cache lines. When the original task runs again, its data is gone and every access is a cache miss until the working set is reloaded. This cache pollution penalty can be 10-100x more expensive than the register save/restore, depending on the working set size.

It gets worse when a task migrates to a different core entirely. The L1 and L2 caches are per-core, so migrating means starting completely cold. Even the L3 (shared across cores on the same socket) might not help if the data was only in L1/L2. This is why the scheduler tries hard to keep tasks pinned to the same core via the wake_affine heuristic, and why cross-NUMA migrations are treated as especially expensive.

The practical consequence: the scheduler isn't just choosing which task to run. It's constantly weighing the cost of switching against the benefit of fairness. Every additional context switch per second is a tax on overall system throughput.

CFS: The Completely Fair Scheduler (Linux 2.6.23, 2007)

Ingo Molnar replaced his own scheduler with CFS in 2007. The core insight was to stop guessing about task behavior and instead enforce a simple invariant: every task should get a fair share of CPU time proportional to its weight, and the task that has received the least relative CPU time should run next.

Virtual Runtime

CFS tracks each task's virtual runtime (vruntime), a monotonically increasing value that represents how much CPU time the task has consumed, normalized by its weight. When the scheduler needs to pick the next task, it picks the one with the lowest vruntime.

The key formula:

vruntime += delta_exec * (NICE_0_WEIGHT / task_weight)

Where delta_exec is wall-clock time spent running, NICE_0_WEIGHT is 1024 (the weight for nice 0), and task_weight depends on the task's nice value.

A task with a higher weight (lower nice value, higher priority) accumulates vruntime more slowly, so it stays at the front of the queue longer. A low-priority task's vruntime increases faster, pushing it to the back.
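As a minimal illustration of the update (plain integer math; the real kernel multiplies by precomputed inverse weights to avoid the division):

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024UL

/* Advance vruntime by delta_exec nanoseconds of wall-clock runtime,
 * scaled inversely by task weight -- the CFS formula above. */
uint64_t advance_vruntime(uint64_t vruntime, uint64_t delta_exec,
                          unsigned long weight)
{
    return vruntime + delta_exec * NICE_0_WEIGHT / weight;
}
```

A weight-2048 task charged 1ms of wall-clock time gains only 0.5ms of vruntime, so it stays leftmost in the tree for longer than a nice 0 competitor.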

Nice Values and Weights

The mapping from nice values to weights is exponential, not linear. Each nice level represents roughly a 10% difference in CPU share:

static const int sched_prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

A nice -20 task has weight 88761. A nice 19 task has weight 15. That means the high-priority task accumulates vruntime roughly 5917x slower. In practice, if both are running, the nice -20 task gets approximately 99.98% of CPU time.
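You can check that split directly. A small sketch in basis points (0.01% units), with the weights copied from the table above:

```c
#include <assert.h>

/* CPU share task A receives against task B under CFS's
 * weight-proportional split, in basis points (1/100 of a percent). */
unsigned long share_bp(unsigned long weight_a, unsigned long weight_b)
{
    return weight_a * 10000UL / (weight_a + weight_b);
}
```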

The Red-Black Tree

CFS stores all runnable tasks in a red-black tree, keyed by vruntime. The leftmost node (lowest vruntime) is cached for O(1) access. Insertion and removal are O(log n). This is the key data structure:

struct cfs_rq {
    struct rb_root_cached tasks_timeline;  // rb-tree of sched_entity
    u64              min_vruntime;         // monotonically increasing floor
    struct sched_entity *curr;            // currently running entity
    // ...
};

Timeslices and Granularity

CFS doesn't assign fixed timeslices. Instead, it uses a scheduling period (sysctl_sched_latency, default 6ms for up to 8 tasks). Each task gets a proportional share of this period based on its weight. If you have 4 equal-weight tasks, each gets 1.5ms per period.

There's a floor: sysctl_sched_min_granularity (default 0.75ms). This prevents the scheduler from context-switching too aggressively when many tasks are runnable. If you have 100 tasks, the period stretches to 100 * 0.75ms = 75ms rather than giving each task 0.06ms of a 6ms period.

$ sysctl kernel.sched_latency_ns
kernel.sched_latency_ns = 6000000
 
$ sysctl kernel.sched_min_granularity_ns
kernel.sched_min_granularity_ns = 750000
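The stretching rule is simple arithmetic. A sketch with the default constants (equal-weight tasks only; the real kernel scales each slice by weight):

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_LATENCY_NS  6000000ULL   /* 6ms default period */
#define SCHED_MIN_GRAN_NS  750000ULL   /* 0.75ms default floor */

/* Scheduling period for nr_running equal-weight tasks: the default
 * latency target, stretched once tasks would otherwise get less than
 * the minimum granularity (CFS's __sched_period logic, simplified). */
uint64_t sched_period(unsigned int nr_running)
{
    if (nr_running > SCHED_LATENCY_NS / SCHED_MIN_GRAN_NS)
        return (uint64_t)nr_running * SCHED_MIN_GRAN_NS;
    return SCHED_LATENCY_NS;
}

/* Each equal-weight task's slice within that period. */
uint64_t sched_slice(unsigned int nr_running)
{
    return sched_period(nr_running) / nr_running;
}
```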

Sleeping and Waking

When a task sleeps, it leaves the runqueue. When it wakes up, its vruntime might be far behind the current min_vruntime, which would let it monopolize the CPU while it "catches up." CFS prevents this by clamping a waking task's vruntime to no less than min_vruntime minus a bonus of roughly one scheduling latency period (halved under the default GENTLE_FAIR_SLEEPERS feature). This gives newly woken tasks a small boost (they'll be scheduled soon) without allowing them to starve other tasks.

This "sleeper fairness" mechanism was one of the trickiest parts of CFS. It had to balance two goals: reward tasks that voluntarily yielded the CPU (typically interactive tasks) without creating exploitable starvation patterns.
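The clamp itself is a one-liner. A sketch simplified from the kernel's place_entity, assuming the GENTLE_FAIR_SLEEPERS default of half a latency period of credit:

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_LATENCY_NS 6000000ULL  /* 6ms default */

/* Place a waking task's vruntime: grant at most half a latency period
 * of credit, so a task that slept for an hour cannot return with an
 * arbitrarily low vruntime and monopolize the CPU. */
uint64_t place_waking(uint64_t task_vruntime, uint64_t min_vruntime)
{
    uint64_t floor = min_vruntime - SCHED_LATENCY_NS / 2;
    return task_vruntime > floor ? task_vruntime : floor;
}
```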

Where CFS Fell Short

CFS served Linux well for 16 years, but it had known problems.

Latency unpredictability. CFS guarantees fairness over the scheduling period, but it doesn't guarantee when within that period a specific task will run. A latency-sensitive task (audio processing, game rendering) might have to wait for the entire rest of the period before getting its turn. The vruntime ordering doesn't encode urgency, only accumulated unfairness.

Sleeper fairness hacks. The wakeup vruntime adjustment was a heuristic. Aggressive sleepers could game it. The tuning knobs (sysctl_sched_wakeup_granularity) added complexity without fully solving the problem.

Group scheduling complexity. When cgroups entered the picture, CFS had to maintain hierarchical vruntime accounting across scheduling groups. This added significant code complexity and subtle bugs around weight calculation at different levels of the hierarchy.

The Completely Fair Group Scheduler (cgroups)

That last point about group scheduling deserves its own section, because it's how Linux handles the container scheduling problem that dominates modern infrastructure.

Without cgroup-aware scheduling, fairness is per-task. If cgroup A has 100 runnable threads and cgroup B has 1 thread, the single thread in B gets only 1/101 of the CPU. That's "fair" to each individual task, but wildly unfair at the group level: container A is getting 99% of the CPU simply by spawning more threads.

The group scheduler fixes this by enforcing fairness at each level of the cgroup hierarchy. Each cgroup gets its own CFS runqueue with its own min_vruntime. If cgroups A and B have equal weight, they split the CPU 50/50, regardless of internal thread counts. Container A's 100 threads share their group's 50% among themselves. Container B's single thread gets the full other 50%.

The weight hierarchy is multiplicative. A task's effective share is the product, at each level, of that node's weight divided by the combined weight of the node and its siblings. If a nice 0 task (weight 1024) is the only task inside a cgroup with weight 512, and that cgroup's sibling has weight 1024, the task gets 512/1536, or about a third of the CPU under full contention.
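A sketch of the multiplicative rule (hypothetical helper, basis points; each level supplies a node's weight and the combined weight of that node plus its siblings):

```c
#include <assert.h>

/* Effective CPU share of a task through a cgroup hierarchy: at each
 * level from root to leaf, multiply by the node's weight over the
 * total weight of the node and its siblings. Result in basis points
 * (1/100 of a percent). */
unsigned long hier_share_bp(const unsigned long *weights,
                            const unsigned long *level_totals,
                            int levels)
{
    unsigned long bp = 10000;
    for (int i = 0; i < levels; i++)
        bp = bp * weights[i] / level_totals[i];
    return bp;
}
```

This also reproduces the 100-thread example above: each of container A's threads gets its group's half divided a hundred ways.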

Three cgroup knobs control CPU allocation:

cpu.shares (cgroups v1) or cpu.weight (cgroups v2) sets the proportional weight. This only matters when there's contention. If a cgroup is the only one with runnable tasks, it gets all available CPU regardless of its weight.

cpu.cfs_period_us and cpu.cfs_quota_us (v1) or cpu.max (v2) enforce hard bandwidth limits. The quota defines how many microseconds of CPU time the cgroup can consume within each period. With a period of 100ms and a quota of 50ms, the cgroup is limited to 50% of one core, even if the rest of the system is completely idle.

This bandwidth throttling is the source of the infamous CPU throttling problem in Kubernetes. When a container hits its quota within a period, all its tasks are immediately throttled (moved off the runqueue) until the next period begins. The remaining CPU time in that period goes unused. A container with a CPU limit of 500m (500 millicores, meaning 50ms per 100ms period) can experience latency spikes every time it exhausts its budget mid-period, even if there are idle cores available. The request comes in, the application starts processing, burns through its 50ms quota, and then sits idle for up to 50ms waiting for the next period to start.
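The arithmetic behind that worst case is easy to pin down (a sketch; real throttling also interacts with how slices are distributed across CPUs):

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case stall a bandwidth-limited cgroup can see within one
 * enforcement period: burn the whole quota at the start of the
 * period, then wait for it to roll over. Times in microseconds. */
uint64_t worst_case_stall_us(uint64_t period_us, uint64_t quota_us)
{
    return quota_us >= period_us ? 0 : period_us - quota_us;
}
```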

This has led to a widespread Kubernetes best practice of setting CPU requests (which affect scheduling) but omitting CPU limits (which trigger throttling), or at least setting limits much higher than requests. Google's Borg paper noted this same tradeoff years before Kubernetes encountered it.

EEVDF: Earliest Eligible Virtual Deadline First (Linux 6.6, 2023)

EEVDF replaced CFS as the default scheduler for SCHED_NORMAL tasks in Linux 6.6. The algorithm was originally described in a 1995 paper by Stoica and Abdel-Wahab, and Peter Zijlstra implemented it for Linux.

The core improvement over CFS: tasks now have virtual deadlines, and scheduling decisions consider both fairness and urgency.

How EEVDF Works

Each task has three key values:

  • Virtual time (vruntime): Same concept as CFS, tracks accumulated weighted CPU usage.
  • Request (time slice): How much CPU time the task is requesting for its next execution.
  • Virtual deadline: vruntime plus the request scaled by the task's weight (request × NICE_0_WEIGHT / weight). The point in virtual time by which the task "should" have completed its requested slice.

The scheduling decision has two steps:

  1. Eligibility check: A task is eligible to run only if its vruntime is at or behind the queue's weighted-average vruntime, meaning it has not yet received its fair share. This prevents tasks that have already been overserved from being scheduled.
  2. Deadline selection: Among all eligible tasks, pick the one with the earliest virtual deadline.
The corresponding fields in the kernel's struct sched_entity:

struct sched_entity {
    u64                  vruntime;
    u64                  deadline;
    u64                  min_vruntime;      // subtree minimum
    s64                  vlag;              // fairness lag
    u64                  slice;             // requested time slice
    // ...
};
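The two-step decision can be modeled over a flat array (the kernel answers the same query in O(log n) with an augmented rb-tree; `pick_eevdf` and `struct ent` here are illustrative names, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

/* Toy EEVDF runqueue entry. */
struct ent {
    uint64_t vruntime;
    uint64_t deadline;
    unsigned long weight;
};

/* Two-step EEVDF pick: a task is eligible when its vruntime is at or
 * behind the weighted-average vruntime of the queue (the queue's
 * notion of "virtual now"), and among eligible tasks the earliest
 * virtual deadline wins. */
int pick_eevdf(const struct ent *e, int n)
{
    uint64_t wsum = 0, vsum = 0;
    for (int i = 0; i < n; i++) {
        wsum += e[i].weight;
        vsum += e[i].vruntime * e[i].weight;
    }
    uint64_t avg = vsum / wsum;

    int best = -1;
    for (int i = 0; i < n; i++) {
        if (e[i].vruntime > avg)  /* overserved: not eligible */
            continue;
        if (best < 0 || e[i].deadline < e[best].deadline)
            best = i;
    }
    return best;  /* -1 if nothing is eligible */
}
```

Note how the third task in the test below has the earliest deadline of all, yet loses: it is ahead of the queue's average vruntime, so the eligibility check filters it out before deadlines are compared.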

The Lag Concept

EEVDF introduces lag: the difference between how much CPU time a task was entitled to and how much it actually received. A positive lag means the task has been underserved. A negative lag means it has been overserved.

lag(t) = S(t) * weight_i / total_weight - actual_runtime_i

Where S(t) is the total system virtual time elapsed. Lag directly determines eligibility. A task with positive lag is eligible (it's owed CPU time). A task with negative lag is not (it's had more than its share).

Why EEVDF Is Better

The key insight is that virtual deadlines encode latency requirements naturally, without heuristics.

A task that requests a small time slice gets an earlier deadline. Interactive tasks, which typically run for short bursts, naturally get earlier deadlines and thus lower scheduling latency. A CPU-bound task requesting a long slice gets a later deadline, so it won't preempt the interactive task.

This eliminates the need for CFS's sleeper fairness hacks. There's no special wakeup bonus logic. The algorithm's structure inherently favors short-burst tasks without any explicit detection of "interactive" behavior.

The eligibility check prevents the fairness inversions that CFS could exhibit. In CFS, a high-weight task that just woke up could have a very low vruntime and monopolize the CPU. In EEVDF, it has to be eligible first.

Scheduler Latency in Practice: The Desktop Problem

Here's a scenario that frustrates desktop Linux users. You kick off make -j$(nproc) to compile a large project. While it runs, your browser stutters, window animations drop frames, and typing in a terminal feels laggy. The scheduler is being "fair," so why does it feel broken?

Because fairness and latency are different things. If you have 100 compiler processes and 1 browser process, all running at the same nice level under SCHED_NORMAL, the browser gets 1/101 of the CPU time. That's fair. But it also means the browser might wait for up to 100 time slices before getting scheduled, which can translate to 50-100ms of delay. Human perception notices latency above about 10ms for input response and 16ms for visual smoothness.

EEVDF helps significantly here. The browser's rendering thread runs in short bursts (a few hundred microseconds to paint a frame), which means it gets short time slice requests and therefore earlier virtual deadlines. The compiler threads request long slices and get later deadlines. So the browser thread will naturally be selected before compiler threads when it wakes up. But it's not magic: on a heavily loaded system with hundreds of runnable tasks, there's still contention for the physical cores, and EEVDF's improvement is bounded by the minimum granularity.

For users who want more control, several tools exist. schedtool and chrt can assign real-time priority to specific processes, though this requires care (a buggy SCHED_FIFO process can lock up a core). The gamemode daemon, popular in the Linux gaming community, automatically adjusts scheduling parameters, I/O priority, and CPU governor settings when a game launches, then reverts when it exits.

The PipeWire audio server takes a more direct approach: its audio processing threads run under SCHED_FIFO (or SCHED_DEADLINE where supported) to guarantee that audio buffer processing always completes before the hardware needs the next buffer. This is why PipeWire largely solved the audio latency problems that plagued PulseAudio on loaded systems.

There's an ongoing debate in the kernel community about whether Linux needs a dedicated "desktop" scheduling mode, or whether EEVDF with proper tuning knobs is sufficient. Projects like CachyOS and TKG ship patched kernels with alternative schedulers (like BORE, which adds "burstiness" detection), suggesting that at least some users think the mainline scheduler still leaves desktop responsiveness on the table.

Back in the implementation, the data structure is still a red-black tree (augmented with subtree deadline information for efficient "earliest eligible deadline" queries). The algorithmic complexity is the same as CFS: O(log n) for most operations. The improvement is in scheduling quality, not asymptotic performance.

You can see EEVDF-specific fields in /proc/[pid]/sched:

$ cat /proc/1/sched
...
se.vruntime                      :         29438.102926
se.deadline                      :         29442.102926
se.slice                         :             4.000000
se.vlag                          :            -0.235817
...

Scheduling Classes: The Full Hierarchy

The scheduler isn't monolithic. Linux uses a hierarchy of scheduling classes, each implementing its own policy. The scheduler walks the classes from highest to lowest priority, running the first class that has a runnable task.

SCHED_DEADLINE    (EDF - Earliest Deadline First, highest priority)
    |
SCHED_FIFO / SCHED_RR    (POSIX real-time)
    |
SCHED_NORMAL      (EEVDF, formerly CFS - where most tasks live)
    |
SCHED_BATCH       (same algorithm as NORMAL, but no wakeup preemption)
    |
SCHED_IDLE        (only runs when nothing else wants the CPU)

SCHED_DEADLINE is the hard real-time class. You specify a runtime, deadline, and period, and the kernel guarantees the task gets its runtime within each period using EDF scheduling. It's used for things like industrial control and professional audio pipelines.

SCHED_FIFO and SCHED_RR are the POSIX real-time policies. FIFO tasks run until they voluntarily yield or a higher-priority real-time task preempts them. RR adds round-robin timeslicing among tasks of equal priority.

The vast majority of processes on your system are SCHED_NORMAL.

SCHED_DEADLINE Deep Dive

The SCHED_DEADLINE class at the top of the hierarchy deserves more attention, because it's the only Linux scheduling policy that provides actual temporal guarantees.

Each SCHED_DEADLINE task declares three parameters when it's created:

  • Runtime: how much CPU time the task needs per period (e.g., 500us).
  • Deadline: the relative deadline by which the runtime must be delivered (e.g., 1ms).
  • Period: how often the task repeats (e.g., 2ms).

The kernel uses Earliest Deadline First (EDF) scheduling among all SCHED_DEADLINE tasks. At any given moment, the task with the soonest absolute deadline runs. When a task exhausts its runtime budget, it's throttled until the next period begins, even if no other task wants the CPU.

You configure this through the sched_setattr() syscall. Note that glibc gained a wrapper for it only recently, so on many systems you invoke it via syscall(SYS_sched_setattr, ...):

struct sched_attr {
    __u32 size;
    __u32 sched_policy;       // SCHED_DEADLINE
    __u64 sched_flags;
    __s32 sched_nice;         // unused for deadline
    __u32 sched_priority;     // unused for deadline
    __u64 sched_runtime;      // nanoseconds
    __u64 sched_deadline;     // nanoseconds
    __u64 sched_period;       // nanoseconds
};
 
struct sched_attr attr = {
    .size = sizeof(attr),
    .sched_policy = SCHED_DEADLINE,
    .sched_runtime  =  500000,   // 500us
    .sched_deadline = 1000000,   // 1ms
    .sched_period   = 2000000,   // 2ms
};
sched_setattr(0, &attr, 0);

The critical safety mechanism is the admission test. When you call sched_setattr(), the kernel checks whether the total CPU utilization of all deadline tasks (the sum of runtime / period across all SCHED_DEADLINE tasks on each core) stays below a threshold, which defaults to roughly 95%. If adding your task would exceed this, the syscall fails with EBUSY. This prevents you from overcommitting the CPU and creating a situation where deadline guarantees become impossible to honor.

# The admission threshold is controlled by:
cat /proc/sys/kernel/sched_rt_runtime_us    # default 950000
cat /proc/sys/kernel/sched_rt_period_us     # default 1000000
# Ratio = 950000/1000000 = 95%
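The utilization check itself is just a sum of ratios. A sketch simplified to a single CPU (the helper names are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdint.h>

struct dl_task {
    uint64_t runtime_ns;
    uint64_t period_ns;
};

/* Simplified SCHED_DEADLINE admission test: accept the candidate only
 * if total utilization (sum of runtime/period) stays at or under the
 * default 95% cap (sched_rt_runtime_us / sched_rt_period_us).
 * Utilization is tracked in parts-per-million to keep the math in
 * integers. Returns 1 to admit, 0 to reject (the kernel's EBUSY). */
int dl_admit(const struct dl_task *existing, int n,
             const struct dl_task *candidate)
{
    uint64_t util_ppm = 0;
    for (int i = 0; i < n; i++)
        util_ppm += existing[i].runtime_ns * 1000000ULL
                    / existing[i].period_ns;
    util_ppm += candidate->runtime_ns * 1000000ULL / candidate->period_ns;
    return util_ppm <= 950000ULL;
}
```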

Real-world use cases for SCHED_DEADLINE are narrow but important. The JACK audio server can use it to guarantee that audio processing completes within each buffer period, eliminating the xruns (buffer underruns) that plague audio on standard Linux. Industrial PLC (Programmable Logic Controller) replacements running on Linux need hard timing guarantees for control loops. Some low-latency trading systems use it to ensure market data processing completes within a bounded time window.

The EDF algorithm is provably optimal for single-processor deadline scheduling: if any algorithm can schedule a set of tasks without missing deadlines, EDF can too. On multiprocessor systems this optimality doesn't hold (the problem becomes NP-hard), but in practice, with per-CPU admission control and careful task placement, SCHED_DEADLINE works well for real-time workloads that would otherwise require a dedicated RTOS.

Multicore Scheduling and Load Balancing

Everything described above runs per-CPU. Each core has its own runqueue, its own red-black tree, its own min_vruntime. The scheduler's hot path (pick next task, context switch) only touches local per-CPU data, which is critical for scalability on large machines.

But work has to be distributed across cores. This is load balancing, and it's arguably harder than the single-core scheduling problem.

Linux organizes CPUs into scheduling domains: hierarchical groups that reflect the hardware topology. Cores sharing an L2 cache form one domain. All cores on a socket form another. Sockets on the same NUMA node form yet another.

NUMA node 0           NUMA node 1
  Socket 0              Socket 1
  [Core0 Core1]         [Core4 Core5]
  [Core2 Core3]         [Core6 Core7]

Load balancing runs at each level of the domain hierarchy, with increasing reluctance at higher levels. Migrating a task between cores sharing L2 is cheap. Migrating across sockets invalidates caches. Migrating across NUMA nodes means the task's memory is now remote, adding latency to every memory access.

The balancer considers both load (how much work each core has) and the cost of migration. It uses periodic balance ticks and also triggers on specific events (a core going idle, a task waking up).

The wake_affine heuristic tries to wake a task on the same core (or a nearby core) that it last ran on, preserving cache warmth. This single optimization has a measurable impact on many workloads.

The Idle Task and Power Management

What happens when a core has absolutely nothing to run? It doesn't just spin in a loop waiting. The kernel runs a special per-core idle task (swapper/N for core N), and this task's job is to put the core to sleep as efficiently as possible.

The idle task invokes the cpuidle governor, which decides how deep the core should sleep. x86 processors define a hierarchy of power states (C-states):

  • C0: Active. The core is executing instructions normally.
  • C1 (halt): The core stops executing but retains all state. Wake latency is roughly 1 microsecond.
  • C1E (enhanced halt): Same as C1 but at lower voltage. Slightly deeper savings, slightly longer wake time.
  • C3 (sleep): The L1 cache is flushed and the core clock is gated. Wake latency around 30-50 microseconds.
  • C6 (deep sleep): The L2 cache is flushed and core voltage is dropped near zero. Wake latency of 100+ microseconds.

Deeper C-states save more power, but the wake-up cost directly impacts scheduling latency. If a task wakes up and needs to run on a core in C6, it has to wait 100+ microseconds for the core to power back on, restore its caches, and begin executing. For latency-sensitive workloads, that's unacceptable.

The kernel ships two cpuidle governors that make this tradeoff. The menu governor (default on most systems) uses a prediction model based on expected sleep duration, recent interrupt patterns, and performance multipliers. The teo (Timer Events Oriented) governor, which is simpler and sometimes more accurate, makes predictions primarily from upcoming timer events. Both try to pick the deepest C-state that won't cause a painful wake-up when the next event arrives.
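Stripped of their prediction machinery, both governors reduce to the same comparison. A sketch with illustrative names (real governors also weigh exit latency against a per-state target residency):

```c
#include <assert.h>
#include <stdint.h>

struct cstate {
    const char *name;
    uint64_t wake_latency_us;
};

/* Pick the deepest C-state whose wake-up latency fits both the
 * predicted idle duration and the PM QoS latency cap -- the core
 * tradeoff the menu and teo governors make. States must be ordered
 * shallow to deep; the shallowest is always allowed. */
int pick_cstate(const struct cstate *s, int n,
                uint64_t predicted_idle_us, uint64_t qos_cap_us)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (s[i].wake_latency_us <= predicted_idle_us &&
            s[i].wake_latency_us <= qos_cap_us)
            best = i;
    return best;
}
```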

For real-time or latency-sensitive workloads, you can override these decisions:

# Limit max C-state at boot (0 = C0 only, 1 = up to C1, etc.)
# Kernel parameter:
processor.max_cstate=1
 
# Or at runtime via PM QoS: open /dev/cpu_dma_latency, write a 32-bit
# latency value (0 disables deep sleep), and keep the fd open -- the
# request is dropped when the file is closed, so a one-shot
# `echo 0 > /dev/cpu_dma_latency` has no lasting effect

There's an interesting interplay with Turbo Boost and frequency scaling. When cores enter deep sleep states, they free up thermal and power budget. The remaining active cores can use that headroom to boost to higher frequencies. A system running two busy threads on a 16-core processor might see those two cores boosted significantly above base clock, precisely because the other 14 cores are in C6. The scheduler is aware of this: it prefers to pack work onto fewer cores rather than spread it thin, because concentrated work plus deep idle gives both better per-thread performance and lower total power consumption.

Practical Tooling

Inspecting Scheduler State

Per-task scheduling info:

# Scheduler statistics for a specific PID
cat /proc/<pid>/sched
 
# System-wide scheduler statistics
cat /proc/schedstat
 
# What scheduling policy a process is using
chrt -p <pid>

Setting Scheduling Policy

# Run a command with real-time FIFO priority 50
chrt -f 50 ./my_realtime_app
 
# Run with nice value -5 (higher priority)
nice -n -5 ./my_app
 
# Change nice value of running process
renice -n -5 -p <pid>
 
# Set SCHED_BATCH
chrt -b 0 ./my_batch_job
 
# Set SCHED_DEADLINE (runtime/deadline/period in nanoseconds)
chrt -d --sched-runtime 500000 --sched-deadline 1000000 \
        --sched-period 2000000 ./my_deadline_app

Kernel Configuration

Several compile-time options affect scheduler behavior:

  • CONFIG_HZ (100, 250, 300, 1000): Timer interrupt frequency. Higher values mean finer scheduling granularity but more overhead. Desktop kernels typically use 1000, servers use 250.
  • CONFIG_PREEMPT_NONE: No kernel preemption. Best throughput, worst latency.
  • CONFIG_PREEMPT_VOLUNTARY: Preemption at explicit yield points. A balance for servers.
  • CONFIG_PREEMPT: Full kernel preemption. Best latency, some throughput cost. Standard for desktop kernels.
  • CONFIG_PREEMPT_RT: The PREEMPT_RT patchset (mainline since 6.12). Makes nearly all kernel code preemptible. Used for hard real-time requirements.

Preemption Models Explained

The preemption config options above deserve a deeper look, because they fundamentally change how the kernel behaves under load.

PREEMPT_NONE means kernel code runs to completion. When a userspace process makes a syscall, that kernel code path runs until it finishes, regardless of what else is waiting. If a filesystem operation takes 10ms to complete, the core is locked up for the full 10ms. No other task can run on that core, even if a high-priority real-time task just woke up. This gives the best throughput because there's zero preemption overhead, but worst-case latency can be terrible. It's the right choice for dedicated server workloads where throughput matters more than responsiveness.

PREEMPT_VOLUNTARY adds explicit cond_resched() calls at known long-running points in the kernel. These are manual checkpoints scattered throughout the code that effectively ask "has something more important shown up while I've been running?" If so, the current task yields. This catches the worst offenders (long memory copies, filesystem traversals, device driver loops) without adding overhead everywhere. It's a pragmatic middle ground, and it's what RHEL and many server-oriented distributions use.

PREEMPT (full kernel preemption) allows the scheduler to preempt kernel code at almost any point. The only exceptions are code holding a spinlock, running inside a hardware interrupt handler, or in an explicitly marked critical section. Every spinlock release and interrupt return becomes a potential rescheduling point. This means a high-priority task that wakes up will typically run within microseconds, even if the core was in the middle of a syscall. The cost is real but modest: the preemption checks add overhead to every spinlock operation, and more frequent context switches mean more cache disruption. Desktop-oriented distributions generally ship kernels with this model available, often via PREEMPT_DYNAMIC, which lets the model be selected at boot with preempt=full.

PREEMPT_RT (fully preemptible, mainlined in kernel 6.12 after nearly two decades of development) takes a fundamentally different approach. Spinlocks are converted to sleeping mutexes (called rt_mutex), which means code holding a "spinlock" can itself be preempted. Hardware interrupt handlers are forced into kernel threads, so they can be scheduled and prioritized like any other task. With these changes, virtually all kernel code becomes preemptible. The result: worst-case scheduling latency drops from milliseconds to single-digit microseconds. The throughput cost is significant (roughly 5-15% on some workloads) because sleeping mutexes are heavier than spinlocks and the threaded interrupt model adds context switch overhead. But for industrial control systems, professional audio production, and high-frequency trading, deterministic latency matters more than peak throughput.

The choice propagates through the entire system. A kernel compiled with PREEMPT_NONE might show 10ms worst-case latency on a loaded system. The same hardware with PREEMPT might show 1ms. With PREEMPT_RT, you can reliably hit 10-50 microseconds.
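To find out which model a given machine is actually running, check the build config, and on PREEMPT_DYNAMIC kernels (where the model can be switched with preempt= on the kernel command line) read the active choice from debugfs. A sketch; config paths vary by distribution and the debugfs read needs root:

```shell
# Build-time preemption options (config location varies by distro)
grep -E '^CONFIG_PREEMPT' "/boot/config-$(uname -r)" 2>/dev/null \
  || { zcat /proc/config.gz 2>/dev/null | grep -E '^CONFIG_PREEMPT'; } \
  || echo "kernel config not exposed on this system"

# On PREEMPT_DYNAMIC kernels, the currently active model (shown in parentheses)
cat /sys/kernel/debug/sched/preempt 2>/dev/null || echo "debugfs unavailable (needs root)"
```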

Runtime Tuning

Key sysctls (note: some of these changed or were removed with EEVDF):

# CFS scheduling period (sysctl removed in 5.13; the tunable is gone under EEVDF)
sysctl kernel.sched_latency_ns
 
# Minimum granularity per task (same caveat; EEVDF's closest analogue is base_slice_ns)
sysctl kernel.sched_min_granularity_ns
 
# NUMA balancing
sysctl kernel.numa_balancing
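On kernels from 5.13 onward the scheduler knobs live under debugfs rather than sysctl, and EEVDF trimmed the set further, leaving base_slice_ns as the main slice-length control. A hedged way to see what your kernel actually exposes (reading debugfs needs root):

```shell
# What scheduler knobs does this kernel expose? (moved here from sysctl in 5.13)
ls /sys/kernel/debug/sched/ 2>/dev/null || echo "debugfs not mounted or not root"

# On EEVDF kernels (6.6+) the main remaining tunable is the base time slice
cat /sys/kernel/debug/sched/base_slice_ns 2>/dev/null \
  || echo "no base_slice_ns (pre-EEVDF kernel, or no debugfs access)"
```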

How to Debug Scheduling Problems

When something feels slow and you suspect the scheduler, don't guess. Linux has excellent tooling for capturing exactly what the scheduler is doing.

perf sched is the starting point. Record scheduling events across the system, then analyze them:

# Record scheduling events for 10 seconds
perf sched record -- sleep 10
 
# Show per-task scheduling latency (time between wakeup and actually running)
perf sched latency
 
# Output looks like:
#  Task                  |   Runtime ms  | Switches | Avg delay ms | Max delay ms |
#  firefox              |    312.450 ms |      847 |    0.082 ms  |    4.231 ms  |
#  kworker/3:0          |      8.120 ms |      203 |    0.014 ms  |    0.892 ms  |

The "Max delay" column is what matters for latency-sensitive work. If your audio thread has a max delay of 4ms and your buffer is 5ms, you're one spike away from a glitch.
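perf sched timehist (in reasonably recent perf versions) complements the latency summary with a per-event timeline, which is often the fastest way to see exactly what ran instead of your task. It consumes the same perf.data produced by perf sched record:

```shell
# Per-event timeline: wait time and scheduling delay for every context switch
perf sched timehist 2>/dev/null || echo "no perf.data: run 'perf sched record' first"

# Per-task summary (-s) with run times and delay maxima
perf sched timehist -s 2>/dev/null || echo "no perf.data: run 'perf sched record' first"
```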

trace-cmd captures all scheduler tracepoints with minimal overhead:

# Record all sched events
trace-cmd record -e sched -e irq sleep 5
 
# View the trace
trace-cmd report | less
 
# Filter for a specific task
trace-cmd report | grep "firefox"

The sched_switch tracepoint shows every context switch: which task was running, which task replaced it, and why the previous task stopped (blocked on I/O, preempted, yielded). The sched_wakeup tracepoint shows when a task becomes runnable, and sched_migrate_task shows when a task moves between cores.
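trace-cmd can also filter at record time, so the trace only contains the events you care about, using the standard tracepoint filter syntax. A sketch (the task name is a placeholder; recording needs root):

```shell
# Capture only migrations of one task; 'firefox' is a placeholder name
trace-cmd record -e sched_migrate_task -f "comm == 'firefox'" sleep 5 2>/dev/null \
  || echo "trace-cmd not installed or not running as root"
```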

cyclictest is the standard benchmark for measuring worst-case scheduling latency. It's part of the rt-tests package:

# Run on all cores, 1000us interval, 10000 loops, SCHED_FIFO priority 99
cyclictest -a -t -n -i 1000 -l 10000 -p 99
 
# Output:
# T: 0 ( 1842) P:99 I:1000 C:  10000 Min:      1 Act:    3 Avg:    2 Max:      18
# T: 1 ( 1843) P:99 I:1000 C:  10000 Min:      1 Act:    2 Avg:    2 Max:      42

The "Max" value is the worst-case latency in microseconds. On a standard kernel, you might see values in the hundreds of microseconds. On PREEMPT_RT, single-digit microseconds is typical. If Max is in the milliseconds, something is wrong: a kernel driver holding a lock too long, an interrupt storm, or a misconfigured power management state.

bpftrace gives you custom scheduler analysis without recompiling anything:

# Histogram of runqueue latency (time from wakeup to running)
bpftrace -e 'tracepoint:sched:sched_wakeup { @start[args->pid] = nsecs; }
tracepoint:sched:sched_switch /args->next_pid/ {
  if (@start[args->next_pid]) {
    @usecs = hist((nsecs - @start[args->next_pid]) / 1000);
    delete(@start[args->next_pid]);
  }
}'
 
# Count context switches per process
bpftrace -e 'tracepoint:sched:sched_switch { @[args->next_comm] = count(); }'

ftrace, the kernel's built-in tracer, is always available even when other tools aren't installed. It's exposed through tracefs, mounted at /sys/kernel/tracing on modern kernels (the older /sys/kernel/debug/tracing path still works):

# Enable the sched_switch tracepoint
echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
cat /sys/kernel/tracing/trace_pipe

A practical debugging workflow: start with perf sched latency to identify which tasks have high scheduling delays. Use trace-cmd to capture a window around the problem. Look at the sched_switch events to see what was running instead of your latency-sensitive task. If a kernel thread or interrupt handler is the culprit, ftrace with function tracing can show you exactly what kernel code path is holding things up.

The Takeaway

The Linux scheduler's evolution tells a clear story. The O(1) scheduler solved the scalability problem but relied on heuristics for quality. CFS replaced heuristics with a clean fairness model but couldn't express latency requirements. EEVDF keeps CFS's fairness foundation and adds deadline-awareness, giving the scheduler a principled way to handle latency-sensitive tasks without special-case logic.

If you're running kernel 6.6 or later, you're already on EEVDF. If you're tuning a workload and wondering why scheduling behavior changed after a kernel upgrade, this is probably why. The fairness model is the same, but the task selection logic now accounts for deadlines, and most of the old sleeper-fairness tunables are gone.

The scheduler is one of those subsystems where reading the code is genuinely rewarding. Start with kernel/sched/fair.c and kernel/sched/sched.h. The data structures are elegant, the comments are thorough, and the commit history is a masterclass in incremental systems improvement.