This section describes the Crucible upstairs architecture before the refactoring in crucible#1058 and the motivation for that refactoring. As such, statements in the present tense below ("The Crucible upstairs is implemented as a set of async tasks") refer to the pre-refactoring architecture; the current architecture is described in the section titled "Counter-Proposal: One Big Task".

The Crucible upstairs is implemented as a set of async tasks. These tasks are mostly static, though a few may be spawned at runtime (notably live-repair). The tasks communicate through a mix of message-passing and shared data behind either synchronous or async locks. Even when message-passing is used, it is often a "doorbell" that wakes a task and tells it to check some mutex-protected data structure.
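The "doorbell" pattern is straightforward to sketch with tokio primitives: a sender updates the shared, mutex-protected state and then rings a `Notify`, and the receiving task wakes up and re-checks that state. The `SharedWork` type and field names below are illustrative stand-ins, not the actual Crucible structures.

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{Mutex, Notify};

/// Illustrative stand-in for the mutex-protected state a task is told to check.
#[derive(Default)]
struct SharedWork {
    pending: Vec<u64>,
}

#[tokio::main]
async fn main() {
    let work = Arc::new(Mutex::new(SharedWork::default()));
    let bell = Arc::new(Notify::new());

    // Worker task: woken by the doorbell, then inspects the shared state.
    let worker = {
        let (work, bell) = (work.clone(), bell.clone());
        tokio::spawn(async move {
            loop {
                bell.notified().await;
                let mut guard = work.lock().await;
                for job in guard.pending.drain(..) {
                    println!("processing job {job}");
                }
            }
        })
    };

    // Producer side: mutate the shared state, then ring the doorbell.
    work.lock().await.pending.push(1);
    bell.notify_one();

    // Let the worker run briefly, then shut it down (demo only).
    tokio::time::sleep(Duration::from_millis(10)).await;
    worker.abort();
}
```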
A `BlockOp` request ("job") normally passes through eight tasks:

(Dotted lines are a reminder that the "fast ack" optimization means that writes don't wait for the Downstairs; instead, those jobs are marked as ackable right away. Of course, reads and flushes must go to the Downstairs.)
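To make the fast-ack rule concrete, here is a purely illustrative sketch; the `Op` enum is a simplified stand-in, not Crucible's real `IOop`:

```rust
/// Simplified stand-in for the real IOop type, for illustration only.
enum Op {
    Write,
    Read,
    Flush,
}

/// Can this job be acked to the guest before any Downstairs has replied?
fn fast_ackable(op: &Op) -> bool {
    matches!(op, Op::Write)
}

fn main() {
    assert!(fast_ackable(&Op::Write));
    assert!(!fast_ackable(&Op::Read));
    assert!(!fast_ackable(&Op::Flush));
}
```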
These tasks mostly manipulate data stored in a single `tokio::sync::Mutex<Downstairs>`. A single job locks that mutex many times (a sketch of the pattern follows the list):
- 1× in `up_listen` in `process_new_io` → `submit_*`
- 3× + 3× in `cmd_loop` → `io_send`
    - The first (3×) lock is amortized if multiple jobs are available at the same time
- 3× in `process_ds_operation` (called in `pm_task`)
- 3× + 1× in `up_ds_listen` to ack work to the guest
    - The first (3×) lock is amortized if multiple jobs are ackable at the same time
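Schematically, the shape is a single `Arc<tokio::sync::Mutex<Downstairs>>` that every task clones and re-locks at each step of a job's life. The sketch below uses placeholder types and empty function bodies; only the locking pattern is the point.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

/// Placeholder for the real Downstairs struct (job maps, per-client state,
/// ackable list, ...).
#[derive(Default)]
struct Downstairs {
    jobs: Vec<u64>,
}

async fn submit(ds: &Arc<Mutex<Downstairs>>, job: u64) {
    ds.lock().await.jobs.push(job); // one lock per submitted job
}

async fn io_send(ds: &Arc<Mutex<Downstairs>>, _client: usize) {
    let _guard = ds.lock().await; // locked again, per client, per job
    // ... serialize and send the job to this client's Downstairs ...
}

async fn ack(ds: &Arc<Mutex<Downstairs>>) {
    let _guard = ds.lock().await; // and again when acking back to the guest
    // ... mark jobs as acked ...
}

#[tokio::main]
async fn main() {
    let ds = Arc::new(Mutex::new(Downstairs::default()));
    submit(&ds, 1).await;
    for client in 0..3 {
        io_send(&ds, client).await;
    }
    ack(&ds).await;
}
```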
In other words, we have to lock the mutex between 11 and 15 times to process a single job. The locks in `io_send` are particularly troublesome, because they're highly contended: all 3× downstairs are trying to send data simultaneously, and the lock forces them to take turns.
With this lock contention, splitting the downstairs client work between multiple tasks doesn't actually buy us anything. Running all three `cmd_loop` tasks together using `FuturesUnordered` actually showed 1% faster performance in a quick (unscientific) benchmark!
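For reference, driving all three per-client loops from a single task with `FuturesUnordered` looks roughly like this; the types are toys, not the real `cmd_loop`/`io_send` code, but the structure is the same: the three futures are polled cooperatively from one task instead of racing against each other from separate tasks.

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::sync::Arc;
use tokio::sync::Mutex;

/// Toy shared state: per-client count of jobs "sent".
#[derive(Default)]
struct Downstairs {
    sent: [u64; 3],
}

/// Toy version of a per-client send loop.
async fn client_loop(ds: Arc<Mutex<Downstairs>>, client: usize) {
    for _job in 0..10 {
        // Same lock as before, but the three loops now take turns
        // cooperatively within one task rather than contending in parallel.
        ds.lock().await.sent[client] += 1;
    }
}

#[tokio::main]
async fn main() {
    let ds = Arc::new(Mutex::new(Downstairs::default()));
    let mut clients: FuturesUnordered<_> = (0..3)
        .map(|i| client_loop(ds.clone(), i))
        .collect();

    // Poll all three loops to completion from this single task.
    while clients.next().await.is_some() {}

    assert_eq!(ds.lock().await.sent, [10, 10, 10]);
}
```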
## What data is in the system?
The Crucible upstairs stores a bunch of state, which we can group into a few different categories (a structural sketch follows the lists):
- **Singleton data**
    - Upstairs state (`UpstairsState`)
    - List of ackable jobs (stored separately, as an optimization to skip iterating over every job and checking if it is ackable)
    - Guest work map
    - Global live-repair (e.g. assigned job IDs)
- **Per-client data**
    - Client state (`DsState`)
    - Statistics (`IOStateCount`, `downstairs_errors`, `live_repair_completed/aborted`, etc.)
    - Last flush (as a `JobId`)
    - New and skipped jobs
    - Live-repair data (extents limit)
- **Per-job data**
    - Various IDs
    - Actual job work (`IOop`)
    - Whether the job is ackable
    - Whether the job is a replay
    - Job data (read response, read response hashes)
- **Per-job + per-client data**
    - IO state (`IOState`)
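One way to picture these categories is as a set of nested owning structs: singleton state at the top, per-client and per-job state in arrays and maps, and the per-job + per-client state stored inside each job. The field and type names below are invented for illustration, not Crucible's actual layout.

```rust
// Purely illustrative grouping of the categories above; names are stand-ins,
// not Crucible's actual types or fields.
#![allow(dead_code)]

use std::collections::BTreeMap;

struct JobId(u64);

/// Per-job + per-client data (stand-in for IOState).
enum IoState {
    New,
    InProgress,
    Done,
    Skipped,
}

/// Per-job data.
struct Job {
    work: String,        // stand-in for the actual IOop
    ackable: bool,
    replay: bool,
    state: [IoState; 3], // per-job + per-client
}

/// Per-client data.
struct Client {
    last_flush: JobId,
    new_jobs: Vec<JobId>,
    skipped_jobs: Vec<JobId>,
}

/// Singleton data, owning everything else.
struct Upstairs {
    ackable_jobs: Vec<JobId>, // kept separately as an optimization
    jobs: BTreeMap<u64, Job>,
    clients: [Client; 3],
}

fn main() {}
```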
I went through and hand-annotated which functions use which data, then wrote a Python script to convert it into a per-task table. Since this table is hand-assembled, it may not be 100% correct, but it's at least mostly right.

Variables ending in `[i]` indicate that only one client's data is being accessed; `[*]` indicates that multiple clients' data is used.
*[Per-task data-access table (cells marked R/W or —)]*