4 changes: 3 additions & 1 deletion contributing/RUNS-AND-JOBS.md
@@ -61,7 +61,9 @@ Services' run lifecycle has some modifications:
 ## Job's Lifecycle
 
 - STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
-- STEP 2: `JobSubmittedPipeline` tries to assign an existing instance or provision new capacity.
+- STEP 2: `JobSubmittedPipeline` assigns the job in two phases:
+  - Assignment: claim an existing instance or reserve a *placeholder* `InstanceModel`. Placeholders are `PENDING` instances that reserve an `instance_num` and a `nodes.max` slot. `InstancePipeline` ignores them.
+  - Provisioning: reuse the existing instance, or cloud-provision and promote the placeholder to `PROVISIONING`.
   - On success, the job becomes `PROVISIONING`.
   - On failure, the job becomes `TERMINATING`. `JobTerminatingPipeline` later assigns the final failed status.
 - STEP 3: `JobRunningPipeline` processes `PROVISIONING`, `PULLING`, and `RUNNING` jobs.
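The two-phase flow described in STEP 2 can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `Instance`, `is_placeholder`, `reserve_placeholder`, and `promote` are hypothetical names, and the placeholder predicate (a `PENDING` instance with `provisioning_job_id` set) is an assumption inferred from the `InstancePipeline` fetch filter in this PR. The real `is_placeholder_instance` body is not shown in the diff.

```python
import itertools
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class InstanceStatus(str, Enum):
    PENDING = "pending"
    PROVISIONING = "provisioning"


@dataclass
class Instance:
    """Hypothetical stand-in for InstanceModel; only the fields used here."""

    instance_num: int
    status: InstanceStatus
    provisioning_job_id: Optional[uuid.UUID] = None


def is_placeholder(instance: Instance) -> bool:
    # Assumption: mirrors the fetch filter (PENDING and provisioning_job_id set).
    return (
        instance.status == InstanceStatus.PENDING
        and instance.provisioning_job_id is not None
    )


def reserve_placeholder(
    fleet: List[Instance], job_id: uuid.UUID, nodes_max: Optional[int]
) -> Optional[Instance]:
    """Assignment phase: reserve the lowest free instance_num if a slot is left."""
    if nodes_max is not None and len(fleet) >= nodes_max:
        return None  # no nodes.max slot available
    taken = {i.instance_num for i in fleet}
    num = next(n for n in itertools.count() if n not in taken)
    placeholder = Instance(num, InstanceStatus.PENDING, job_id)
    fleet.append(placeholder)
    return placeholder


def promote(placeholder: Instance) -> None:
    """Provisioning phase: capacity was obtained, so the placeholder becomes real."""
    placeholder.status = InstanceStatus.PROVISIONING
```

Because the placeholder already occupies an `instance_num` and counts toward `nodes.max`, a concurrent job cannot claim the same slot while the cloud request is in flight.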
11 changes: 9 additions & 2 deletions src/dstack/_internal/server/background/pipeline_tasks/fleets.py
@@ -49,7 +49,10 @@
     is_fleet_empty,
     is_fleet_in_use,
 )
-from dstack._internal.server.services.instances import instance_matches_constraints
+from dstack._internal.server.services.instances import (
+    instance_matches_constraints,
+    is_placeholder_instance,
+)
 from dstack._internal.server.services.locking import get_locker
 from dstack._internal.server.services.pipelines import PipelineHinterProtocol
 from dstack._internal.server.utils import sentry_utils
@@ -935,8 +938,12 @@ def _select_current_master_instance_id(
             return instance_model.id
 
     # Prefer existing surviving instances over freshly planned replacements to
-    # avoid election churn during min-nodes backfill.
+    # avoid election churn during min-nodes backfill. Skip placeholders —
+    # they have no JPD and cannot anchor cluster placement, so electing one
+    # just defers the real master decision.
     for instance_model in surviving_instance_models:
+        if is_placeholder_instance(instance_model):
+            continue
         if (
             _get_effective_instance_status(
                 instance_model,
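The effect of the placeholder skip on master election can be illustrated with a small sketch. `Candidate`, `select_master`, and the `is_active` flag are hypothetical: `is_active` stands in for the `_get_effective_instance_status(...)` check, whose condition is truncated in the hunk above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    """Hypothetical view of a surviving instance."""

    id: str
    is_placeholder: bool
    is_active: bool  # assumed stand-in for the truncated status check


def select_master(surviving: Iterable[Candidate]) -> Optional[str]:
    """Return the first surviving non-placeholder candidate that passes the check."""
    for candidate in surviving:
        if candidate.is_placeholder:
            # No JPD yet, so it cannot anchor cluster placement.
            continue
        if candidate.is_active:
            return candidate.id
    return None
```

If every surviving instance is a placeholder, this sketch returns `None`, leaving the decision to whatever fallback the caller applies.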
@@ -179,6 +179,13 @@ async def fetch(self, limit: int) -> list[InstancePipelineItem]:
                         InstanceModel.compute_group_id.is_not(None),
                     )
                 ),
+                # Skip placeholder instances managed by JobSubmittedPipeline.
+                not_(
+                    and_(
+                        InstanceModel.status == InstanceStatus.PENDING,
+                        InstanceModel.provisioning_job_id.is_not(None),
+                    )
+                ),
                 InstanceModel.deleted == False,
                 or_(
                     # Process fast-moving instances (pending, provisioning, terminating)
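To see which rows the added `not_(and_(...))` clause excludes, here is a rough plain-SQL equivalent run against an in-memory SQLite table. The table layout, row names, and status strings are illustrative assumptions, not the project's schema. Only the shape of the predicate follows the diff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances ("
    "name TEXT, status TEXT, provisioning_job_id TEXT, deleted INTEGER)"
)
conn.executemany(
    "INSERT INTO instances VALUES (?, ?, ?, ?)",
    [
        ("placeholder", "pending", "job-1", 0),  # PENDING and job id set: skipped
        ("plain-pending", "pending", None, 0),  # PENDING without job id: fetched
        ("provisioning", "provisioning", "job-2", 0),  # not PENDING: fetched
        ("deleted", "pending", None, 1),  # excluded by the deleted check
    ],
)

rows = conn.execute(
    "SELECT name FROM instances "
    "WHERE NOT (status = 'pending' AND provisioning_job_id IS NOT NULL) "
    "AND deleted = 0 "
    "ORDER BY name"
).fetchall()
fetched = [name for (name,) in rows]
print(fetched)
```

A `PENDING` row with no `provisioning_job_id` still matches, so under this predicate pending instances without a provisioning job would continue to be picked up by `InstancePipeline`.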