Description
When using the NAS backup plugin on KVM, if a backup job fails (e.g. due to backup storage being full or I/O errors on the NFS target), the VM remains indefinitely paused at the hypervisor level. CloudStack marks the backup as Error but does not resume the VM, leaving it unresponsive until manually resumed via virsh resume.
Steps to Reproduce
- Configure NAS backup with NFS storage for a running KVM VM
- Fill up the NFS backup storage to 100% capacity
- Wait for the scheduled backup to trigger
- Observe the VM becomes paused and never resumes
Expected Behavior
The VM should be automatically resumed after a backup failure. The backup should be marked as failed, but the VM should continue running normally.
Actual Behavior
The VM remains in a paused state indefinitely. The backup monitoring loop in nasbackup.sh enters an infinite cycle:
- virsh backup-begin pauses the QEMU domain for a consistent snapshot
- The backup write fails (storage full)
- domjobinfo reports Failed status
- cleanup() is called but does not resume the VM
- No exit statement after cleanup(), so the loop continues, repeatedly detecting the failed job
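The cycle above can be sketched as a simplified polling loop. This is illustrative only: monitor_backup_job, get_job_status, and cleanup are stand-in names, not the script's exact code; in the real script the status comes from parsing virsh domjobinfo output.

```shell
# Simplified sketch of the nasbackup.sh polling loop described above.
# monitor_backup_job, get_job_status, and cleanup are illustrative stand-ins.
monitor_backup_job() {
    while true; do
        status=$(get_job_status)   # e.g. parsed from virsh domjobinfo
        case "$status" in
            Completed)
                cleanup
                return 0 ;;
            Failed)
                echo "Virsh backup job failed"
                cleanup ;;         # BUG: no exit/return, so the loop keeps spinning
        esac
        sleep 1                    # with the bug, a Failed job re-enters here forever
    done
}
```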
Root Cause Analysis
Three bugs in scripts/vm/hypervisor/kvm/nasbackup.sh:
Bug 1: Missing exit after failed backup cleanup (line 144)
```shell
case "$status" in
    Failed)
        echo "Virsh backup job failed"
        cleanup ;;   # <-- no exit, falls through to sleep and loops forever
esac
```
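A minimal fix is to terminate once the failed job has been cleaned up. Sketched here as a helper so the behavior is testable; handle_job_status is an illustrative name, and the actual patch would simply add exit 1 in the Failed case arm:

```shell
# Sketch of the fixed status handling; handle_job_status is an assumed
# wrapper name. The real fix adds `exit 1` directly after cleanup.
handle_job_status() {
    local status="$1"
    case "$status" in
        Completed)
            cleanup
            return 0 ;;
        Failed)
            echo "Virsh backup job failed"
            cleanup
            return 1 ;;   # caller exits 1 here, which stops the polling loop
    esac
    return 2              # job still running; caller sleeps and polls again
}
```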
Bug 2: cleanup() never resumes the VM (line 222)
The cleanup() function only removes files and unmounts storage. It never checks if the VM is paused or attempts to resume it, even though virsh backup-begin may have paused the domain.
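The missing logic could look roughly like the following; resume_if_paused is a hypothetical helper name, and the real patch would fold this check into cleanup():

```shell
# Illustrative helper that cleanup() could call before unmounting storage;
# resume_if_paused is an assumed name, not taken from the actual script.
resume_if_paused() {
    local vm="$1"
    local state
    state=$(virsh domstate "$vm" 2>/dev/null)
    if [ "$state" = "paused" ]; then
        echo "Backup failed while $vm was paused by backup-begin; resuming"
        virsh resume "$vm"
    fi
}
```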
Bug 3: Missing exit in backup_stopped_vm() (line 181)
Similar to Bug 1, backup_stopped_vm() calls cleanup() on qemu-img convert failure but does not exit, allowing the loop to continue processing subsequent disks.
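The shape of that bug can be sketched as a per-disk copy loop. copy_disks and its arguments are assumed names for illustration; the real function's structure may differ:

```shell
# Illustrative sketch of the per-disk copy in backup_stopped_vm();
# copy_disks and its parameters are assumed names, not the script's exact code.
copy_disks() {
    local dest="$1"; shift
    local disk
    for disk in "$@"; do
        if ! qemu-img convert -O qcow2 "$disk" "$dest/$(basename "$disk").qcow2"; then
            echo "qemu-img convert failed for $disk"
            cleanup
            return 1   # the missing exit: without it the loop moved on to the next disk
        fi
    done
    return 0
}
```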
Impact
- Production outage: All services on the affected VM become unresponsive
- Cascading failures: When backup storage fills up, ALL VMs being backed up get paused simultaneously
- Silent failure: CloudStack UI shows the VM as "Running" while it is actually paused at the KVM level
- No automatic recovery: Manual intervention (virsh resume) is required per VM
In our environment, NFS backup storage filling to 100% caused 8 production VMs to become paused simultaneously across 3 KVM hosts, with some VMs remaining paused for over 6 hours before detection.
Environment
- CloudStack 4.19/4.20/main (code is unchanged across versions)
- KVM hypervisor
- NAS backup plugin with NFS storage
- File: scripts/vm/hypervisor/kvm/nasbackup.sh
Proposed Fix
PR forthcoming with the following changes:
- Add a VM state check and virsh resume to the cleanup() function
- Add the missing exit 1 after cleanup() in the Failed backup job case
- Add the missing exit 1 after cleanup() in backup_stopped_vm() on qemu-img convert failure