Fix snapshot chaining on Xen #12597

Merged
sureshanaparti merged 4 commits into apache:4.22 from scclouds:fix-snapshot-chain-on-xen
Apr 10, 2026

Conversation

@JoaoJandre
Contributor

Description

This PR fixes #12524.

After the introduction of the Hidden state, the snapshot chain calculation no longer works as expected for XenServer as it does not consider hidden snapshots as part of the chain, possibly leading to unending chains. This PR fixes this issue by adding the hidden snapshots to the chain calculation.

This PR also fixes a regression introduced in commit d700e2d which made snapshot deletion impossible.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

In an environment using XCP-ng, I set snapshot.delta.max to 3 and followed the steps below:

  1. Create 3 volume snapshots
  2. Delete the oldest volume snapshot
  3. Create a new volume snapshot

Before the changes, the new snapshot would be part of the old chain. With the changes, the last snapshot is a full snapshot that is not part of the old chain.

@sureshanaparti
Contributor

@blueorangutan package

@codecov

codecov Bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 8.33333% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.61%. Comparing base (3b42fbf) to head (ba600d1).
⚠️ Report is 30 commits behind head on 4.22.

Files with missing lines Patch % Lines
...storage/datastore/db/SnapshotDataStoreDaoImpl.java 0.00% 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               4.22   #12597   +/-   ##
=========================================
  Coverage     17.60%   17.61%           
- Complexity    15660    15661    +1     
=========================================
  Files          5917     5917           
  Lines        531415   531426   +11     
  Branches      64973    64973           
=========================================
+ Hits          93566    93585   +19     
+ Misses       427294   427283   -11     
- Partials      10555    10558    +3     
Flag Coverage Δ
uitests 3.70% <ø> (ø)
unittests 18.68% <8.33%> (+<0.01%) ⬆️

@sureshanaparti
Contributor

@blueorangutan package

@blueorangutan

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes issue #12524 where the snapshot.delta.max configuration was not being respected on XenServer after the introduction of the Hidden state. The fix ensures that hidden snapshots are properly considered when calculating snapshot chain length, preventing unending snapshot chains. It also resolves a regression in snapshot deletion by using the correct NOTIN operator instead of NEQ when excluding multiple states.

Changes:

  • Modified snapshot chain calculation to include Hidden state snapshots alongside Ready state snapshots
  • Added new DAO method to query snapshots by multiple states using the IN operator
  • Fixed snapshot deletion regression by properly using NOTIN operator for multiple state exclusion
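The difference between excluding one state and excluding several can be illustrated with a minimal, self-contained sketch. This is plain Java over an in-memory list, not the CloudStack search-builder API; the class `StateFilterSketch`, its `State` enum, and the three helper methods are hypothetical names chosen for illustration only.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names come from the CloudStack code base.
final class StateFilterSketch {
    enum State { READY, HIDDEN, DESTROYED, ERROR }

    // Single-value inequality, analogous to SQL `state <> ?`: excludes exactly one state.
    static List<State> notEqual(List<State> rows, State excluded) {
        return rows.stream().filter(s -> s != excluded).collect(Collectors.toList());
    }

    // Set exclusion, analogous to SQL `state NOT IN (?, ?)`: excludes every listed state.
    static List<State> notIn(List<State> rows, Set<State> excluded) {
        return rows.stream().filter(s -> !excluded.contains(s)).collect(Collectors.toList());
    }

    // Set inclusion, analogous to SQL `state IN (?, ?)`: keeps only the listed states.
    static List<State> inStates(List<State> rows, Set<State> included) {
        return rows.stream().filter(included::contains).collect(Collectors.toList());
    }
}
```

An inequality test can only rule out one value, so a query that must skip several states needs the set-based form; the same reasoning applies in reverse when a query must accept both Ready and Hidden.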

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| DefaultSnapshotStrategy.java | Updated getSnapshotImageStoreRef to include Hidden snapshots in chain calculation by using the new multi-state query method |
| SnapshotDataStoreDaoImpl.java | Added search builders for NOTIN and IN operations, implemented new multi-state query method, fixed snapshot deletion query |
| SnapshotDataStoreDao.java | Added interface method signature for querying snapshots by multiple states |

@blueorangutan

Packaging result [SF]: ✖️ el8 ✖️ el9 ✖️ debian ✖️ suse15. SL-JID 16711

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16712

@DaanHoogland
Contributor

@blueorangutan test ol9 xcpng83

@blueorangutan

@DaanHoogland a [SL] Trillian-Jenkins test job (ol9 mgmt + xcpng83) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-15385)
Environment: xcpng83 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 67313 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12597-t15385-xcpng83.zip
Smoke tests completed. 137 look OK, 12 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 147.58 test_events_resource.py
test_list_system_vms_metrics_history Failure 0.21 test_metrics_api.py
test_list_vms_metrics_history Failure 185.20 test_metrics_api.py
test_01_non_strict_host_anti_affinity Error 187.25 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 49.96 test_nonstrict_affinity_group.py
test_01_add_primary_storage_disabled_host Error 15.41 test_primary_storage.py
test_01_primary_storage_iscsi Error 0.16 test_primary_storage.py
test_01_primary_storage_nfs Error 0.16 test_primary_storage.py
ContextSuite context=TestStorageTags>:setup Error 0.26 test_primary_storage.py
test_02_list_snapshots_with_removed_data_store Error 12.59 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 12.59 test_snapshots.py
test_01_scale_up_verify Failure 436.03 test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup Failure 253.78 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Failure 376.08 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Error 376.08 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Failure 383.82 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Error 383.84 test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown Error 422.26 test_vm_autoscaling.py
test_01_deploy_vm_on_specific_host Error 0.09 test_vm_deployment_planner.py
test_04_deploy_vm_on_host_override_pod_and_cluster Error 0.13 test_vm_deployment_planner.py
test_08_migrate_vm Error 0.07 test_vm_life_cycle.py
test_11_destroy_vm_and_volumes Error 17.91 test_vm_life_cycle.py
test_12_start_vm_multiple_volumes_allocated Error 56.36 test_vm_life_cycle.py
test_13_destroy_and_expunge_vm Error 4.63 test_vm_life_cycle.py
test_01_migrate_vm_strict_tags_success Error 0.21 test_vm_strict_host_tags.py
test_02_migrate_vm_strict_tags_failure Error 0.20 test_vm_strict_host_tags.py
test_01_restore_vm_strict_tags_success Error 0.18 test_vm_strict_host_tags.py
test_02_restore_vm_strict_tags_failure Error 0.24 test_vm_strict_host_tags.py
test_01_scale_vm_strict_tags_success Error 0.25 test_vm_strict_host_tags.py
test_02_scale_vm_strict_tags_failure Error 0.30 test_vm_strict_host_tags.py
test_01_deploy_vm_on_specific_host_without_strict_tags Error 0.20 test_vm_strict_host_tags.py
test_02_deploy_vm_on_any_host_without_strict_tags Error 2.51 test_vm_strict_host_tags.py
test_03_deploy_vm_on_specific_host_with_strict_tags_success Error 0.25 test_vm_strict_host_tags.py
test_04_deploy_vm_on_any_host_with_strict_tags_success Error 5.78 test_vm_strict_host_tags.py
test_05_deploy_vm_on_specific_host_with_strict_tags_failure Failure 0.21 test_vm_strict_host_tags.py
test_01_verify_ipv6_vpc Error 158.66 test_vpc_ipv6.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 357.76 test_vpc_redundant.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Error 357.79 test_vpc_redundant.py
test_02_redundant_VPC_default_routes Failure 232.72 test_vpc_redundant.py
test_02_redundant_VPC_default_routes Error 232.74 test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Failure 238.72 test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Error 238.74 test_vpc_redundant.py
test_05_rvpc_multi_tiers Failure 236.99 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 237.01 test_vpc_redundant.py
test_01_redundant_vpc_site2site_vpn Failure 187.57 test_vpc_vpn.py

@JoaoJandre
Contributor Author

@DaanHoogland can we rerun the tests?

@JoaoJandre
Contributor Author

@blueorangutan package

@blueorangutan

@JoaoJandre a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 16882

@sureshanaparti
Contributor

@blueorangutan test ol9 xcpng83

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol9 mgmt + xcpng83) has been kicked to run smoke tests

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17138

@nvazquez
Contributor

@blueorangutan test ol9 xcpng83

@blueorangutan

@nvazquez a [SL] Trillian-Jenkins test job (ol9 mgmt + xcpng83) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-15652)
Environment: xcpng83 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 62267 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12597-t15652-xcpng83.zip
Smoke tests completed. 140 look OK, 9 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 135.40 test_events_resource.py
test_07_arping_in_vr Failure 5.22 test_diagnostics.py
test_list_system_vms_metrics_history Failure 0.26 test_metrics_api.py
test_list_vms_metrics_history Failure 144.37 test_metrics_api.py
test_01_non_strict_host_anti_affinity Error 162.76 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 74.30 test_nonstrict_affinity_group.py
test_02_list_snapshots_with_removed_data_store Error 9.67 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 9.68 test_snapshots.py
test_01_scale_up_verify Failure 466.41 test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup Failure 258.03 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Failure 355.36 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Error 355.37 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Failure 361.94 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Error 361.95 test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown Error 373.14 test_vm_autoscaling.py
test_11_destroy_vm_and_volumes Error 21.01 test_vm_life_cycle.py
test_01_migrate_vm_strict_tags_success Error 120.90 test_vm_strict_host_tags.py
test_01_vpc_site2site_vpn_multiple_options Failure 668.97 test_vpc_vpn.py

@JoaoJandre
Contributor Author

@nvazquez I looked at the errors reported by Blue Orangutan; none of them are related to this PR.

@nvazquez
Contributor

Thanks @JoaoJandre, let me kick off one more round of packaging and tests.

@blueorangutan package

@blueorangutan

@nvazquez a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17188

@nvazquez
Contributor

@blueorangutan test ol9 xcpng83

@blueorangutan

@nvazquez a [SL] Trillian-Jenkins test job (ol9 mgmt + xcpng83) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-15689)
Environment: xcpng83 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 69299 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12597-t15689-xcpng83.zip
Smoke tests completed. 141 look OK, 8 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 158.17 test_events_resource.py
test_list_system_vms_metrics_history Failure 0.21 test_metrics_api.py
test_list_vms_metrics_history Failure 162.74 test_metrics_api.py
test_01_non_strict_host_anti_affinity Error 160.26 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 62.62 test_nonstrict_affinity_group.py
test_02_list_snapshots_with_removed_data_store Error 14.70 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 14.70 test_snapshots.py
test_01_vpn_usage Error 1.10 test_usage.py
test_01_scale_up_verify Failure 466.28 test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup Failure 253.79 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Failure 363.09 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Error 363.09 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Failure 391.30 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Error 391.31 test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown Error 429.87 test_vm_autoscaling.py
test_11_destroy_vm_and_volumes Error 20.91 test_vm_life_cycle.py
test_01_migrate_vm_strict_tags_success Error 70.20 test_vm_strict_host_tags.py

Contributor

@vladimirpetrov vladimirpetrov left a comment

LGTM based on manual testing.

Test case:

Pre-conditions:

  • CloudStack 4.22.1+ with PR #12597 applied
  • snapshot.delta.max = 3
  • XenServer/XCP-ng hypervisor configured
  • Volume attached to a VM or available

Test Steps:

  1. Create snapshot S1 on a volume
  2. Verify S1 is a full snapshot (check snapshot_store_ref table)
  3. Create snapshot S2
  4. Verify S2 is an incremental snapshot linked to S1
  5. Create snapshot S3
  6. Verify S3 is an incremental snapshot linked to S2
  7. Delete snapshot S1
  8. Verify S1 state changes to Hidden in snapshot_store_ref table
  9. Create snapshot S4
  10. Query the database to check the snapshot chain length
  11. Verify that S4 is a full snapshot (new chain started)

Expected Results:

  • S1 is created as a full snapshot
  • S2 and S3 are incremental snapshots
  • After deleting S1, it transitions to Hidden state
  • S4 should be a full snapshot because the chain length (S1 in Hidden state, S2, S3) is 3, which equals snapshot.delta.max
  • No snapshots should remain in Ready state if older than retention policy
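The test case above can be modelled with a small, self-contained sketch. It is a simplified in-memory model, not the CloudStack implementation: the class `SnapshotChainSketch`, its `Ref` type, and the `create`, `delete`, and `s4IsFull` helpers are hypothetical, and it assumes that deleting a snapshot with children only marks it Hidden and that a new snapshot is full once the counted chain length reaches snapshot.delta.max.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the scenario; names are not taken from CloudStack.
final class SnapshotChainSketch {
    enum State { READY, HIDDEN }

    static final class Ref {
        final String name;
        final Ref parent;          // null means this is a full snapshot
        State state = State.READY;
        Ref(String name, Ref parent) { this.name = name; this.parent = parent; }
        boolean isFull() { return parent == null; }
    }

    private final int deltaMax;             // stands in for snapshot.delta.max
    private final Set<State> countedStates; // states treated as part of the chain
    private Ref latest;

    SnapshotChainSketch(int deltaMax, Set<State> countedStates) {
        this.deltaMax = deltaMax;
        this.countedStates = countedStates;
    }

    // Walks parent links from ref and stops at the first ref whose state is not counted.
    int chainLength(Ref ref) {
        int length = 0;
        for (Ref r = ref; r != null && countedStates.contains(r.state); r = r.parent) {
            length++;
        }
        return length;
    }

    // Starts a new chain when there is no parent or the counted chain reached deltaMax.
    Ref create(String name) {
        boolean full = latest == null || chainLength(latest) >= deltaMax;
        latest = new Ref(name, full ? null : latest);
        return latest;
    }

    // Assumption: deletion hides the ref instead of removing it (step 8 above).
    void delete(Ref ref) { ref.state = State.HIDDEN; }

    // Runs steps 1 to 9 of the test case and reports whether S4 is a full snapshot.
    static boolean s4IsFull(Set<State> countedStates) {
        SnapshotChainSketch chain = new SnapshotChainSketch(3, countedStates);
        Ref s1 = chain.create("S1");
        chain.create("S2");
        chain.create("S3");
        chain.delete(s1);
        return chain.create("S4").isFull();
    }

    public static void main(String[] args) {
        System.out.println("Ready only:     S4 full = " + s4IsFull(EnumSet.of(State.READY)));
        System.out.println("Ready + Hidden: S4 full = " + s4IsFull(EnumSet.of(State.READY, State.HIDDEN)));
    }
}
```

In this model, counting only Ready refs stops the walk at the hidden S1, so the chain looks two long and S4 is attached to the old chain; counting Ready and Hidden refs gives a length of 3 and S4 starts a new chain, which matches the expected results listed above.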

@vladimirpetrov
Contributor

@blueorangutan test ol9 xcpng83

@blueorangutan

@vladimirpetrov a [SL] Trillian-Jenkins test job (ol9 mgmt + xcpng83) has been kicked to run smoke tests

@JoaoJandre
Contributor Author

Thanks for testing @vladimirpetrov

@blueorangutan

[SF] Trillian test result (tid-15807)
Environment: xcpng83 (x2), zone: Advanced Networking with Mgmt server ol9
Total time taken: 63218 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12597-t15807-xcpng83.zip
Smoke tests completed. 133 look OK, 9 have errors, 7 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 166.25 test_events_resource.py
test_list_system_vms_metrics_history Failure 0.22 test_metrics_api.py
test_list_vms_metrics_history Failure 157.98 test_metrics_api.py
test_01_non_strict_host_anti_affinity Error 154.24 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 66.05 test_nonstrict_affinity_group.py
test_06_disk_offering_strictness_false Failure 629.82 test_service_offerings.py
test_02_list_snapshots_with_removed_data_store Error 12.67 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 12.67 test_snapshots.py
test_01_vpn_usage Error 1.11 test_usage.py
test_01_scale_up_verify Failure 516.62 test_vm_autoscaling.py
test_02_update_vmprofile_and_vmgroup Failure 258.07 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Failure 425.57 test_vm_autoscaling.py
test_06_autoscaling_vmgroup_on_project_network Error 425.58 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Failure 413.02 test_vm_autoscaling.py
test_07_autoscaling_vmgroup_on_vpc_network Error 413.03 test_vm_autoscaling.py
ContextSuite context=TestVmAutoScaling>:teardown Error 459.00 test_vm_autoscaling.py
test_11_destroy_vm_and_volumes Error 25.04 test_vm_life_cycle.py
test_01_migrate_vm_strict_tags_success Error 61.32 test_vm_strict_host_tags.py
all_test_vpc_redundant Skipped --- test_vpc_redundant.py
all_test_vpc_router_nics Skipped --- test_vpc_router_nics.py
all_test_vpc_vpn Skipped --- test_vpc_vpn.py
all_test_webhook_delivery Skipped --- test_webhook_delivery.py
all_test_webhook_lifecycle Skipped --- test_webhook_lifecycle.py
all_test_host_maintenance Skipped --- test_host_maintenance.py
all_test_hostha_kvm Skipped --- test_hostha_kvm.py

@sureshanaparti
Contributor

@blueorangutan test ol8 xcpng83

@blueorangutan

@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + xcpng83) has been kicked to run smoke tests

@blueorangutan

[SF] Trillian test result (tid-15832)
Environment: xcpng83 (x2), zone: Advanced Networking with Mgmt server ol8
Total time taken: 57258 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr12597-t15832-xcpng83.zip
Smoke tests completed. 142 look OK, 7 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_events_resource Error 151.30 test_events_resource.py
test_01_non_strict_host_anti_affinity Error 144.18 test_nonstrict_affinity_group.py
test_02_non_strict_host_affinity Error 44.24 test_nonstrict_affinity_group.py
test_02_list_snapshots_with_removed_data_store Error 10.73 test_snapshots.py
test_02_list_snapshots_with_removed_data_store Error 10.74 test_snapshots.py
test_02_list_cpvm_vm Failure 0.04 test_ssvm.py
test_04_cpvm_internals Failure 0.05 test_ssvm.py
test_01_vpn_usage Error 0.04 test_usage.py
test_11_destroy_vm_and_volumes Error 13.92 test_vm_life_cycle.py
test_01_migrate_vm_strict_tags_success Error 45.05 test_vm_strict_host_tags.py

@sureshanaparti sureshanaparti merged commit 2a60305 into apache:4.22 Apr 10, 2026
24 of 26 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Apache CloudStack 4.22.1 Apr 10, 2026
dhslove added a commit to dhslove/ablestack-cloud that referenced this pull request Apr 21, 2026
Source local main commit:
- bb635f652f snapshot: preserve Xen snapshot chaining through hidden refs

Source Apache commits:
- 2a60305 Fix snapshot chaining on Xen (apache#12597)

Change summary:
- add a DAO method that lists snapshot-store refs by snapshot id, role, and a set of states
- update DefaultSnapshotStrategy.getSnapshotImageStoreRef(...) to consider both Ready and Hidden image-store refs
- align the unit test with the new DAO method and remove redundant null-path stubbing
- record Record 040 and mark 8608b4e as already satisfied in the history document

Functional impact:
- preserves Xen incremental snapshot chain lookup even when a parent snapshot is hidden on secondary storage
- reduces the chance of losing the expected parent chain and falling back to an incorrect full backup path
- keeps zone-scoped image-store lookup while widening acceptable persisted states

Validation:
- cherry-pick from main applied cleanly on ablestack-europa with no additional manual conflict resolution
- mvn/mvnw-based tests not run in this environment by request

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

After upgrade to 4.22.0.0 snapshot.delta.max not respected

8 participants