fix(instancegroup): Ensure cleanup of partially created instances (!35) · Merge requests · CCL Consulting / fleeting-plugin-scaleway

Problem

When server creation failed at any intermediate step (e.g., during power-on, network attachment, or volume configuration), partially created resources (IPs, volumes, servers) remained orphaned in the Scaleway infrastructure, leading to resource leaks and unnecessary costs.

Solution

Implemented transactional instance creation with automatic rollback:

- Added a deferred cleanup mechanism that triggers if `Create()` fails at any step
- Ensures all partially created resources are immediately cleaned up, preventing orphaned resources
- Instance ID is now populated immediately after server creation (before potential failures) to enable proper rollback

Key Changes

1. Transactional Creation Pattern (`handler_server.go`)

- Added a rollback flag with deferred cleanup in the `Create()` function
- Cleanup is automatically triggered on any creation failure
- Instance context is populated early to enable rollback even if later steps fail
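
For reviewers unfamiliar with the pattern, here is a minimal sketch of a rollback flag combined with a deferred cleanup. It is not the code from `handler_server.go`: `fakeAPI`, `allocate`, `release`, `powerOn`, and `create` are hypothetical names that stand in for the real Scaleway calls, and the in-memory map only simulates remote resources.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeAPI is a hypothetical in-memory stand-in for the cloud API.
type fakeAPI struct {
	resources   map[string]bool
	failPowerOn bool
}

// allocate registers a resource and returns its ID.
func (a *fakeAPI) allocate(kind, name string) string {
	id := kind + "-" + name
	a.resources[id] = true
	return id
}

// release removes a resource; releasing a missing one is a no-op.
func (a *fakeAPI) release(id string) { delete(a.resources, id) }

// powerOn fails on demand to simulate a startup error.
func (a *fakeAPI) powerOn(serverID string) error {
	if a.failPowerOn {
		return errors.New("500 Internal Server Error")
	}
	return nil
}

type instance struct{ ID string }

// create provisions an IP and a server, then powers the server on.
// The deferred function releases everything unless rollback is cleared.
func create(api *fakeAPI, name string) (*instance, error) {
	inst := &instance{}
	var created []string
	rollback := true
	defer func() {
		if !rollback {
			return
		}
		// Release in reverse order of creation.
		for i := len(created) - 1; i >= 0; i-- {
			api.release(created[i])
		}
	}()

	created = append(created, api.allocate("ip", name))

	// The ID is recorded before any step that can still fail.
	inst.ID = api.allocate("srv", name)
	created = append(created, inst.ID)

	if err := api.powerOn(inst.ID); err != nil {
		return inst, fmt.Errorf("power on %s: %w", inst.ID, err)
	}

	rollback = false // success: keep the resources
	return inst, nil
}

func main() {
	api := &fakeAPI{resources: map[string]bool{}, failPowerOn: true}
	inst, err := create(api, "runner-1")
	fmt.Println("id:", inst.ID, "err:", err, "remaining:", len(api.resources))
}
```

Clearing the flag only on the final success path means any early return, including ones added later, triggers the cleanup without extra bookkeeping.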

2. Retry Logic with Jitter

- New `retryOnTransientError()` function handles transient API errors (ResourceLocked 423, TransientState 409)
- Adds random jitter (0-500ms) to retry delays to prevent a thundering herd
- Applied to critical operations such as power-off and volume operations

3. Enhanced Cleanup Robustness (`Cleanup()`)

- Made idempotent and safe to call multiple times
- Handles missing resources (404) gracefully
- Discovers resources both by direct lookup and tag-based search
- Comprehensive error collection and reporting
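
A minimal sketch of these cleanup properties, assuming Go 1.20+ for `errors.Join`. The `fakeStore` type, the `errNotFound` sentinel, and the `cleanup` signature are hypothetical; the real `Cleanup()` talks to the Scaleway API and may structure its lookups differently.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// errNotFound is a hypothetical sentinel standing in for a 404 response.
var errNotFound = errors.New("404 not found")

// fakeStore is a hypothetical in-memory stand-in for remote resources.
type fakeStore struct {
	tags   map[string]string // resource ID -> tag
	broken map[string]bool   // IDs whose deletion always fails
}

// findByTag returns the IDs of all resources carrying tag, sorted.
func (s *fakeStore) findByTag(tag string) []string {
	var ids []string
	for id, t := range s.tags {
		if t == tag {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

// delete removes a resource, returning errNotFound if it does not exist.
func (s *fakeStore) delete(id string) error {
	if _, ok := s.tags[id]; !ok {
		return errNotFound
	}
	if s.broken[id] {
		return errors.New("500 internal server error")
	}
	delete(s.tags, id)
	return nil
}

// cleanup deletes resources found by direct ID and by tag search.
// Missing resources are ignored, and all other failures are collected
// so one failed deletion does not stop the remaining ones.
func cleanup(s *fakeStore, knownIDs []string, tag string) error {
	candidates := append(append([]string{}, knownIDs...), s.findByTag(tag)...)
	seen := map[string]bool{}
	var errs []error
	for _, id := range candidates {
		if seen[id] {
			continue
		}
		seen[id] = true
		if err := s.delete(id); err != nil && !errors.Is(err, errNotFound) {
			errs = append(errs, fmt.Errorf("delete %s: %w", id, err))
		}
	}
	return errors.Join(errs...) // nil when nothing failed
}

func main() {
	s := &fakeStore{
		tags:   map[string]string{"srv-1": "grp", "ip-1": "grp", "vol-1": "grp"},
		broken: map[string]bool{},
	}
	fmt.Println("first run:", cleanup(s, []string{"srv-1"}, "grp"))
	fmt.Println("second run:", cleanup(s, []string{"srv-1"}, "grp"))
}
```

Treating a 404 as success is what makes a second call safe: a resource that is already gone needs no further action.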

4. Comprehensive Testing

- New test: `TestServerHandlerCreate/error_during_server_startup_power-on`
  - Simulates power-on failure (500 Internal Server Error)
  - Verifies automatic rollback cleans up all resources (IPs, volumes, server)
  - Confirms instance ID was populated before failure
- Fixed test: `TestDecrease/success`
  - Reordered mock API expectations to match actual implementation
  - Cleanup now gets server first, then proceeds with resource deletion

Impact

- Prevents resource leaks: No more orphaned IPs, volumes, or servers
- Cost reduction: Failed instances don't leave billable resources behind
- Improved reliability: Transactional semantics ensure clean failure recovery
- Better observability: Clear logging of rollback operations

Testing

All tests pass with `make test`:

✅ New startup failure test validates rollback mechanism
✅ Existing cleanup tests updated and passing
✅ All integration tests passing

Replaces !34 (closed)

Edited Nov 06, 2025 by zadkiel
