fix(instancegroup): Ensure cleanup of partially created instances

Problem

When server creation failed at any intermediate step (e.g., during power-on, network attachment, or volume configuration), partially created resources (IPs, volumes, servers) remained orphaned in the Scaleway infrastructure, leading to resource leaks and unnecessary costs.

Solution

Implemented transactional instance creation with automatic rollback:

  • Added a deferred cleanup mechanism that triggers if Create() fails at any step
  • Ensures all partially created resources are immediately cleaned up, preventing orphaned resources
  • Instance ID is now populated immediately after server creation (before potential failures) to enable proper rollback

Key Changes

1. Transactional Creation Pattern (handler_server.go)

  • Added a rollback flag and a deferred cleanup to Create()
  • Cleanup is automatically triggered on any creation failure
  • Instance context is populated early to enable rollback even if later steps fail
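
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from handler_server.go: the Instance type, the fakeAPI stand-in, and its createServer/powerOn/cleanup methods are all hypothetical, and errors.Join assumes Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a hypothetical stand-in for the handler's instance context.
type Instance struct {
	ID string
}

// fakeAPI is a hypothetical in-memory stand-in for the cloud API.
type fakeAPI struct {
	servers     map[string]bool
	failPowerOn bool
}

func (a *fakeAPI) createServer() (string, error) {
	id := fmt.Sprintf("srv-%d", len(a.servers)+1)
	a.servers[id] = true
	return id, nil
}

func (a *fakeAPI) powerOn(id string) error {
	if a.failPowerOn {
		return errors.New("500 Internal Server Error")
	}
	return nil
}

func (a *fakeAPI) cleanup(id string) error {
	delete(a.servers, id) // deleting a missing key is a no-op
	return nil
}

// Create provisions a server and rolls back on any failure after creation.
func Create(api *fakeAPI, inst *Instance) (err error) {
	rollback := true
	defer func() {
		// Skip rollback on success, or when nothing was created yet.
		if !rollback || inst.ID == "" {
			return
		}
		if cerr := api.cleanup(inst.ID); cerr != nil {
			err = errors.Join(err, fmt.Errorf("rollback: %w", cerr))
		}
	}()

	id, err := api.createServer()
	if err != nil {
		return fmt.Errorf("create server: %w", err)
	}
	inst.ID = id // populate early so the deferred rollback can find the server

	if err := api.powerOn(id); err != nil {
		return fmt.Errorf("power on: %w", err)
	}

	rollback = false // every step succeeded, keep the resources
	return nil
}

func main() {
	api := &fakeAPI{servers: map[string]bool{}, failPowerOn: true}
	inst := &Instance{}
	err := Create(api, inst)
	fmt.Println("error:", err)
	fmt.Println("instance ID:", inst.ID, "remaining servers:", len(api.servers))
}
```

Using a flag that is cleared only on the success path means every early return triggers the rollback without repeating cleanup calls at each failure site.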

2. Retry Logic with Jitter

  • New retryOnTransientError() function handles transient API errors (ResourceLocked 423, TransientState 409)
  • Adds random jitter (0-500ms) to retry delays to prevent thundering herd
  • Applied to critical operations like power-off and volume operations
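
As a rough sketch of the retry behaviour, assuming a hypothetical apiError type that carries the HTTP status: the signature of retryOnTransientError shown here (attempt count, base delay, operation) and the maxJitter variable are assumptions for illustration and may differ from the merged code.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// apiError is a hypothetical error type carrying an HTTP status code.
type apiError struct{ status int }

func (e *apiError) Error() string { return fmt.Sprintf("api error: status %d", e.status) }

// maxJitter bounds the random delay added to each retry (0-500ms).
var maxJitter = 500 * time.Millisecond

// isTransient reports whether err is a 423 (locked) or 409 (conflict) response.
func isTransient(err error) bool {
	var ae *apiError
	if !errors.As(err, &ae) {
		return false
	}
	return ae.status == http.StatusLocked || ae.status == http.StatusConflict
}

// retryOnTransientError runs op up to attempts times, sleeping base plus a
// random jitter between tries. Non-transient errors are returned immediately.
func retryOnTransientError(attempts int, base time.Duration, op func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			jitter := time.Duration(rand.Int63n(int64(maxJitter) + 1))
			time.Sleep(base + jitter)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryOnTransientError(3, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return &apiError{status: http.StatusLocked}
		}
		return nil
	})
	fmt.Println("error:", err, "calls:", calls)
}
```

The jitter spreads out retries from instances that hit the same locked resource at the same moment, so they do not all retry in lockstep.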

3. Enhanced Cleanup Robustness (Cleanup())

  • Made idempotent and safe to call multiple times
  • Handles missing resources (404) gracefully
  • Discovers resources both by direct lookup and tag-based search
  • Collects errors from the individual cleanup steps and reports them together
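
A minimal sketch of how such a cleanup might tolerate missing resources while still collecting real failures. The store type, errNotFound sentinel, and this Cleanup signature are hypothetical; the actual Cleanup() works against the Scaleway API and also discovers resources by tag, which is not shown here. errors.Join assumes Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical sentinel standing in for a 404 response.
var errNotFound = errors.New("404 not found")

// store is a hypothetical in-memory stand-in for the cloud API.
type store struct {
	resources map[string]bool
	broken    map[string]bool // IDs whose deletion always fails
}

func (s *store) delete(id string) error {
	if s.broken[id] {
		return fmt.Errorf("delete %s: 500 internal server error", id)
	}
	if !s.resources[id] {
		return fmt.Errorf("delete %s: %w", id, errNotFound)
	}
	delete(s.resources, id)
	return nil
}

// Cleanup deletes every listed resource. Missing resources are treated as
// already cleaned up; other errors are collected and do not stop the loop.
func Cleanup(s *store, ids []string) error {
	var errs []error
	for _, id := range ids {
		if err := s.delete(id); err != nil && !errors.Is(err, errNotFound) {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when no errors were collected
}

func main() {
	s := &store{resources: map[string]bool{"ip-1": true, "vol-1": true, "srv-1": true}}
	ids := []string{"ip-1", "vol-1", "srv-1"}
	fmt.Println("first call:", Cleanup(s, ids))
	fmt.Println("second call:", Cleanup(s, ids))
}
```

Continuing past a failed deletion means one stuck resource does not prevent the rest from being released.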

4. Comprehensive Testing

  • New test: TestServerHandlerCreate/error_during_server_startup_power-on

    • Simulates power-on failure (500 Internal Server Error)
    • Verifies automatic rollback cleans up all resources (IPs, volumes, server)
    • Confirms instance ID was populated before failure
  • Fixed test: TestDecrease/success

    • Reordered mock API expectations to match actual implementation
    • Cleanup now gets server first, then proceeds with resource deletion

Impact

  • Prevents resource leaks: No more orphaned IPs, volumes, or servers
  • Cost reduction: Failed instances don't leave billable resources behind
  • Improved reliability: Transactional semantics ensure clean failure recovery
  • Better observability: Clear logging of rollback operations

Testing

All tests pass with make test:

  • New startup failure test validates rollback mechanism
  • Existing cleanup tests updated and passing
  • All integration tests passing

Replaces !34 (closed)

Edited by zadkiel
