
Provisioning Paradigms: Terraform, SDKs, & K8s Operators

Terraform is declarative: you define the desired end state, and the Terraform engine calculates the “delta” required to get there. It abstracts away the complexity of API sequencing.

  • Focus: What the infrastructure should look like.
  • Logic: Graph-based (DAG). Terraform determines dependencies and parallelism automatically.

SDKs are imperative: you define the exact steps to execute, directly invoking the cloud provider’s API endpoints (RPC/REST calls).

  • Focus: How to provision the infrastructure.
  • Logic: Procedural. You control the flow, retries, error handling, and concurrency (using goroutines/channels).

State management and idempotency are the most critical differentiators for system design:

  • Terraform: Maintains a tfstate file (a persistent mapping of your configuration to real-world resource IDs).

    • Drift Detection: If you change a tag manually in the AWS Console, Terraform detects this drift on the next plan and offers to fix it.
    • Resource Lifecycle: It knows that to delete a VPC, it must first delete the EC2 instances inside it, because it understands the dependency graph stored in the state.
  • SDKs: Stateless.

    • The SDK has no memory of what you ran 5 minutes ago.
    • To implement “Drift Detection,” you must write code to:
      1. Fetch the resource (Describe/Get).
      2. Compare the returned struct against your desired config.
      3. Decide whether to Update or Create.
  • Terraform: Idempotent by default. Running terraform apply 100 times results in the same state as running it once (assuming no external changes).
  • SDKs: Not idempotent by default. If you run a CreateInstance call twice, you will get two instances (or an error if naming conflicts exist). You must implement idempotency logic (e.g., “Check-if-exists-before-create”) manually.
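
Implementing that manually with the AWS SDK for Go v2 looks roughly like the sketch below. HeadBucket, CreateBucket, and types.NotFound are real SDK identifiers; the ensureBucket helper and its package are illustrative.

package provisioning // illustrative package name

import (
    "context"
    "errors"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// ensureBucket sketches "check-if-exists-before-create": read the current
// state first, and only mutate when the bucket is genuinely absent.
func ensureBucket(ctx context.Context, client *s3.Client, name string) error {
    // 1. Fetch the resource (Describe/Get).
    _, err := client.HeadBucket(ctx, &s3.HeadBucketInput{Bucket: aws.String(name)})
    if err == nil {
        // 2. It already exists: compare the returned config against the
        //    desired config here and decide whether an Update is needed.
        return nil
    }
    var notFound *types.NotFound
    if !errors.As(err, &notFound) {
        // A different failure (permissions, throttling): surface it.
        return fmt.Errorf("checking bucket %q: %w", name, err)
    }
    // 3. It does not exist: safe to create.
    _, err = client.CreateBucket(ctx, &s3.CreateBucketInput{Bucket: aws.String(name)})
    return err
}

The same fetch-compare-mutate shape is what you extend for drift detection: diff the fields you care about and issue the minimal update.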

To illustrate the difference in verbosity and error-handling burden:

Terraform (HCL): The provider handles authentication, API versioning, and state mapping.

resource "aws_s3_bucket" "example" {
bucket = "my-test-bucket"
tags = {
Environment = "Dev"
}
}

Go SDK: You manage the context, inputs, pointers, and explicit error propagation.

package main

import (
    "context"
    "log"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/s3"
    "github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func main() {
    ctx := context.TODO()

    cfg, err := config.LoadDefaultConfig(ctx)
    if err != nil {
        log.Fatalf("unable to load SDK config, %v", err)
    }
    client := s3.NewFromConfig(cfg)

    input := &s3.CreateBucketInput{
        Bucket: aws.String("my-test-bucket"),
        CreateBucketConfiguration: &types.CreateBucketConfiguration{
            LocationConstraint: types.BucketLocationConstraintUsWest2,
        },
    }

    // You must manually handle the case where the bucket already exists.
    _, err = client.CreateBucket(ctx, input)
    if err != nil {
        // Custom logic to detect "BucketAlreadyExists" or "BucketAlreadyOwnedByYou".
        log.Printf("failed to create bucket: %v", err)
    }
}
| Feature | Terraform | Cloud SDK (Go/Python/etc.) |
| --- | --- | --- |
| Primary Use Case | Infrastructure provisioning & lifecycle management. | Application logic, dynamic tasks, platform building. |
| State | Managed via state file (.tfstate). | You must build your own state store. |
| Cleanup | terraform destroy removes everything in reverse dependency order. | You must write a script to find and delete resources. |
| Wait Logic | Built-in (e.g., waits for an EC2 IP to be assigned). | Manual (you must poll DescribeInstances until ready). |
| Locking | State locking prevents concurrent modifications. | No built-in locking mechanism. |
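
To make the “Wait Logic” row concrete, here is a rough sketch of the polling an SDK caller has to write by hand (assuming imports for context, fmt, time, and github.com/aws/aws-sdk-go-v2/service/ec2; the interval and the idea of waiting on a public IP are illustrative, DescribeInstances is the real EC2 call):

// waitForPublicIP polls DescribeInstances until the instance reports a
// public IP or the context deadline expires; Terraform's provider runs
// this kind of loop for you internally.
func waitForPublicIP(ctx context.Context, client *ec2.Client, instanceID string) (string, error) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        out, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
            InstanceIds: []string{instanceID},
        })
        if err == nil && len(out.Reservations) > 0 && len(out.Reservations[0].Instances) > 0 {
            if ip := out.Reservations[0].Instances[0].PublicIpAddress; ip != nil {
                return *ip, nil // ready
            }
        }
        select {
        case <-ctx.Done():
            return "", fmt.Errorf("waiting for instance %s: %w", instanceID, ctx.Err())
        case <-ticker.C:
            // not ready yet; keep polling
        }
    }
}

In practice the AWS SDK for Go v2 also ships waiter helpers that wrap loops like this, but you still have to know which one to call and wire it into your flow.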

Use Terraform when:

  • You are defining the static infrastructure backbone (VPCs, Databases, Kubernetes Clusters, IAM Roles).
  • You need a “Source of Truth” for your infrastructure.
  • You want to empower a team to review infrastructure changes via PRs (git diff of HCL).

Use SDKs when:

  • Building Internal Platforms: You are writing a Go service that spins up temporary environments for developers on demand.
  • Lambda/Application Code: Your application needs to upload a file to S3 or pull a message from SQS at runtime.
  • Complex Logic: You need loops, conditionals, or database lookups that are too complex for HCL (HashiCorp Configuration Language); see the sketch after this list.
  • Writing Terraform Providers: Interestingly, Terraform Providers are written in Go using these very SDKs. The provider bridges the gap, wrapping the imperative SDK calls into the declarative Terraform lifecycle.
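
As a sketch of the “Complex Logic” case, reusing the ensureBucket helper and imports from the earlier sketch; the Team type, the loop, and the naming scheme are invented for illustration:

// Team is a hypothetical record, e.g. loaded from an internal database.
type Team struct {
    Name        string
    NeedsBucket bool
}

// provisionTeamBuckets shows the loop-plus-conditional flow that is natural
// in Go but awkward to express in HCL.
func provisionTeamBuckets(ctx context.Context, client *s3.Client, teams []Team) error {
    for _, team := range teams {
        if !team.NeedsBucket {
            continue // conditional skip based on runtime data
        }
        name := fmt.Sprintf("artifacts-%s", team.Name) // hypothetical naming scheme
        if err := ensureBucket(ctx, client, name); err != nil {
            return fmt.Errorf("bucket for team %s: %w", team.Name, err)
        }
    }
    return nil
}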

A Note on the “Systems Design” Approach

  1. Terraform is the Bounded Context definition for your infrastructure. It defines the rigid structure of the system.
  2. SDKs are for the Anti-Corruption Layer or internal application logic that operates within that infrastructure.

With Kubernetes Custom Resource Definitions (CRDs) and a custom controller (an Operator), you are effectively moving the “State Engine” from a static file on your laptop (Terraform’s tfstate) into the cluster’s active, distributed database (etcd). From a systems-design perspective, this transitions your infrastructure management from Edge-Triggered (run by a human/CI) to Level-Triggered (run continuously by the cluster).

1. The Reconciliation Loop (The “Go” Logic)


In Terraform, reconciliation happens only when you invoke the binary. In a K8s Operator, reconciliation is a continuous loop.

The Operator Pattern (simplified Go logic): The K8s control plane acts as an event bus. Your Go code (the Controller) subscribes to events on your CRD.

  1. Observe: Fetch the CR (Custom Resource) from K8s API (Desired State).
  2. Observe: Fetch the real-world resource (e.g., AWS S3 Bucket) via SDK (Actual State).
  3. Diff: Compare Desired vs. Actual.
  4. Act: If different, call AWS SDK Update or Create.
  5. Re-queue: Schedule the next check (e.g., in 5 minutes) to ensure no drift occurred outside of events.
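
Sketched with controller-runtime types (ctrl.Result, ctrl.Request, and client.IgnoreNotFound are real; the Bucket CRD and the r.Cloud SDK wrapper are hypothetical), the loop looks roughly like this:

func (r *BucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Observe: desired state, i.e. the Custom Resource in the cluster.
    var bucket myappv1.Bucket
    if err := r.Get(ctx, req.NamespacedName, &bucket); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Observe: actual state in the cloud, via the SDK (hypothetical wrapper).
    actual, err := r.Cloud.DescribeBucket(ctx, bucket.Spec.Name)
    if err != nil {
        return ctrl.Result{}, err // returning an error triggers a backoff retry
    }

    // 3. Diff + 4. Act: converge actual toward desired.
    if !actual.Exists || actual.Environment != bucket.Spec.Environment {
        if err := r.Cloud.UpsertBucket(ctx, bucket.Spec); err != nil {
            return ctrl.Result{}, err
        }
    }

    // 5. Re-queue: resync periodically to catch drift that produced no event.
    return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
}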

2. Terraform vs. Custom Operators: Architectural Comparison

| Feature | Terraform | K8s Operator (CRD + Controller) |
| --- | --- | --- |
| State Storage | terraform.tfstate (JSON file). Often a source of merge conflicts or locking issues. | etcd. Highly available, distributed key-value store with built-in locking (ResourceVersions). |
| Drift Correction | Passive. Drift is only detected when a pipeline runs terraform plan. | Active (self-healing). The controller runs constantly; if someone deletes the bucket manually, the operator recreates it within seconds. |
| API Surface | HCL (HashiCorp Configuration Language). | YAML / K8s API. You can use kubectl to manage cloud resources. |
| Complexity | Low to medium. Binary execution. | High. Requires a running K8s cluster, deployment management, and RBAC handling. |
| Latency | High (CI pipeline startup time). | Low (a Go binary running inside the cluster). |

3. The Prior Art: Crossplane

It is vital to note that what you are describing (writing CRDs to manage cloud resources) is exactly what the open-source project Crossplane does.

Instead of writing a custom Go operator to manage an S3 bucket or RDS instance, Crossplane provides these CRDs and Controllers out of the box. Ideally, you would only write a Custom Operator if:

  1. You have complex, domain-specific logic that combines multiple cloud resources (e.g., a “Tenant” CRD that spins up a VPC + RDS + Namespace).
  2. You are orchestrating a legacy internal system with no existing provider.

4. Implementation in Go: The “Controller-Runtime”


Since you prioritize idiomatic Go and robust design, writing an operator typically involves the kubernetes-sigs/controller-runtime library.

While you may prefer to avoid third-party packages, controller-runtime is the de facto standard implementation of the K8s API machinery. Writing a raw controller using only client-go is possible, but it involves massive boilerplate for leader election, work queues, and cache syncing.

A domain-first approach to an Operator:

  1. Define the API (Structs):

    // MyDatabase defines the desired state
    type MyDatabaseSpec struct {
        // +kubebuilder:validation:Required
        EngineVersion string `json:"engineVersion"`
        StorageGB     int    `json:"storageGB"`
    }

    // MyDatabaseStatus defines the observed state
    type MyDatabaseStatus struct {
        Endpoint string `json:"endpoint"`
        Status   string `json:"status"` // e.g., "Ready", "Provisioning"
    }
  2. The Reconcile Function: This is where your defensive error handling comes in. You must assume the network will fail, the cloud API will rate-limit you, and the resource might be half-created.

    func (r *MyDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        logger := log.FromContext(ctx)
        logger.Info("reconciling MyDatabase", "name", req.NamespacedName)

        // 1. Fetch the CRD.
        var db myappv1.MyDatabase
        if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
            return ctrl.Result{}, client.IgnoreNotFound(err)
        }

        // 2. Logic: check the cloud provider (idempotency is key here).
        //    If the resource doesn't exist, create it via the SDK.
        //    If it exists but the spec differs, update it.

        // 3. Update the Status in K8s.
        db.Status.Status = "Ready"
        if err := r.Status().Update(ctx, &db); err != nil {
            return ctrl.Result{}, err
        }
        return ctrl.Result{}, nil
    }
The pragmatic division of labor:

  • Use Terraform for Static Infrastructure (The “Base”): The VPC, the K8s cluster itself, the networking foundation. These change rarely, and “Active Reconciliation” is overkill.
  • Use Operators for Dynamic/App-Centric Infrastructure (The “Layer”): Databases, Queues, Buckets that belong to a specific microservice. This allows developers to provision infra simply by committing a YAML file alongside their code, and the cluster ensures that infra exists and stays correct.

Terraform’s architecture is “Client-Side Reconciliation.” It only knows the truth when you explicitly wake it up. That matters most when someone changes infrastructure by hand in the console.

1. The “Refresh” Phase: How Terraform sees the Console Change


When you run terraform plan, Terraform performs a three-step dance. It doesn’t just look at your file; it queries the live API.

  1. Load State: It loads terraform.tfstate (what it thinks exists).
  2. Refresh (The API Calls): It iterates over every resource in the state file and calls the provider’s Read method (e.g., aws_s3_bucket.Read).
    • Scenario: You manually added a tag Emergency=True in the AWS Console.
    • Result: The API returns the resource with the new tag. Terraform updates its in-memory view of the world to match reality.
  3. Diff: It compares Config (HCL) vs. Refreshed State (Reality).

2. The Consequence: “The War for Control”


If your HCL does not define that tag, Terraform calculates a delta that removes it:

  ~ resource "aws_s3_bucket" "example" {
      ~ tags = {
            "Environment" = "Dev"
          - "Emergency"   = "True" -> null
        }
    }

Why this gets “out of hand”: If you have a workflow where operators must use the console (e.g., for hotfixes, capacity adjustments, or emergency toggles), Terraform becomes a destructive force.

  • The Next Deployment: When the CI pipeline runs a routine deployment next week, it will silently revert your manual hotfix, potentially causing the outage to recur.
  • State Bloat: If you create new resources in the console (e.g., a new EC2 instance), Terraform does not know they exist. It doesn’t track them. It only tracks what is in its state file. You now have “Shadow IT”—infrastructure that is unmanaged and invisible to your IaC.

3. The Solution: lifecycle Blocks (The Anti-Corruption Layer)


Since you cannot always ban console access, Terraform provides a mechanism to explicitly ignore specific types of drift. This is effectively an “Ignore Filter” in the reconciliation logic.

In your HCL, you can instruct the graph walker to ignore differences in specific fields:

resource "aws_autoscaling_group" "worker_pool" {
desired_capacity = 5
max_size = 10
min_size = 2
# The "Hybrid Workflow" Safety Valve
lifecycle {
ignore_changes = [
# Ignore manual scaling events happened in console/CloudWatch
desired_capacity,
# Ignore tags added by external systems (e.g., Cost Explorer)
tags["CostCenter"],
]
}
}

With this block:

  1. You change desired_capacity to 8 in the Console.
  2. terraform plan runs.
  3. It sees the drift (5 vs 8).
  4. It checks ignore_changes.
  5. It suppresses the diff. It says “No changes needed.”

4. Summary: Terraform vs. Active Reconciliation (K8s)


This highlights the fundamental architectural difference we discussed:

| Feature | Terraform (Client-Side) | K8s Operator (Server-Side) |
| --- | --- | --- |
| Drift Reaction | Delayed & explicit. Drift persists until the next apply. | Immediate & automatic. The controller wakes up (via a watch event) and reverts the change instantly. |
| Manual Edits | Dangerous. Risk of accidental reversion during the next deploy. | Futile. You change it in the console; the operator changes it back a second later. |
| Shadow Resources | Ignored. New resources created in the console are invisible to TF. | Ignored, unless you write a controller specifically to scan for “adoptable” resources. |

Kubernetes’ Imperative Operator + Controller Logic Behind Declarative YAML vs. Terraform’s Declarative Manifests


In Terraform, the Provider gives you the CRUD logic, and Core gives you the Lifecycle (Dependency Graph + Cleanup Order). When you write a Custom Controller in Go, you become the Platform Engineer and the Core Engine. You are responsible for implementing the logic that Terraform gives you for free.

1. Reusability: tfvars vs. Custom Resources (CRs)


You asked about using the “same structure” for Prod and Dev.

  • Terraform: You write a Module (HCL). You instantiate it twice using different input variables (prod.tfvars, dev.tfvars).
  • Operator: You write a CRD (The Schema/Struct). You instantiate it twice using different CRs (YAML files).

The Go Implementation: Your Go struct is the schema.

// The "Module" Definition (CRD)
type MyInfrastructureSpec struct {
InstanceSize string `json:"instanceSize"` // "t3.micro" for dev, "m5.large" for prod
Environment string `json:"environment"` // "dev" or "prod"
}

The Usage (YAML): Instead of terraform apply -var-file=..., you apply specific YAMLs:

# prod.yaml
kind: MyInfrastructure
metadata: { name: "prod-stack" }
spec:
  instanceSize: "m5.large"
  environment: "prod"

Verdict: This part is easy. The reusability is effectively the same. The CRD acts as the Class, the CR acts as the Instance.

2. The Hard Part: Cleanup (The “Finalizer” Pattern)


This is where your assumption is 100% correct. Terraform knows to delete resources in reverse dependency order automatically. K8s does not.

If you delete a CR (Custom Resource), K8s simply deletes the record from Etcd. It does not automatically call the AWS SDK to delete the S3 bucket associated with it. You must implement the Finalizer Pattern in Go.

The “Cleanup Protocol” (Your Go Logic):

  1. Add Finalizer: On creation (Reconcile), you must append a string (e.g., my.infra/finalizer) to the metadata of the resource. This tells K8s: “Do not hard-delete this object until I say so.”
  2. Intercept Deletion: When a user runs kubectl delete, K8s sets a deletionTimestamp on the object. It does not remove it.
  3. Execute Cleanup: Your Reconcile loop runs. It sees deletionTimestamp is not zero.
    • You initialize the AWS SDK.
    • You call s3Client.DeleteBucket.
    • Crucial Defensive Coding: You must handle cases where the bucket is already gone (idempotency) or cannot be deleted (not empty).
  4. Remove Finalizer: Only after the SDK confirms deletion do you remove the string from metadata. K8s then garbage collects the object.

If your Go code crashes or keeps failing during step 3, the resource gets “stuck” in the Terminating state until the finalizer is removed.
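
A minimal sketch of that protocol, assuming controller-runtime’s controllerutil helpers (ContainsFinalizer, AddFinalizer, RemoveFinalizer are real); the finalizer string, the Bucket type, and the deleteBucket helper are illustrative:

const bucketFinalizer = "my.infra/finalizer"

func (r *BucketReconciler) reconcileLifecycle(ctx context.Context, bucket *myappv1.Bucket) (ctrl.Result, error) {
    // 1. Not being deleted: ensure the finalizer is present so K8s will not
    //    hard-delete the object before our cleanup has run.
    if bucket.ObjectMeta.DeletionTimestamp.IsZero() {
        if !controllerutil.ContainsFinalizer(bucket, bucketFinalizer) {
            controllerutil.AddFinalizer(bucket, bucketFinalizer)
            return ctrl.Result{}, r.Update(ctx, bucket)
        }
        return ctrl.Result{}, nil
    }

    // 2./3. deletionTimestamp is set: run the external cleanup via the SDK.
    if controllerutil.ContainsFinalizer(bucket, bucketFinalizer) {
        if err := r.deleteBucket(ctx, bucket.Spec.Name); err != nil {
            // deleteBucket must be idempotent: "already gone" counts as success.
            // On a real failure we return the error and the object stays Terminating.
            return ctrl.Result{}, err
        }
        // 4. Cleanup confirmed: drop the finalizer so K8s can garbage-collect.
        controllerutil.RemoveFinalizer(bucket, bucketFinalizer)
        if err := r.Update(ctx, bucket); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}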

3. Dependency Ordering: Pre-flight DAG vs. Runtime Requeue

Terraform calculates the DAG before execution. An Operator discovers dependencies at runtime (eventual consistency).

Scenario: You need an AWS VPC created before you create an RDS Database.

Terraform: Implicitly waits. It won’t start the RDS API call until the VPC API call returns an ID.

Custom Operator: You must write “Requeue Logic.”

func (r *RDSReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Fetch the RDS CR.
    var rds myapp.RDS
    if err := r.Get(ctx, req.NamespacedName, &rds); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 2. Fetch the dependency (the VPC CR).
    var vpc myapp.VPC
    if err := r.Get(ctx, types.NamespacedName{Namespace: req.Namespace, Name: rds.Spec.VPCRef}, &vpc); err != nil {
        // ERROR: the VPC CR doesn't exist yet, so we cannot create the DB.
        // We must RETRY later.
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }

    // 3. Check whether the VPC is actually ready in AWS.
    if vpc.Status.State != "Ready" {
        // The VPC CR exists, but AWS is still provisioning it. Wait and retry.
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }

    // 4. Now it is safe to call the RDS SDK.
    // ...
    return ctrl.Result{}, nil
}

The Complexity: You are manually writing the logic that mimics Terraform’s graph walker. If you have a chain of 5 resources, you have to manage the “Wait/Retry” loops for all of them.

| Aspect | Terraform | Custom Operator (Go) |
| --- | --- | --- |
| Logic Source | HCL (declarative) | Go (imperative disguised as declarative) |
| Cleanup | Automatic (reverse DAG) | Manual (finalizers). You must write code to ensure a clean teardown. |
| Dependencies | Calculated pre-flight | Polled at runtime. You must handle “missing dependency” errors gracefully. |
| Drift | Only corrected on apply | Self-healing. Continuously corrected. |

If your goal is simply to provision standard infrastructure (VPC, DB, K8s Cluster) for Prod/Dev, writing a custom operator is usually an anti-pattern. You are reimplementing the wheel (Terraform) with a square shape (Go loops).

However, the pattern becomes valid if you use Crossplane. Crossplane is a set of Operators that have already implemented the Cleanup/Finalizers and Dependency Logic for AWS/GCP/Azure. You then write a thin “Composition” layer (YAML) to group them.

If your system requires frequent, dynamic changes (like auto-scaling, spot instance requests, or circuit breaker toggles), Terraform is the wrong tool for those specific attributes.

  • Static Layer (Terraform): VPCs, IAM, Security Groups. (Strict control, drift is bad).
  • Dynamic Layer (App/Operator): Scaling counts, feature flags. (Fluid control, drift is expected).