Provisioning Paradigms: Terraform, SDKs, & K8s Operators
Terraform (Declarative)
You define the desired end state, and the Terraform engine calculates the “delta” required to get there. It abstracts away the complexity of API sequencing.
- Focus: What the infrastructure should look like.
- Logic: Graph-based (DAG). Terraform determines dependencies and parallelism automatically.
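To make the graph-based idea concrete, here is a minimal sketch of dependency ordering using Kahn's algorithm. It is not Terraform's actual implementation; the function name `applyOrder` and the resource addresses are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// applyOrder returns resources in an order where every dependency comes
// before its dependents (Kahn's algorithm). deps maps a resource to the
// resources it depends on. Hypothetical helper, not a Terraform API.
func applyOrder(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}
	dependents := map[string][]string{}
	for res, ds := range deps {
		if _, ok := indegree[res]; !ok {
			indegree[res] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[res]++
			dependents[d] = append(dependents[d], res)
		}
	}
	var ready []string
	for res, n := range indegree {
		if n == 0 {
			ready = append(ready, res)
		}
	}
	sort.Strings(ready) // deterministic output for the example
	var order []string
	for len(ready) > 0 {
		res := ready[0]
		ready = ready[1:]
		order = append(order, res)
		next := dependents[res]
		sort.Strings(next)
		for _, dep := range next {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle detected")
	}
	return order, nil
}

func main() {
	order, err := applyOrder(map[string][]string{
		"aws_instance.web": {"aws_subnet.a"},
		"aws_subnet.a":     {"aws_vpc.main"},
		"aws_vpc.main":     nil,
	})
	fmt.Println(order, err) // [aws_vpc.main aws_subnet.a aws_instance.web] <nil>
}
```

Resources that become ready at the same time have no ordering constraint between them, which is where a real engine can run API calls in parallel.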
Cloud SDKs (Imperative)
You define the exact steps to execute. You are directly invoking the Cloud Provider’s API endpoints (RPC/REST calls).
- Focus: How to provision the infrastructure.
- Logic: Procedural. You control the flow, retries, error handling, and concurrency (for example, with goroutines and channels in Go).
2. State Management & Drift Detection
This is the most critical differentiator for system design:
- Terraform: Maintains a `tfstate` file (a persistent mapping of your configuration to real-world resource IDs).
  - Drift Detection: If you change a tag manually in the AWS Console, Terraform detects this drift on the next `plan` and offers to fix it.
  - Resource Lifecycle: It knows that to delete a VPC, it must first delete the EC2 instances inside it, because it understands the dependency graph stored in the state.
- SDKs: Stateless.
  - The SDK has no memory of what you ran 5 minutes ago.
  - To implement “Drift Detection,” you must write code to:
    - Fetch the resource (`Describe`/`Get`).
    - Compare the returned struct against your desired config.
    - Decide whether to `Update` or `Create`.
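The fetch, compare, decide sequence for SDK-based drift detection can be sketched as follows. The `Bucket` type, the in-memory `fakeCloud`, and the `decide` function are assumptions standing in for real SDK calls, so the example runs without credentials.

```go
package main

import "fmt"

// Bucket is a hypothetical, simplified view of a cloud resource.
type Bucket struct {
	Name string
	Tags map[string]string
}

// fakeCloud stands in for the provider API (an assumption for this sketch).
type fakeCloud map[string]Bucket

// decide fetches the live resource, compares it with the desired config,
// and reports which call is needed: "create", "update", or "noop".
func decide(cloud fakeCloud, desired Bucket) string {
	actual, found := cloud[desired.Name] // the Describe/Get step
	if !found {
		return "create"
	}
	if len(actual.Tags) != len(desired.Tags) {
		return "update"
	}
	for k, v := range desired.Tags {
		if got, ok := actual.Tags[k]; !ok || got != v {
			return "update"
		}
	}
	return "noop"
}

func main() {
	desired := Bucket{Name: "my-test-bucket", Tags: map[string]string{"Environment": "Dev"}}
	cloud := fakeCloud{}
	fmt.Println(decide(cloud, desired)) // create

	// Simulate a manual console edit that added a tag.
	cloud[desired.Name] = Bucket{
		Name: desired.Name,
		Tags: map[string]string{"Environment": "Dev", "Emergency": "True"},
	}
	fmt.Println(decide(cloud, desired)) // update
}
```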
3. Idempotency
- Terraform: Idempotent by default. Running `terraform apply` 100 times results in the same state as running it once (assuming no external changes).
- SDKs: Not idempotent by default. If you run a `CreateInstance` call twice, you will get two instances (or an error if naming conflicts exist). You must implement idempotency logic (e.g., “Check-if-exists-before-create”) manually.
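A minimal sketch of the “check-if-exists-before-create” wrapper is shown below. The `fakeAPI` type and `ensureInstance` function are hypothetical; a real client would make network calls where this one touches a map.

```go
package main

import "fmt"

// fakeAPI counts Create calls; it stands in for a real cloud SDK client.
type fakeAPI struct {
	instances   map[string]bool
	createCalls int
}

func (a *fakeAPI) Exists(name string) bool { return a.instances[name] }

func (a *fakeAPI) Create(name string) {
	a.createCalls++
	a.instances[name] = true
}

// ensureInstance creates the instance only if it is missing and reports
// whether a create call was made.
func ensureInstance(api *fakeAPI, name string) (created bool) {
	if api.Exists(name) {
		return false
	}
	api.Create(name)
	return true
}

func main() {
	api := &fakeAPI{instances: map[string]bool{}}
	for i := 0; i < 100; i++ {
		ensureInstance(api, "web-1")
	}
	fmt.Println(api.createCalls, len(api.instances)) // 1 1
}
```

Note that check-then-create is not atomic: two concurrent callers can both see “missing” and both create. Against a real API you would typically also need a uniqueness constraint or an idempotency token, if the service offers one.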
4. Code Comparison: Creating an S3 Bucket
To illustrate the difference in verbosity and error-handling burden:
Terraform
The provider handles authentication, API versioning, and state mapping.

```hcl
resource "aws_s3_bucket" "example" {
  bucket = "my-test-bucket"

  tags = {
    Environment = "Dev"
  }
}
```

Go SDK (aws-sdk-go-v2)
You manage the context, inputs, pointers, and explicit error propagation.
```go
package main

import (
	"context"
	"errors"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

func main() {
	ctx := context.TODO()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	client := s3.NewFromConfig(cfg)

	input := &s3.CreateBucketInput{
		Bucket: aws.String("my-test-bucket"),
		CreateBucketConfiguration: &types.CreateBucketConfiguration{
			LocationConstraint: types.BucketLocationConstraintUsWest2,
		},
	}

	// You must manually handle the case where the bucket already exists.
	_, err = client.CreateBucket(ctx, input)
	var owned *types.BucketAlreadyOwnedByYou
	switch {
	case err == nil:
		log.Print("bucket created")
	case errors.As(err, &owned):
		log.Print("bucket already exists in this account; nothing to do")
	default:
		// Includes types.BucketAlreadyExists (name taken by another account).
		log.Printf("failed to create bucket: %v", err)
	}
}
```

5. Summary Comparison Table
| Feature | Terraform | Cloud SDK (Go/Python/etc.) |
|---|---|---|
| Primary Use Case | Infrastructure Provisioning & Lifecycle Management. | Application Logic, Dynamic Tasks, Platform Building. |
| State | Managed via State File (.tfstate). | You must build your own state store. |
| Cleanup | terraform destroy removes everything in reverse dependency order. | You must write a script to find and delete resources. |
| Wait Logic | Built-in (e.g., waits for an EC2 IP to be assigned). | Manual (you must poll DescribeInstances until ready, or wire up the SDK's waiter helpers yourself). |
| Locking | State locking prevents concurrent modifications. | No built-in locking mechanism. |
6. When to use which?
Use Terraform when:
- You are defining the static infrastructure backbone (VPCs, Databases, Kubernetes Clusters, IAM Roles).
- You need a “Source of Truth” for your infrastructure.
- You want to empower a team to review infrastructure changes via PRs (`git diff` of HCL).
Use SDKs when:
- Building Internal Platforms: You are writing a Go service that spins up temporary environments for developers on demand.
- Lambda/Application Code: Your application needs to upload a file to S3 or pull a message from SQS at runtime.
- Complex Logic: You need loops, conditionals, or database lookups that are too complex for HCL (HashiCorp Configuration Language).
- Writing Terraform Providers: Interestingly, Terraform Providers are written in Go using these very SDKs. The provider bridges the gap, wrapping the imperative SDK calls into the declarative Terraform lifecycle.
A Note on the “Systems Design” Approach
- Terraform is the Bounded Context definition for your infrastructure. It defines the rigid structure of the system.
- SDKs are for the Anti-Corruption Layer or internal application logic that operates within that infrastructure.
K8s Custom Resources vs. Terraform
When you use Kubernetes Custom Resource Definitions (CRDs) with a Custom Controller (Operator), you are effectively moving the “State Engine” from a state file (Terraform’s tfstate, kept locally or in a remote backend) into the active, distributed database of the cluster (etcd). From a Systems Design perspective, this transitions your infrastructure management from Edge-Triggered (run by a human/CI) to Level-Triggered (run continuously by the cluster).
1. The Reconciliation Loop (The “Go” Logic)
In Terraform, reconciliation happens only when you invoke the binary. In a K8s Operator, reconciliation is a continuous loop.
The Operator Pattern (simplified Go logic): The K8s control plane acts as an event bus. Your Go code (the Controller) subscribes to events on your CRD.
- Observe: Fetch the CR (Custom Resource) from K8s API (Desired State).
- Observe: Fetch the real-world resource (e.g., AWS S3 Bucket) via SDK (Actual State).
- Diff: Compare Desired vs. Actual.
- Act: If different, call AWS SDK `Update` or `Create`.
- Re-queue: Schedule the next check (e.g., in 5 minutes) to ensure no drift occurred outside of events.
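The observe, diff, act, re-queue steps can be sketched without any Kubernetes or AWS dependency. The `cloud` map below is an in-memory stand-in for the provider API, and `reconcile` is a hypothetical function, not the controller-runtime interface.

```go
package main

import (
	"fmt"
	"time"
)

// cloud is an in-memory stand-in for the provider API: bucket name -> tags.
type cloud map[string]map[string]string

// reconcile performs one level-triggered pass: observe actual state,
// diff it against desired, act, and tell the caller when to check again.
func reconcile(c cloud, name string, desired map[string]string) (action string, requeueAfter time.Duration) {
	const resync = 5 * time.Minute
	actual, exists := c[name]
	switch {
	case !exists:
		c[name] = copyTags(desired)
		return "created", resync
	case !equalTags(actual, desired):
		c[name] = copyTags(desired)
		return "updated", resync
	default:
		return "in-sync", resync
	}
}

func equalTags(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if got, ok := b[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func copyTags(m map[string]string) map[string]string {
	out := make(map[string]string, len(m))
	for k, v := range m {
		out[k] = v
	}
	return out
}

func main() {
	c := cloud{}
	desired := map[string]string{"Environment": "Dev"}
	a1, _ := reconcile(c, "my-test-bucket", desired)
	a2, _ := reconcile(c, "my-test-bucket", desired)
	delete(c, "my-test-bucket") // someone deletes the bucket out of band
	a3, after := reconcile(c, "my-test-bucket", desired)
	fmt.Println(a1, a2, a3, after) // created in-sync created 5m0s
}
```

Because each pass compares full desired state against full actual state, it does not matter which event triggered it or how many events were missed; that is the practical meaning of level-triggered.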
2. Terraform vs. Custom Operators: Architectural Comparison
| Feature | Terraform | K8s Operator (CRD + Controller) |
|---|---|---|
| State Storage | terraform.tfstate (JSON file, local or in a remote backend). Concurrent runs rely on backend state locking. | etcd. Highly available, distributed key-value store. Built-in optimistic concurrency (resourceVersion). |
| Drift Correction | Passive. Drift is only detected when a pipeline runs terraform plan. | Active (Self-Healing). The controller runs constantly. If someone deletes the bucket manually, the operator recreates it on its next reconcile; because the controller watches Kubernetes objects rather than the cloud API, that is usually the next periodic re-queue. |
| API Surface | HCL (HashiCorp Configuration Language). | YAML / K8s API. You can use kubectl to manage cloud resources. |
| Complexity | Low to Medium. Binary execution. | High. Requires a running K8s cluster, deployment management, and RBAC handling. |
| Latency | High (CI pipeline startup time). | Low (Running Go binary inside the cluster). |
3. The “Crossplane” Context
It is worth noting that what you are describing (writing CRDs to manage cloud resources) is exactly what the open-source project Crossplane does.
Instead of writing a custom Go operator to manage an S3 bucket or RDS instance, Crossplane provides these CRDs and Controllers out of the box. Ideally, you would only write a Custom Operator if:
- You have complex domain-specific logic that combines multiple cloud resources (e.g., a “Tenant” CRD that spins up a VPC + RDS + Namespace).
- You are orchestrating a legacy internal system with no existing provider.
4. Implementation in Go: The “Controller-Runtime”
Since you prioritize idiomatic Go and robust design, writing an operator typically involves the kubernetes-sigs/controller-runtime library.
While you avoid 3rd party packages, controller-runtime is the de-facto standard implementation of the K8s API machinery. Writing a raw controller using only client-go is possible but involves massive boilerplate for leader election, work queues, and cache syncing.
A domain-first approach to an Operator:
- Define the API (Structs):

```go
// MyDatabaseSpec defines the desired state.
type MyDatabaseSpec struct {
	// +kubebuilder:validation:Required
	EngineVersion string `json:"engineVersion"`
	StorageGB     int    `json:"storageGB"`
}

// MyDatabaseStatus defines the observed state.
type MyDatabaseStatus struct {
	Endpoint string `json:"endpoint"`
	Status   string `json:"status"` // e.g., "Ready", "Provisioning"
}
```

- The Reconcile Function: This is where your defensive error handling comes in. You must assume the network will fail, the cloud API will rate-limit you, and the resource might be half-created.

```go
func (r *MyDatabaseReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// 1. Fetch the CR (it may already be gone, which is not an error).
	var db myappv1.MyDatabase
	if err := r.Get(ctx, req.NamespacedName, &db); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	logger.Info("reconciling", "name", req.NamespacedName)

	// 2. Logic: Check Cloud Provider (Idempotency is key here)
	// If it doesn't exist, create it via SDK.
	// If it exists but spec differs, update it.

	// 3. Update Status in K8s
	db.Status.Status = "Ready"
	if err := r.Status().Update(ctx, &db); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```
When to use which?
- Use Terraform for Static Infrastructure (The “Base”): The VPC, the K8s cluster itself, the networking foundation. These change rarely, and “Active Reconciliation” is overkill.
- Use Operators for Dynamic/App-Centric Infrastructure (The “Layer”): Databases, Queues, Buckets that belong to a specific microservice. This allows developers to provision infra simply by committing a YAML file alongside their code, and the cluster ensures that infra exists and stays correct.
Terraform’s architecture is “Client-Side Reconciliation.” It only knows the truth when you explicitly wake it up.
1. The “Refresh” Phase: How Terraform sees the Console Change
When you run terraform plan, Terraform performs a three-step dance. It doesn’t just look at your file; it queries the live API.
- Load State: It loads `terraform.tfstate` (what it thinks exists).
- Refresh (The API Calls): It iterates over every resource in the state file and calls the provider’s `Read` method (e.g., `aws_s3_bucket.Read`).
  - Scenario: You manually added a tag `Emergency=True` in the AWS Console.
  - Result: The API returns the resource with the new tag. Terraform updates its in-memory view of the world to match reality.
- Diff: It compares Config (HCL) vs. Refreshed State (Reality).
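The final diff step can be illustrated with a small sketch that compares tags from config against tags read back from the live API. The `planTags` function and its output format are assumptions that only loosely imitate plan output; this is not how Terraform itself is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// planTags compares tags declared in config with tags refreshed from the
// live API and returns human-readable changes, sorted for stable output.
func planTags(config, refreshed map[string]string) []string {
	var changes []string
	for k, want := range config {
		got, ok := refreshed[k]
		switch {
		case !ok:
			changes = append(changes, fmt.Sprintf("+ %s = %q", k, want))
		case got != want:
			changes = append(changes, fmt.Sprintf("~ %s = %q -> %q", k, got, want))
		}
	}
	for k, got := range refreshed {
		if _, ok := config[k]; !ok {
			// Present in reality but not in config: scheduled for removal.
			changes = append(changes, fmt.Sprintf("- %s = %q -> null", k, got))
		}
	}
	sort.Strings(changes)
	return changes
}

func main() {
	config := map[string]string{"Environment": "Dev"}
	// Refreshed state after the manual console edit.
	refreshed := map[string]string{"Environment": "Dev", "Emergency": "True"}
	for _, c := range planTags(config, refreshed) {
		fmt.Println(c) // - Emergency = "True" -> null
	}
}
```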
2. The Consequence: “The War for Control”
If your HCL does not have that tag, Terraform calculates a delta that removes it.

```text
  ~ resource "aws_s3_bucket" "example" {
      ~ tags = {
            "Environment" = "Dev"
          - "Emergency"   = "True" -> null
        }
    }
```

Why this gets “out of hand”: If you have a workflow where operators must use the console (e.g., for hotfixes, capacity adjustments, or emergency toggles), Terraform becomes a destructive force.
- The Next Deployment: When the CI pipeline runs a routine deployment next week, it will revert your manual hotfix (silently, if nobody reviews the plan), potentially causing the outage to recur.
- Unmanaged Resources: If you create new resources in the console (e.g., a new EC2 instance), Terraform does not know they exist. It doesn’t track them. It only tracks what is in its state file. You now have “Shadow IT”: infrastructure that is unmanaged and invisible to your IaC.
3. The Solution: lifecycle Blocks (The Anti-Corruption Layer)
Since you cannot always ban console access, Terraform provides a mechanism to explicitly ignore specific types of drift. This is effectively an “Ignore Filter” in the reconciliation logic.
In your HCL, you can instruct the graph walker to ignore differences in specific fields:
```hcl
resource "aws_autoscaling_group" "worker_pool" {
  desired_capacity = 5
  max_size         = 10
  min_size         = 2

  # The "Hybrid Workflow" Safety Valve
  lifecycle {
    ignore_changes = [
      # Ignore manual scaling events made in the console/CloudWatch
      desired_capacity,
      # Ignore tags added by external systems (e.g., Cost Explorer)
      tags["CostCenter"],
    ]
  }
}
```

With this block:
- You change `desired_capacity` to `8` in the Console.
- `terraform plan` runs.
- It sees the drift (5 vs 8).
- It checks `ignore_changes`.
- It suppresses the diff. It says “No changes needed.”
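Conceptually, `ignore_changes` behaves like a filter applied to the diff. The sketch below illustrates that idea only; `diffAttrs` is a hypothetical function, and integer attributes are used to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// diffAttrs returns the attributes whose live value differs from config,
// skipping any attribute listed in ignore.
func diffAttrs(config, live map[string]int, ignore []string) []string {
	skip := make(map[string]bool, len(ignore))
	for _, k := range ignore {
		skip[k] = true
	}
	var drifted []string
	for k, want := range config {
		if !skip[k] && live[k] != want {
			drifted = append(drifted, k)
		}
	}
	sort.Strings(drifted)
	return drifted
}

func main() {
	config := map[string]int{"desired_capacity": 5, "max_size": 10, "min_size": 2}
	live := map[string]int{"desired_capacity": 8, "max_size": 10, "min_size": 2}

	fmt.Println(diffAttrs(config, live, nil))                          // [desired_capacity]
	fmt.Println(diffAttrs(config, live, []string{"desired_capacity"})) // []
}
```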
4. Summary: Terraform vs. Active Reconciliation (K8s)
This highlights the fundamental architectural difference we discussed:
| Feature | Terraform (Client-Side) | K8s Operator (Server-Side) |
|---|---|---|
| Drift Reaction | Delayed & Explicit. Drift persists until the next apply. | Automatic. Changes to the CR trigger a reconcile right away (via watch event); changes made in the cloud console are reverted on the next periodic re-queue. |
| Manual Edits | Dangerous. Risk of accidental reversion during next deploy. | Futile. You change it in console; the operator changes it back on its next reconcile. |
| Shadow Resources | Ignored. New resources created in console are invisible to TF. | Ignored. Unless you write a controller specifically to scan for “Adoptable” resources. |
Kubernetes’ Imperative Operator+Controller logic to Declarative YAML vs Terraform’s declarative manifests
In Terraform, the Provider gives you the CRUD logic, and Core gives you the Lifecycle (Dependency Graph + Cleanup Order). When you write a Custom Controller in Go, you become the Platform Engineer and the Core Engine. You are responsible for implementing the logic that Terraform gives you for free.
1. Reusability: tfvars vs. Custom Resources (CRs)
You asked about using the “same structure” for Prod and Dev.
- Terraform: You write a Module (HCL). You instantiate it twice using different input variables (`prod.tfvars`, `dev.tfvars`).
- Operator: You write a CRD (The Schema/Struct). You instantiate it twice using different CRs (YAML files).
The Go Implementation: Your Go struct is the schema.

```go
// The "Module" Definition (CRD)
type MyInfrastructureSpec struct {
	InstanceSize string `json:"instanceSize"` // "t3.micro" for dev, "m5.large" for prod
	Environment  string `json:"environment"`  // "dev" or "prod"
}
```

The Usage (YAML):
Instead of terraform apply -var-file=..., you apply specific YAMLs:

```yaml
kind: MyInfrastructure
metadata: { name: "prod-stack" }
spec:
  instanceSize: "m5.large"
  environment: "prod"
```

Verdict: This part is easy. The reusability is effectively the same. The CRD acts as the Class, the CR acts as the Instance.
2. The Hard Part: Cleanup (The “Finalizer” Pattern)
This is where your assumption is 100% correct. Terraform knows to delete resources in reverse dependency order automatically. K8s does not do this for external cloud resources.
If you delete a CR (Custom Resource), K8s simply deletes the record from etcd. It does not automatically call the AWS SDK to delete the S3 bucket associated with it. You must implement the Finalizer Pattern in Go.
The “Cleanup Protocol” (Your Go Logic):
- Add Finalizer: On creation (`Reconcile`), you must append a string (e.g., `my.infra/finalizer`) to the metadata of the resource. This tells K8s: “Do not hard-delete this object until I say so.”
- Intercept Deletion: When a user runs `kubectl delete`, K8s sets a `deletionTimestamp` on the object. It does not remove it.
- Execute Cleanup: Your Reconcile loop runs. It sees `deletionTimestamp` is not zero.
  - You initialize the AWS SDK.
  - You call `s3Client.DeleteBucket`.
  - Crucial Defensive Coding: You must handle cases where the bucket is already gone (idempotency) or cannot be deleted (not empty).
- Remove Finalizer: Only after the SDK confirms deletion do you remove the string from metadata. K8s then garbage collects the object.
If your Go code keeps crashing or failing during the Execute Cleanup step, the finalizer is never removed and the resource stays “stuck” in the Terminating state until cleanup succeeds or someone removes the finalizer by hand.
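The cleanup protocol above can be simulated with plain Go. The `Object` struct mimics only the two metadata fields the pattern relies on, and `deleteBucket` is a hypothetical hook standing in for the SDK call; none of this is the real Kubernetes API.

```go
package main

import (
	"errors"
	"fmt"
)

const finalizer = "my.infra/finalizer"

var errNotEmpty = errors.New("bucket not empty")

// Object mimics the metadata the finalizer pattern relies on.
type Object struct {
	Finalizers []string
	Deleting   bool // stands in for a non-zero deletionTimestamp
}

// reconcile reports what it did. deleteBucket must tolerate "already gone".
func reconcile(obj *Object, deleteBucket func() error) (string, error) {
	if !obj.Deleting {
		if !has(obj.Finalizers, finalizer) {
			obj.Finalizers = append(obj.Finalizers, finalizer)
			return "finalizer added", nil
		}
		return "normal reconcile", nil
	}
	if !has(obj.Finalizers, finalizer) {
		return "nothing to clean up", nil
	}
	if err := deleteBucket(); err != nil {
		// The finalizer stays, so the object remains Terminating.
		return "cleanup failed, will retry", err
	}
	obj.Finalizers = remove(obj.Finalizers, finalizer)
	return "finalizer removed", nil
}

func has(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

func remove(list []string, s string) []string {
	var out []string
	for _, v := range list {
		if v != s {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	obj := &Object{}
	fmt.Println(reconcile(obj, nil)) // finalizer added <nil>
	obj.Deleting = true              // user ran kubectl delete
	fmt.Println(reconcile(obj, func() error { return errNotEmpty }))
	fmt.Println(reconcile(obj, func() error { return nil })) // finalizer removed <nil>
	fmt.Println(len(obj.Finalizers))                         // 0
}
```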
3. Resource Relations & Dependency Graph
Terraform calculates the DAG before execution. An Operator discovers dependencies at Runtime (Eventual Consistency).
Scenario: You need an AWS VPC created before you create an RDS Database.
Terraform: Implicitly waits. It won’t start the RDS API call until the VPC API call returns an ID.
Custom Operator: You must write “Requeue Logic.”
```go
// apierrors is k8s.io/apimachinery/pkg/api/errors.
func (r *RDSReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the RDS CR
	var rds myapp.RDS
	if err := r.Get(ctx, req.NamespacedName, &rds); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Fetch the dependency (The VPC CR).
	// Assumption: the VPC CR lives in the same namespace as the RDS CR.
	var vpc myapp.VPC
	key := types.NamespacedName{Namespace: req.Namespace, Name: rds.Spec.VPCRef}
	if err := r.Get(ctx, key, &vpc); err != nil {
		if apierrors.IsNotFound(err) {
			// The VPC CR doesn't exist yet, so we cannot create the DB.
			// We must RETRY later.
			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
		}
		return ctrl.Result{}, err
	}

	// 3. Check if the VPC is actually ready in AWS
	if vpc.Status.State != "Ready" {
		// The VPC CR exists, but AWS is still provisioning it.
		// Wait and Retry.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}

	// 4. Now safe to call RDS SDK
	// ...
	return ctrl.Result{}, nil
}
```

The Complexity: You are manually writing the logic that mimics Terraform’s graph walker. If you have a chain of 5 resources, you have to manage the “Wait/Retry” loops for all of them.
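To see the wait/retry behaviour in isolation, here is a dependency-free sketch. The `store` map stands in for the Kubernetes API and `reconcileRDS` is a hypothetical function; only the requeue decision is modelled.

```go
package main

import (
	"fmt"
	"time"
)

// store maps a CR name to its status; it stands in for the Kubernetes API.
type store map[string]string

// reconcileRDS reports whether the database may be created now and, if
// not, how long to wait before the next attempt.
func reconcileRDS(s store, vpcRef string) (ready bool, requeueAfter time.Duration) {
	state, found := s[vpcRef]
	if !found || state != "Ready" {
		// Dependency missing or still provisioning: try again later.
		return false, 10 * time.Second
	}
	return true, 0
}

func main() {
	s := store{}
	fmt.Println(reconcileRDS(s, "main-vpc")) // false 10s
	s["main-vpc"] = "Provisioning"
	fmt.Println(reconcileRDS(s, "main-vpc")) // false 10s
	s["main-vpc"] = "Ready"
	fmt.Println(reconcileRDS(s, "main-vpc")) // true 0s
}
```

Each resource in a chain needs its own version of this check, which is the per-resource cost the paragraph above describes.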
Summary: The Trade-off
| Aspect | Terraform | Custom Operator (Go) |
|---|---|---|
| Logic Source | HCL (Declarative) | Go (Imperative disguised as Declarative) |
| Cleanup | Automatic (Reverse DAG) | Manual (Finalizers). You must write code to ensure clean teardown. |
| Dependencies | Calculated Pre-flight | Polled at Runtime. You must handle “Missing Dependency” errors gracefully. |
| Drift | Only corrected on apply | Self-Healing. Continuously corrected. |
Systems Design
If your goal is simply to provision standard infrastructure (VPC, DB, K8s Cluster) for Prod/Dev, writing a custom operator is usually an anti-pattern. You are reinventing the wheel (Terraform) with a square shape (Go loops).
However, the pattern becomes valid if you use Crossplane. Crossplane is a set of Operators that have already implemented the Cleanup/Finalizers and Dependency Logic for AWS/GCP/Azure. You then write a thin “Composition” layer (YAML) to group them.
If your system requires frequent, dynamic changes (like auto-scaling, spot instance requests, or circuit breaker toggles), Terraform is the wrong tool for those specific attributes.
- Static Layer (Terraform): VPCs, IAM, Security Groups. (Strict control, drift is bad).
- Dynamic Layer (App/Operator): Scaling counts, Feature Flags. (Fluid control, drift is expected).