DevOps Interview Questions

Prepare for your next DevOps interview with our curated list of questions covering CI/CD, Git, Docker, Kubernetes, Terraform, and system design.


`git merge` and `git rebase` both integrate changes from one branch into another, but they do it differently:

`git merge`: Creates a new 'merge commit' that ties together the histories of both branches. It is non-destructive and preserves the exact history, but can lead to a cluttered commit log.

`git rebase`: Moves the entire feature branch to begin on the tip of the main branch, rewriting project history by creating brand new commits. It results in a cleaner, linear history. However, never rebase commits that exist on a public/shared branch, as it rewrites history for everyone.
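A throwaway-repo demo (all names hypothetical) makes the difference in history visible; it assumes `git` ≥ 2.28 for `init -b`:

```shell
set -e
# create a disposable repo with one commit on main
repo=$(mktemp -d); cd "$repo"
git init -q -b main
git config user.email dev@example.com
git config user.name dev
echo one > a.txt; git add a.txt; git commit -qm "main: add a.txt"

# diverge: one commit on feature, one more on main
git checkout -qb feature
echo two > b.txt; git add b.txt; git commit -qm "feature: add b.txt"
git checkout -q main
echo three > c.txt; git add c.txt; git commit -qm "main: add c.txt"

# merge: a merge commit ties both histories together
git checkout -qb merged main
git merge -q --no-edit feature

# rebase: feature's commit is replayed on top of main -> linear history
git checkout -q feature
git rebase -q main
```

After this, `git log --oneline --graph merged` shows a merge commit joining two lines of history, while `git log --oneline feature` is a straight line of three commits.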

`git cherry-pick` allows you to apply the changes from a specific commit (or range of commits) from one branch onto another branch, without merging the entire branch.

Use cases:

1. You have a bug fix on a feature branch that needs to be quickly applied to the main/production branch without merging all feature branch code.

2. You accidentally committed to the wrong branch and need to move that specific commit.

Example: `git cherry-pick abc1234` applies only that one commit to the current branch.

Blue/Green deployment is a release technique that reduces downtime and risk by running two identical production environments called Blue and Green.

At any time, only one environment is live (e.g., Blue). As you prepare a new release, you deploy and test it in the idle environment (Green).

Once testing is complete, you switch the router (e.g., load balancer or DNS) so all traffic goes to Green. Green is now live. This provides a near-instant rollback mechanism — just switch the router back to Blue if anything goes wrong.
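In Kubernetes, the "router switch" can be as simple as editing a Service's label selector (names and ports here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # hypothetical service name
spec:
  selector:
    app: my-app
    version: green        # flip "blue" <-> "green" to switch live traffic
  ports:
    - port: 80
      targetPort: 8080
```

Rollback is the same one-line change in reverse: set `version: blue` and re-apply.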

Continuous Integration (CI): Developers frequently merge code changes into a shared repository (multiple times a day). Each merge triggers an automated build and test suite to catch bugs early.

Continuous Delivery (CD): An extension of CI where all code changes that pass the automated tests are automatically released to a staging/pre-production environment. A human still approves the final production deployment.

Continuous Deployment: Goes one step further — every change that passes all automated stages is automatically deployed to production with NO manual approval required. Requires very mature test coverage.

An Image is a lightweight, standalone, executable package that includes everything needed to run an application: code, runtime, system libraries and settings. It is read-only and immutable — like a class definition.

A Container is a runtime instance of an image — what the image becomes in memory when executed. You can run multiple containers from a single image. A container is mutable; a read-write layer is added on top of the image when it starts. Think of it as an object instantiated from a class.

Both COPY and ADD instructions copy files from the host into the Docker image, but ADD has extra features:

`COPY`: Simply copies files/directories from a source on the host to the destination in the container. It is straightforward and predictable. Preferred for most use cases.

`ADD`: Does everything COPY does, plus:

1. It can automatically extract local tar archives (e.g., .tar.gz) into the destination.

2. It can download files from a remote URL (though using curl/wget is more transparent).

Best Practice: Use COPY unless you specifically need the extra features of ADD.
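A short Dockerfile sketch of both instructions (paths and archive name are illustrative):

```dockerfile
FROM alpine:3.19

# COPY: predictable byte-for-byte copy -- preferred for most cases
COPY app/ /opt/app/

# ADD: also auto-extracts a local tar archive into the destination
ADD vendor.tar.gz /opt/vendor/

# ADD can fetch remote URLs too, but an explicit RUN step is more transparent:
# RUN wget -O /tmp/tool.bin https://example.com/tool.bin
```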

Containers in the same Pod share the same network namespace, including the IP address and port space.

Therefore, they can communicate with each other using `localhost` and the port the other container is listening on. They can also share data by mounting the same `emptyDir` volume:

```yaml
volumes:
  - name: shared-data
    emptyDir: {}
```

Both containers mount `/shared-data` and can read/write the same files there.
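A minimal two-container Pod illustrating both sharing mechanisms (image names and commands are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sidecar-demo
spec:
  volumes:
    - name: shared-data
      emptyDir: {}            # lives as long as the Pod does
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo hello > /shared-data/msg; sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /shared-data
    - name: reader
      image: busybox
      # same network namespace: could also reach the writer via localhost
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: shared-data
          mountPath: /shared-data
```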

Deployment: Manages stateless applications. Pods are interchangeable — they can be created, deleted, and replaced in any order. Pod names are random (e.g., web-abc123). Best for web servers, API services.

StatefulSet: Manages stateful applications (databases, message queues). Pods have stable, unique network identities (e.g., db-0, db-1, db-2) and stable storage (PersistentVolumes). They are created/deleted in a strict sequential order. Best for applications that require persistent identity and storage, like MySQL, Kafka, or Elasticsearch.

The Terraform state file (`terraform.tfstate`) is a JSON file that records the current state of your managed infrastructure. It maps real-world resources to your configuration, tracks metadata, and helps Terraform determine what changes to make during a `terraform plan` or `apply`.

For team collaboration:

1. Never store the state file in version control (it may contain secrets).

2. Use a remote backend (AWS S3 + DynamoDB, or Terraform Cloud) to store state centrally.

3. Enable state locking (via DynamoDB) to prevent concurrent modifications from two team members running `terraform apply` simultaneously.
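Points 2 and 3 together look like this in a typical backend block (bucket, key, and table names are hypothetical):

```hcl
terraform {
  backend "s3" {
    bucket         = "my-org-tfstate"              # hypothetical bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"             # enables state locking
    encrypt        = true                          # server-side encryption at rest
  }
}
```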

The three pillars of observability:

Metrics: Numerical measurements collected over time (e.g., CPU usage = 78%, HTTP requests per second = 500). They are great for dashboards and alerting. Tools: Prometheus, CloudWatch.

Logs: Timestamped records of discrete events generated by an application or system (e.g., 'ERROR: DB connection failed at 14:32:01'). They provide rich context for debugging. Tools: ELK Stack, Datadog Logs.

Traces: Track a single request's journey through multiple services in a distributed system, showing which service took how long. They are essential for identifying bottlenecks in microservices. Tools: Jaeger, AWS X-Ray, Datadog APM.

A zombie process (also called a 'defunct' process) is a process that has finished executing but still has an entry in the process table. This happens when a child process exits but its parent process hasn't yet called `wait()` to read the exit status.

Zombie processes don't consume CPU or memory, but they do consume a PID slot. If too many accumulate, no new processes can be created.

To identify: `ps aux | awk '$8 ~ /Z/'` (a `Z` in the STAT column marks a defunct process).

To handle: The real fix is to kill the parent process (which causes init/systemd to adopt and reap the zombie). You cannot kill a zombie process itself since it is already dead.
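A short Linux-only sketch that manufactures a zombie and then reaps it (assumes `/proc` is available):

```python
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)                # child exits immediately
else:
    time.sleep(0.5)            # parent has NOT called wait() yet
    with open(f"/proc/{pid}/stat") as f:
        # /proc/<pid>/stat format: "pid (comm) STATE ..." -- field after ')'
        state = f.read().rsplit(")", 1)[1].split()[0]
    print(state)               # 'Z' while the child is unreaped
    os.waitpid(pid, 0)         # reaping removes the process-table entry
```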

Authentication (AuthN): Verifying who you are. It is the process of confirming the identity of a user or service. Example: Entering your username and password to prove you are who you claim to be.

Authorization (AuthZ): Verifying what you are allowed to do. It determines the permissions and access rights of an authenticated user. Example: After logging in (authenticated), the system checks if your account has permission to access the admin panel (authorized).

Simple analogy: Authentication is showing your ID at the door. Authorization is the bouncer checking your name on the guest list.

The primary goal of CI is to detect integration errors as quickly as possible. By having developers merge their code changes into a central repository frequently (sometimes several times a day), automated builds and tests are triggered, ensuring the mainline codebase remains stable.

IaC is the process of managing and provisioning computing infrastructure through machine-readable definition files (like Terraform or CloudFormation) rather than physical hardware configuration or interactive manual configuration tools.

A Virtual Machine runs a full guest operating system on top of a hypervisor, consuming significant memory and CPU. A Docker Container isolates applications at the process level, sharing the host machine's OS kernel directly, making containers extremely lightweight, fast to start, and massively portable.

Version control is a system that records changes to a file or set of files over time so you can recall specific versions later. Git is the standard because it is distributed (every developer has a full local copy of the history), highly performant, and handles branching/merging flawlessly.

Jenkins is an open-source automation server. It acts as the orchestration engine that integrates various DevOps stages, triggering code compilation, executing automated unit tests, and pushing the final built artifacts to a deployment server or registry.

It sits in front of your application servers and routes client requests across all servers capable of fulfilling them. If a server crashes, the load balancer detects the failure via health checks and instantly redirects traffic to the remaining healthy servers.

Microservices are an architectural style that structures an application as a collection of small, autonomous services modeled around a business domain. They communicate via lightweight APIs and can be developed, deployed, and scaled independently.

Kubernetes (K8s) is an open-source container orchestration engine. While Docker spins up single containers, K8s automates the deployment, scaling, load balancing, and management of thousands of containerized applications across a cluster of worker nodes.

Continuous Integration automatically builds and tests code when it's committed to Git. Continuous Delivery automatically prepares the code for release to staging. Continuous Deployment goes one step further and automatically deploys every validated change directly into production without manual intervention.

Because DevOps focuses on rapid deployments, monitoring and logging provide the essential feedback loop. Logs help debug specific errors when a deployment fails, while monitoring metrics (like CPU usage) trigger auto-scaling events to ensure system stability.

I would heavily parallelize the pipeline. E2E tests can be split across multiple Jenkins worker nodes executing simultaneously. I would also mandate that unit tests run first, instantly failing the build if they don't pass, before the slower integration tests ever start. Finally, I'd implement aggressive Docker layer caching and ensure dependencies are cached locally on the build runners.

I would explicitly reject baking secrets into the Docker image, as anyone with access to the registry can extract them. I would instruct them to create a Kubernetes Secret manifest. The password would be stored in the K8s etcd database (preferably encrypted at rest) and injected into the Spring Boot Pod at runtime via an environment variable or a mounted tmpfs volume.
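A minimal sketch of the manifest and its injection (names and the value are illustrative; in practice the Secret is created by CI or `kubectl`, never committed to Git):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
type: Opaque
stringData:
  DB_PASSWORD: s3cr3t          # illustrative placeholder value
---
# In the Deployment's Pod spec, reference it like so:
#   containers:
#     - name: app
#       envFrom:
#         - secretRef:
#             name: db-credentials
```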

I would use ECS rolling updates or a Blue/Green deployment via AWS CodeDeploy. A rolling update has ECS spin up tasks with the new image, wait for them to pass the Application Load Balancer (ALB) health checks, then gracefully drain connections from the old tasks before terminating them, so no traffic is dropped.

Because Terraform state records what it managed to create before failing, I would simply re-run `terraform apply`. Terraform diffs the tracked partial state against the desired configuration and picks up where it left off. I might also raise the retry count in the AWS provider block or reduce parallelism on the CLI to avoid hitting the API limits again.
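The throttling mitigations can be sketched concretely (values are illustrative):

```hcl
provider "aws" {
  region      = "us-east-1"
  max_retries = 10   # AWS SDK retries throttled calls with backoff
}

# and/or reduce concurrent API calls at apply time:
#   terraform apply -parallelism=5
```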

Ansible uses an agentless Push model. The control node initiates an SSH connection directly to the target servers and pushes the Python modules to execute tasks. Puppet and Chef use a Pull model, requiring a dedicated agent daemon installed on every target server that continuously polls a central master server to fetch and apply configuration changes.

I would move the Dockerfiles to multi-stage builds. The first stage contains the heavy SDKs and compilers needed to build the application artifact. The second (final) stage copies only the built binary/artifact into a minimal production base image, such as Alpine Linux or Google Distroless, often cutting the final footprint to 50-100 MB.
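A two-stage sketch for a Go service (module path and binary name are hypothetical):

```dockerfile
# Stage 1: heavy build toolchain
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app   # hypothetical package path

# Stage 2: minimal runtime image -- only the binary ships
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```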

This is known as 'Configuration Drift': the actual infrastructure no longer matches the codified intent. To fix it, I would run `terraform plan`, which detects that the remote security group differs from the local HCL, then `terraform apply` to revert the manual change in AWS and bring reality back in line with the codebase.

I would weigh a multi-repo layout, which isolates each service, against a well-structured monorepo (using tools like Bazel or Nx). With multi-repo, each microservice gets its own dedicated CI/CD pipeline, allowing independent scaling and deployments, reducing blast radius, and decoupling team dependencies.

Prometheus discovers targets dynamically through its Kubernetes service discovery engine (`kubernetes_sd_configs`). It queries the K8s API server for objects like Endpoints, Pods, or Services and updates its scrape pool automatically as containers scale up or crash.
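A common scrape-config sketch that discovers Pods and keeps only those opting in via annotation (the `prometheus.io/scrape` annotation is a widely used convention, not a Prometheus built-in):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod              # watch the K8s API for Pod objects
    relabel_configs:
      # keep only Pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```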

I would add the `until` and `retries` keywords to the specific Ansible tasks that fetch over the network. This retries the module a set number of times, with a fixed `delay` between attempts, before failing that host.
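A task-level sketch (URL and timings are illustrative; note Ansible retries at a fixed `delay`, it does not back off exponentially):

```yaml
- name: Fetch build artifact despite flaky networking
  ansible.builtin.get_url:
    url: https://artifacts.example.com/app.tar.gz   # hypothetical URL
    dest: /tmp/app.tar.gz
  register: fetch_result
  until: fetch_result is succeeded
  retries: 5     # attempts before failing this host
  delay: 10      # fixed seconds between attempts
```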

Rolling back the Docker image via Kubernetes is trivial (`kubectl rollout undo`); the database schema is the hard part. I mandate an Expand/Contract, strictly backward-compatible migration pattern: never rename or drop columns in the immediate migration, only add them. The old app ignores the new columns; the new app writes to both. If the new app fails, rolling back the container is safe because the old app still understands the expanded schema. The destructive 'Contract' cleanup happens only weeks later.

Terraform state is effectively plain-text JSON. I would configure a remote backend (such as an isolated S3 bucket) with strict IAM policies restricting access to the CI/CD pipeline role alone. Critically, I would enable KMS encryption at rest on the bucket, so that even an engineer who gains read access to the S3 object cannot decrypt its contents without the KMS key.

The Pods likely lack `resources.requests` and `resources.limits`. When node memory fills up, the kubelet evicts Pods in the 'BestEffort' QoS class first. I would enforce an OPA/Gatekeeper admission policy that blocks any deployment submitted without minimum CPU/RAM requests, making scheduling across the worker nodes predictable.
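The container-level settings that move a Pod out of the BestEffort class (values are illustrative):

```yaml
containers:
  - name: api
    image: myorg/api:1.0        # hypothetical image
    resources:
      requests:
        cpu: "250m"             # the scheduler reserves this much
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "512Mi"         # exceeding this gets the container OOM-killed
```

Setting requests equal to limits yields the Guaranteed QoS class, which is evicted last.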

I would implement Argo CD Image Updater. When the CI pipeline builds and pushes `myapp:v2.0` to the Docker registry, the Image Updater polls the registry; on detecting the new semver tag, it commits the updated Helm values back to the manifest Git repository, closing the loop without manual engineer intervention.

I would move the compute backend to Kubernetes on AWS Spot Instances via the Karpenter autoscaler. Because the analytical pipelines are fault-tolerant and stateless, if AWS reclaims the spot capacity, Karpenter respawns the workloads on other available cheap instances, often cutting the compute bill by 70-90%.

Istio uses a Mutating Admission Webhook inside Kubernetes. Whenever a Pod is scheduled, the webhook intercepts the API request and injects a secondary 'Envoy proxy' sidecar container into the Pod. All incoming and outgoing traffic is transparently redirected through this sidecar via iptables rules, and the Envoy proxies perform the mTLS handshake on behalf of the application, which needs no code changes.

This is the classic 'works on my machine' problem, here caused by host architecture. A developer on an Apple M-series chip builds arm64 Docker images natively; if the CI runner executes on x86_64, native C++ bindings compiled inside the Node.js or Python packages will fail. The fix is to build multi-arch images explicitly with `docker buildx` (e.g., `--platform linux/amd64,linux/arm64`).

Standard Jenkins `input` steps provide weak audit guarantees. I would integrate ServiceNow or Jira webhook callbacks into the Jenkins pipeline: the pipeline halts pending an authenticated API callback from the ITSM system that identifies the IAM entity who clicked 'Approve' inside the audited ticketing system, and the exact payload is logged to ELK.

It doesn't. A common misconception is that Terraform is cloud-agnostic. Terraform's syntax (HCL) is agnostic, but `aws_instance` resources mean nothing to GCP. I would have to rewrite the modules against `google_compute_instance` and related APIs, mapping the same abstract architecture onto entirely different provider blocks.

This is a cardinality explosion. An application is likely injecting unbounded values (UUIDs, timestamps, IP addresses) into metric labels, creating millions of unique time series in memory. I would configure `metric_relabel_configs` in Prometheus to drop or rewrite the offending high-cardinality labels before ingestion, protecting the TSDB.
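A hedged example that strips one high-cardinality label at scrape time (the target and the `request_id` label are hypothetical):

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ["app:9090"]   # hypothetical target
    metric_relabel_configs:
      # drop the per-request UUID label before samples reach the TSDB
      - action: labeldrop
        regex: request_id
```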

The pipeline executes in phases. Phase 1: pre-commit Git hooks run fast SAST (static analysis) and secrets detection (TruffleHog) locally. Phase 2: Jenkins clones the repository and enforces SonarQube quality gates. Phase 3: the Docker image is built and immediately run through an image scanner (Trivy/Clair); the pipeline halts if any CVE exceeds the 'High' threshold. Phase 4: the container is cryptographically signed via Sigstore/Cosign. Finally, a Kubernetes admission controller refuses to schedule any Pod whose image hash lacks a verified signature, preventing bypasses.

Relying on a single central DNS zone creates a global single point of failure. I would implement Anycast networking combined with the Border Gateway Protocol (BGP): every region advertises the same IP address to the internet backbone, and routers forward packets to the topologically closest healthy data center. If a region goes down, its BGP routes are withdrawn and the internet re-routes traffic around the failing cluster.

Running `docker exec -it` modifies the container's filesystem layer and destroys forensic evidence. I would avoid the container APIs entirely and SSH to the underlying Linux worker node instead. There I would use `nsenter` to enter the compromised container's namespaces (network, PID, mount), run `tcpdump` to capture the raw outgoing C2 connections, and use `perf` or `strace` to observe the malicious binary — all without dropping any tooling inside the container.

At massive scale the bottleneck is usually etcd, the database backing the kube-apiserver. I would monitor `etcd_disk_wal_fsync_duration_seconds`; if disk I/O consistently exceeds ~10 ms, the whole cluster degrades. I would migrate the etcd nodes onto bare-metal NVMe storage, and tune API Priority and Fairness (APF) on the apiserver to throttle noisy-neighbor operators before they exhaust the control plane.

The microservice lacks proper offset handling (and likely a dead letter queue). I would disable Kafka auto-commit: the service must commit the message offset to the broker only after the database transaction has succeeded. If the database crashes, the offset remains uncommitted, so the service keeps retrying the same message until the backend recovers.
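Pseudocode sketch of the at-least-once consume loop (client API names vary by library):

```
consumer = KafkaConsumer(enable_auto_commit = false)   // auto-commit off
consumer.subscribe("orders")
loop:
    msg = consumer.poll()
    try:
        db.write(msg)                  // the transaction must succeed first
        consumer.commit(msg.offset)    // only then advance the offset
    catch DatabaseDown:
        // no commit -> the same message is redelivered on the next poll/restart
```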

A static Golang binary compiles down to machine code with zero external C-library dependencies, so I can package it in a `FROM scratch` Docker image of a few megabytes with a tiny attack surface — no shell, no package manager. The JVM, by contrast, requires full operating-system layers, garbage-collection overhead, large memory pre-allocation, and slower container start-up (cold starts), demanding far heavier Kubernetes resource tuning than Go.

The paramount challenge is 'data gravity'. Synchronizing stateless application containers across clouds is straightforward with Terraform and GitOps; mirroring petabytes of stateful database traffic across providers is not — it brings heavy egress bandwidth costs and hard latency limits. Seamless failover requires carefully tuned asynchronous CDC (Change Data Capture) replication pipelines spanning both providers, with explicit handling of eventual-consistency conflicts during split-brain scenarios.

I would not give human users cluster API credentials. Instead I would build an Internal Developer Platform (IDP), for example on Backstage. The developer submits their Helm configuration via Git commits to an isolated repository; ArgoCD, running under a confined RBAC ServiceAccount, pulls the commit and attempts the deployment. If the chart requests privileged SecurityContexts, an OPA Gatekeeper policy denies it before it ever reaches the kubelet. The developer never touches the cluster directly.

I would migrate away from raw shell logic. One option is to encapsulate the core logic in a Go CLI binary: Go cross-compiles cleanly to each target OS and architecture with no runtime dependencies. Alternatively, wrap the deployment logic in a Docker image: the CI pipeline runs a transient Alpine container pinning the exact Python/Ansible versions needed, so the deployment behaves identically regardless of the underlying host OS.

Component redundancy changes the probability math of system failure. Assuming statistical independence, the probability of two 99.9%-available components (0.1% failure each) being down simultaneously is 0.001 × 0.001 = 0.000001, i.e. 0.0001%. Two clustered 99.9% components therefore yield roughly 99.9999% composite availability — far cheaper than chasing the same figure with a single machine — provided the load balancer in front of them is itself fault-tolerant.
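The arithmetic in miniature (independence of failures is the key assumption):

```python
p_fail = 1 - 0.999             # each component fails 0.1% of the time
p_both_down = p_fail ** 2      # both down simultaneously, assuming independence
availability = 1 - p_both_down
print(f"{availability:.6f}")   # 0.999999 -> roughly "six nines"
```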