A comprehensive list of Cloud Engineering interview questions covering compute, storage, databases, networking, security, and cloud architecture across major providers.
IaaS (Infrastructure as a Service): Provides virtualized computing resources over the internet. You manage: OS, runtime, middleware, data, applications. The provider manages: servers, storage, networking, virtualization. Examples: AWS EC2, Azure VMs, GCP Compute Engine.
PaaS (Platform as a Service): Provides a platform allowing customers to develop, run, and manage applications without managing infrastructure. You manage: applications and data. Examples: AWS Elastic Beanstalk, Google App Engine, Heroku.
SaaS (Software as a Service): Provides software applications over the internet on a subscription basis. You manage: just the data you put in. Examples: Gmail, Salesforce, Microsoft 365.
Block Storage: Divides data into fixed-sized blocks. Very fast, low-latency, and is used for databases and OS disks. No metadata beyond a block address. Examples: AWS EBS, Azure Managed Disks.
File Storage: Organizes data in a hierarchical folder structure (like a traditional file system). Multiple servers can mount and share the same filesystem. Examples: AWS EFS, Azure Files, GCP Filestore.
Object Storage: Manages data as flat objects, each with data, metadata, and a unique ID/key. Massively scalable and cost-effective. Best for backups, static assets, media files. Examples: AWS S3, Azure Blob, GCP Cloud Storage.
Public Subnet: A subnet whose route table contains a route to an Internet Gateway (IGW). Resources in a public subnet (e.g., load balancers, bastion hosts) can send and receive traffic directly from the internet.
Private Subnet: A subnet with no direct route to the Internet Gateway. Resources in a private subnet (e.g., databases, application servers) cannot be directly accessed from the internet, providing an extra layer of security. They can still reach the internet for things like updates by routing outbound traffic through a NAT Gateway located in a public subnet.
The Shared Responsibility Model defines what the Cloud Provider is responsible for versus what the Customer is responsible for.
Provider ('Security OF the Cloud'): Physical security of data centers, hardware, hypervisor, and the global network infrastructure.
Customer ('Security IN the Cloud'): Everything built on top of the infrastructure — configuring security groups/firewalls, managing IAM permissions, encrypting data, patching guest operating systems, and securing their applications.
The division shifts based on service type: With IaaS (EC2), the customer has more responsibility. With PaaS (Elastic Beanstalk), the provider takes more. With SaaS (Gmail), the provider handles almost everything.
High Availability (HA): A system designed to minimize downtime and ensure continuous operation. HA systems accept that failures *will happen* but reduce their impact using redundancy and auto-recovery. Typical target: 99.9% to 99.99% uptime. Example: An Auto Scaling Group that automatically replaces a failed EC2 instance.
Fault Tolerance (FT): A system designed to continue operating *without interruption* even when components fail. There is zero downtime. This requires full hardware redundancy and is significantly more expensive. Example: NASA-style systems where a backup takes over in milliseconds.
Rule of thumb: HA minimizes downtime; FT eliminates it.
High Availability: Eliminate single points of failure by deploying resources across multiple Availability Zones (AZs). Use Auto Scaling Groups to automatically replace failed instances, and place a Load Balancer in front to distribute traffic.
Disaster Recovery: Prepare for catastrophic regional failures. DR strategies range from cheapest-to-slowest to most expensive-but-instant:
1. Backup & Restore: Periodic backups to S3. High RTO/RPO.
2. Pilot Light: Core infrastructure always on in a second region, scales up during DR.
3. Warm Standby: A scaled-down but fully functional copy is always running in a second region.
4. Multi-Site Active/Active: Both regions serve traffic simultaneously. Near-zero RTO/RPO.
Load Balancer (Layer 4/7): Its primary job is to distribute incoming traffic across healthy backend targets to ensure availability and scalability. ALBs can route based on HTTP path/headers.
API Gateway (Layer 7): A specialized service for API management that typically sits in front of a load balancer. Beyond routing, it provides: authentication/authorization, rate limiting & throttling, request/response transformation, API versioning, caching, and a centralized entry point for all client-to-service communication.
A common architecture: Client → API Gateway → Load Balancer → Application Servers.
ECS (Elastic Container Service): AWS's own proprietary container orchestration service. It's simpler to set up and deeply integrated with other AWS services (IAM, ALB, CloudWatch). No Kubernetes expertise required. Best for teams already invested in AWS who want simplicity.
EKS (Elastic Kubernetes Service): A managed Kubernetes service. It runs a standard, upstream Kubernetes control plane, so workloads are portable to any Kubernetes cluster (on-prem, other clouds). Best for teams with existing Kubernetes expertise, or those who need multi-cloud portability and advanced Kubernetes features.
The AWS Well-Architected Framework provides best practices across six pillars:
1. Operational Excellence: Run and monitor systems, gain insights, and continuously improve processes.
2. Security: Protect data, systems, and assets using risk assessments and security strategies.
3. Reliability: Ability to recover from failures and acquire computing resources to meet demand.
4. Performance Efficiency: Use computing resources efficiently to meet system requirements as demand changes.
5. Cost Optimization: Understand and control costs, and eliminate unnecessary spending.
6. Sustainability: Minimize environmental impacts of running cloud workloads.
Sharding is a horizontal scaling technique where a large database is partitioned into smaller, more manageable chunks called 'shards'. Each shard holds a subset of the total data and runs on a separate database server instance.
For example, a user database could be sharded by user ID: users 1-1,000,000 go to Shard 1, users 1,000,001-2,000,000 go to Shard 2, etc.
When to use: When a single database server can no longer handle the read/write throughput or storage requirements, and vertical scaling (bigger server) is no longer cost-effective or feasible.
Challenges: Cross-shard queries become complex, and resharding (when a shard grows too large) is very difficult.
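The range-based scheme above can be sketched in a few lines. This is a minimal illustration, assuming the hypothetical shard layout described (fixed 1,000,000-user ranges), not a production routing layer:

```python
# Range-based sharding sketch: map a user ID to its shard number.
# Users 1..1,000,000 -> shard 1, users 1,000,001..2,000,000 -> shard 2, etc.
SHARD_SIZE = 1_000_000

def shard_for_user(user_id: int) -> int:
    """Compute which shard holds a given user's row."""
    return (user_id - 1) // SHARD_SIZE + 1

print(shard_for_user(1))          # shard 1
print(shard_for_user(1_000_000))  # shard 1 (last user of the first range)
print(shard_for_user(1_000_001))  # shard 2
```

Note that simple range sharding makes the resharding challenge above concrete: changing `SHARD_SIZE` remaps almost every user, which is why hash- or directory-based shard maps are often preferred.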
Traditional Perimeter Security: Operates on the assumption 'trust but verify inside the network.' Once inside the corporate firewall/VPN, users and devices are trusted by default. This model fails when an attacker breaches the perimeter or when a malicious insider is involved.
Zero Trust: Operates on the principle 'never trust, always verify.' No user, device, or service is trusted by default, regardless of whether they are inside or outside the network perimeter.
Key principles:
1. Verify explicitly (authenticate and authorize every request based on identity, device health, location, etc.).
2. Use least-privilege access (limit access to only what is needed).
3. Assume breach (design systems as if the attacker is already inside).
Both are automation tools, but they serve different purposes:
Terraform (IaC — Infrastructure Provisioning): Declarative tool used to provision and manage infrastructure resources (VMs, networks, databases, DNS records). You describe the desired end state and Terraform figures out how to get there. It excels at creating, modifying, and destroying cloud resources.
Ansible (Configuration Management): Procedural/declarative tool used to configure and manage software *on* existing servers. Once Terraform has created a VM, Ansible can be used to install packages, configure services, manage users, deploy applications, etc.
Common pattern: Use Terraform to provision an EC2 instance, then use Ansible to configure it.
In IaaS, the cloud provider manages the physical hardware, hypervisor, and network, but I am responsible for managing the operating system, middleware, runtime, and the application. In PaaS, the provider goes a step further and manages the OS, runtime, and middleware as well, allowing me to focus entirely on writing and deploying the application code.
High Availability means designing the system to minimize downtime and avoid a single point of failure. Practically, for virtual machines, it means deploying multiple instances of the application across different physical racks (Availability Sets) or separate data centers within a region (Availability Zones) behind a load balancer, so if one VM or zone fails, the others continue serving traffic.
This is primarily an example of elasticity. Scalability is the overarching ability of a system to grow over time to handle increased workload. Elasticity is the specific, dynamic, and automated ability of a cloud system to scale resources out (up) and rapidly scale them back in (down) to match fluctuating traffic demands in real time.
Subnets (Subnetworks) handle this. I would create a Virtual Private Cloud (VPC/VNet), create a public subnet for the web servers (with internet routing), and a private subnet for the database servers (without internet routing). The web servers can communicate with the databases internally, but the databases are blocked from inbound internet access.
I would choose Object Storage (like AWS S3 or Azure Blob Storage). Object storage is specifically designed to store vast amounts of unstructured data (like PDFs, images, videos) efficiently. It stores the data alongside customizable metadata (like the invoice_id), making it easily retrievable via HTTP REST APIs.
I shouldn't do this because it violates the Principle of Least Privilege. This principle dictates that a user, program, or process should be granted only the minimum level of access or permissions necessary to perform their specific job function. I should determine exactly what resources they need to see and grant read-only access to only those specific resources.
I should choose a Relational (SQL) database (like PostgreSQL or MySQL). SQL databases are specifically optimized for strictly structured data, complex relational queries (JOINS), and maintaining absolute data integrity through ACID (Atomicity, Consistency, Isolation, Durability) transactions.
The 'Pay-As-You-Go' (or consumption-based) pricing mechanism caused this. In the cloud, you are billed for compute resources as long as they are running, regardless of whether you are actively using them. To prevent this, I should automate the shutdown of non-production environments during off-hours using scripts or native cost management scheduling tools.
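The scheduling logic behind such automation is simple. Here is a minimal sketch of the shutdown decision, assuming a hypothetical policy (non-production instances run 08:00–20:00 on weekdays only); real setups would wire this into a scheduler or the provider's native cost-management tooling:

```python
from datetime import datetime

# Hypothetical off-hours policy: prod always runs; dev/test runs
# only on weekdays between 08:00 and 20:00.
def should_be_running(now: datetime, env: str) -> bool:
    if env == "prod":
        return True
    is_weekday = now.weekday() < 5   # Mon=0 .. Fri=4
    return is_weekday and 8 <= now.hour < 20

print(should_be_running(datetime(2024, 1, 8, 10), "dev"))   # Monday 10:00 -> True
print(should_be_running(datetime(2024, 1, 8, 23), "dev"))   # Monday 23:00 -> False
print(should_be_running(datetime(2024, 1, 13, 10), "dev"))  # Saturday     -> False
```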
Containerization (like Docker) solves this by packaging the application code together with its exact dependencies, libraries, and runtime within a standardized, isolated unit (the container). Because the container carries its own environment, it will run identically regardless of whether it is on the developer's laptop or the production server.
No, it does not mean there are no servers. There are always physical servers running code. 'Serverless' is a cloud computing execution model where the cloud provider completely abstracts away server provisioning, management, and maintenance from the developer. The provider dynamically manages the allocation of machine resources, and you only pay for the exact milliseconds your code executes.
The primary benefit is independent deployments and decoupled lifecycles. In a monolith, any small change requires redeploying the entire application. In a microservices architecture, small, autonomous teams can develop, test, and deploy individual services (like the 'cart' service or 'checkout' service) independently of one another, vastly increasing deployment velocity and reducing release risk.
RTO (Recovery Time Objective) is the maximum acceptable amount of downtime before the application must be back up and running after a disaster (e.g., 'We must be back online in 4 hours'). RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, measured in time, denoting the point back in time to which data must be recovered (e.g., 'We can lose at most 15 minutes of transactional data').
I should introduce an in-memory caching layer (like Redis or Memcached), placed between the application servers and the database. The application checks the cache first; on a cache hit, it returns the data instantly, completely bypassing the database. On a cache miss, it queries the database and populates the cache for the next request.
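The cache-aside flow just described can be sketched with a plain dict standing in for Redis/Memcached (the `db` and key names here are illustrative only):

```python
# Cache-aside sketch: dict as a stand-in for an in-memory cache.
cache = {}
db = {"user:42": {"name": "Ada"}}   # stand-in for the database
db_hits = 0                         # counts how often we actually hit the DB

def get(key):
    global db_hits
    if key in cache:                # cache hit: bypass the database entirely
        return cache[key]
    db_hits += 1                    # cache miss: query the database...
    value = db[key]
    cache[key] = value              # ...and populate the cache for next time
    return value

get("user:42")
get("user:42")
print(db_hits)   # 1 -- the second read was served from the cache
```

A real implementation would also set a TTL on cached entries and invalidate them on writes, which is where most cache-related bugs live.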
A Content Delivery Network (CDN) or a Global Load Balancer solves this. A CDN will cache the static assets (images, CSS, JS) at edge locations close to the European users. Alternatively, a Global Load Balancer (like AWS Route53 or Azure Front Door) can route the users to completely separate instances of the application deployed in European data centers (geo-routing).
Because Terraform tracks the 'desired state' configuration versus the real-world state, it will detect this manual change as configuration 'drift'. The next time `terraform apply` is run, Terraform will flag the discrepancy and attempt to revert the security group rule back to the state defined in its configuration code, overwriting the junior engineer's manual change.
I must decouple the transcoding process using a message queue (like RabbitMQ, AWS SQS, or Azure Service Bus). When the user uploads the video, the backend immediately stores the raw file and drops a 'transcode job' message onto the queue, instantly returning a 'processing' response to the user. Independently, a pool of background worker servers pulls jobs from the queue, transcodes the video asynchronously, and notifies the user upon completion.
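The decoupling above can be demonstrated with Python's standard-library `queue` in place of SQS/RabbitMQ; this is a single-process sketch of the pattern, not a distributed implementation:

```python
import queue
import threading

# Decoupled transcoding sketch: the upload handler enqueues a job and
# returns immediately; a background worker processes it asynchronously.
jobs = queue.Queue()
results = []

def upload_handler(video_id: str) -> str:
    jobs.put({"video_id": video_id})   # drop a 'transcode job' on the queue
    return "processing"                # respond to the user instantly

def worker():
    while True:
        job = jobs.get()
        if job is None:                # sentinel: shut the worker down
            break
        results.append(f"transcoded:{job['video_id']}")  # stand-in for transcoding
        jobs.task_done()

t = threading.Thread(target=worker)
t.start()
print(upload_handler("v1"))   # 'processing' -- user is not blocked
jobs.join()                   # wait for completion (for this demo only)
jobs.put(None)
t.join()
print(results)                # ['transcoded:v1']
```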
A Container Orchestration platform, most commonly Kubernetes, provides this. Kubernetes abstracts the underlying infrastructure, handles container deployment, auto-scaling, load balancing across containers, self-healing (restarting failed containers), and manages complex networking between the various microservice instances.
Data Partitioning or Sharding. If it's historical data, I would implement horizontal partitioning (Table Partitioning) by dividing the large tables based on a time key (e.g., partitioning by month), allowing the engine to scan only relevant chunks. If the issue is extreme read/write scale across distinct customers, I might use Sharding—splitting the entire database across separate physical database servers based on a shard key (like customer_id).
A safe CI/CD pipeline starts when a developer commits code to the main branch. The CI server (like GitHub Actions/Jenkins) detects the change, builds the source code, runs automated unit and integration tests. If tests pass, it packages the built artifact into a Docker container and pushes it to a Container Registry. The CD component then triggers, updating the Kubernetes deployment manifest to point to the new image tag, performing a rolling update to production without downtime.
An API Gateway pattern solves this. The API Gateway acts as a single, centralized entry point for all clients. It sits in front of the microservices and handles cross-cutting concerns uniformly, such as SSL termination, consolidated authentication/authorization, rate limiting, and request routing to the appropriate internal backend microservices.
I would architect a hybrid cloud solution utilizing a dedicated, private network connection like AWS Direct Connect or Azure ExpressRoute. This connects the on-prem data center directly to the cloud provider's network edge via a fiber-optic link, completely bypassing the public internet. This ensures consistent, predictable, low-latency performance and high security required for sensitive financial database transactions.
I would implement GitOps using a tool like ArgoCD or Flux. In GitOps, the Git repository is the single source of truth for declarative infrastructure and applications. Instead of pushing changes via `kubectl`, ArgoCD continuously monitors the Git repository. When a pull request is merged updating an application manifest, ArgoCD automatically detects the drift, pulls the changes, and synchronizes the Kubernetes cluster state to perfectly match what is defined in Git.
I would implement the Circuit Breaker pattern within Service A. The circuit breaker monitors for failures. If Service B fails a specified number of times, the circuit 'trips' (opens). Subsequent calls from Service A to Service B immediately fail fast, returning a fallback response instantly rather than waiting for timeouts. This preserves Service A's connection threads, preventing cascading failure. The circuit periodically sends test requests (half-open) to see if Service B has recovered before fully closing again.
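A minimal circuit breaker can be sketched as below, assuming hypothetical thresholds (trip after 3 consecutive failures, retry after a 30-second cooldown); production systems would use a hardened library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: trips open after `threshold` consecutive
    failures, fails fast while open, and half-opens after `cooldown`."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()        # open: fail fast, no waiting on timeouts
            self.opened_at = None        # half-open: let one test request through
        try:
            result = fn()
            self.failures = 0            # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            return fallback()

breaker = CircuitBreaker()

def flaky():                             # stand-in for a call to Service B
    raise TimeoutError("Service B is down")

for _ in range(5):
    print(breaker.call(flaky, fallback=lambda: "cached response"))
```

After the third failure the breaker opens, so calls 4 and 5 return the fallback immediately instead of burning a connection thread on a timeout.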
I would implement a Zero Trust Network Access (ZTNA) model using an Identity-Aware Proxy (like Google IAP, AWS Verified Access, or Cloudflare Access). The proxy sits in front of the internal tools. Authentication is offloaded to a central Identity Provider (IdP) enforcing Multi-Factor Authentication. The proxy evaluates 'context' (user identity, device health, location) against granular policies for every single request before allowing access, totally eliminating the concept of a 'trusted internal network'.
Observability is missing, specifically Distributed Tracing. Traditional monitoring tells you 'if' a system is failing; observability tells you 'why'. In a complex distributed system, a single user request spans many functions, databases, and queues. Distributed tracing (like OpenTelemetry, Jaeger, or X-Ray) injects a unique trace ID into the request headers, following its path through every component, allowing me to pinpoint exactly which specific downstream microservice or query is bottlenecking the entire transaction.
I would implement the Event Sourcing pattern. Instead of storing just the current state (the balance), the database stores an immutable sequence of state-changing events (e.g., 'Deposited $50', 'Withdrew $10'). The current balance is derived by replaying these events. This provides an undisputed audit trail, allows for 'time-travel' debugging, and prevents data loss from accidental overwrites inherent in traditional CRUD operations.
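The derive-by-replay idea is easy to show concretely. This sketch uses the deposit/withdraw events from the answer above (the event shapes are illustrative):

```python
# Event-sourcing sketch: the balance is never stored directly;
# it is derived by replaying an immutable event log.
events = [
    {"type": "Deposited", "amount": 50},
    {"type": "Withdrew",  "amount": 10},
    {"type": "Deposited", "amount": 25},
]

def balance(event_log):
    total = 0
    for e in event_log:
        total += e["amount"] if e["type"] == "Deposited" else -e["amount"]
    return total

print(balance(events))       # 65 -- current state
print(balance(events[:2]))   # 40 -- 'time travel' to an earlier state
```

Replaying a prefix of the log is exactly the 'time-travel' debugging capability mentioned above, and the audit trail falls out for free because events are append-only.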
For highly predictable, steady-state, 24/7 workloads, I must utilize Committed Use Discounts (like AWS Reserved Instances/Savings Plans, or Azure Reserved VM Instances). By committing to a specific volume of compute usage for a 1-year or 3-year term, the enterprise receives a massive discount (often 40-70% off) compared to standard on-demand pricing, drastically reducing the baseline cloud spend.
This requires a DevSecOps pipeline. Initially, Container Scanning tools integrated into the Container Registry should have alerted us to the CVE. Procedurally, we patch the vulnerability in the centralized 'golden' base image repository. The automated CI/CD pipelines for all 50 microservices are triggered—either via automated dependency update bots (like Dependabot) or webhooks—forcing them to pull the new secure base image, rebuild, run unit tests, and perform rolling deployments via Kubernetes without manual intervention.
I require a globally distributed, multi-master NoSQL architecture (like Cassandra, Riak, or a cloud-native solution like Cosmos DB or AWS DynamoDB Global Tables). Multi-master allows writes to be ingested locally at the nearest regional node for sub-millisecond latency. To ensure users immediately see their updates, I would utilize 'Session Consistency', where the database guarantees read-your-own-writes for a specific client session while asynchronously replicating the data globally to achieve eventual consistency for everyone else.
I would implement a Service Mesh (like Istio or Linkerd). The service mesh injects a lightweight 'sidecar' proxy alongside every microservice container within the pod. The application only communicates over localhost to the sidecar. The sidecars form the 'Data Plane', intercepting all ingress and egress traffic, and communicating with a centralized 'Control Plane'. The control plane enforces mTLS, complex routing, retries, and telemetry universally, completely decoupling these networking logic requirements from the application codebase.
I would utilize a CNCF-aligned control plane architecture, often leveraging tools like Crossplane built atop a central Kubernetes cluster. The IDP utilizes a Developer Portal (like Backstage) offering self-service catalogs. When a developer requests infrastructure, Crossplane acts as the universal cloud orchestration engine. It translates the declarative YAML into infrastructure via cloud provider APIs. Crucially, I engineer OPA Gatekeeper (Open Policy Agent) within this central control plane. Gatekeeper acts as an admission controller, dynamically evaluating the request against compliance logic (e.g., 'all databases must be encrypted') across AWS, Azure, and GCP homogeneously before allowing the provisioning to proceed.
I must implement strict Bulkhead patterns and asynchronous backpressure mechanisms natively. At the API Gateway, I'd partition connection pools by geographic/domain routing (the bulkhead) so a failure in the 'Driver Region A' pool cannot consume connections meant for 'Rider Region B'. Inside the mesh, I would replace synchronous wait-times with reactive, event-driven, non-blocking asynchronous architectures (like Reactive Extensions or Actor systems). When backends slow down, the system enforces backpressure, returning HTTP 429 Too Many Requests to clients instantly, shedding load proactively rather than queuing requests to the point of gateway memory exhaustion.
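The bulkhead-plus-load-shedding behaviour can be sketched with bounded semaphores standing in for per-region connection pools (the pool sizes and region names here are hypothetical):

```python
import threading

# Bulkhead sketch: each region gets its own bounded pool, so exhaustion
# in one pool cannot starve the other. Full pool -> shed load with a 429
# instead of queuing the request.
pools = {
    "driver-region-a": threading.BoundedSemaphore(2),
    "rider-region-b":  threading.BoundedSemaphore(2),
}

def handle(region: str) -> str:
    pool = pools[region]
    if not pool.acquire(blocking=False):   # no queuing: fail fast
        return "429 Too Many Requests"
    return "200 OK (slot held)"            # real code releases the slot when done

# Exhaust region A's pool; region B is unaffected.
print(handle("driver-region-a"))  # 200
print(handle("driver-region-a"))  # 200
print(handle("driver-region-a"))  # 429 -- A's bulkhead is full
print(handle("rider-region-b"))   # 200 -- B still has capacity
```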
A Data Mesh treats data as a product. I would architect isolated 'Data Domains', each utilizing their own purpose-built storage (e.g., a BigQuery dataset or S3 bucket specific to the Marketing domain). Each domain team is fully responsible for ingesting, cleaning, and publishing their data via standardized, discoverable interfaces (Data Products). To manage governance centrally without centralizing the data, I would implement a global Data Catalog and Data Sharing platform. The federated model allows the central compliance team to apply organizational policies (like PII masking or access controls) logically at the catalog layer, automatically enforcing them across all underlying domain-owned data products.
I would define strict Service Level Indicators (SLIs), such as the percentage of HTTP requests resolving < 200ms. I set the SLO at 99.9% over a rolling 28-day window. Mathematically, the Error Budget is the remaining 0.1% of allowable failures. If the error budget is positive (e.g., we are at 99.95% reliability), the CI/CD pipeline remains unlocked—developers deploy freely. If recent outages deplete the budget below 0 (e.g., we hit 99.8%), the deployment pipeline automatically hard-locks via policy configuration. Feature work halts entirely, forcing the sprint priorities to shift 100% to technical debt and reliability engineering until the rolling 28-day window pushes failures out, regenerating the budget.
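The error-budget arithmetic above is worth making explicit. A 99.9% SLO over a 28-day window leaves roughly 40 minutes of allowable downtime:

```python
# Error-budget math for a 99.9% SLO over a rolling 28-day window.
SLO = 0.999
window_minutes = 28 * 24 * 60            # 40,320 minutes in the window

budget_minutes = (1 - SLO) * window_minutes
print(round(budget_minutes, 2))          # 40.32 minutes of allowable downtime

def deployments_unlocked(measured_availability: float) -> bool:
    """Budget positive -> pipeline open; budget exhausted -> pipeline locks."""
    return measured_availability >= SLO

print(deployments_unlocked(0.9995))      # True  -- budget remains, deploy freely
print(deployments_unlocked(0.998))       # False -- budget spent, feature work halts
```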
Navigating the CAP theorem here requires sacrificing low latency for strong Consistency and Partition Tolerance (CP). The architecture involves a globally distributed, synchronously replicated database like Google Cloud Spanner or CockroachDB. To achieve RPO=0 (no data loss) during a regional failure, quorum-based commits (e.g., Paxos/Raft) are mandatory. A transaction initiated in the US must synchronously acknowledge writes to both Europe and Asia before returning success to the client. This deeply impacts latency (constrained by the speed of light). Therefore, I must aggressively architect the application tier to be purely stateless, utilizing asynchronous UX patterns to mask the physical database commit latency from the end-user.
To conquer connection pooling limits, I utilize an external proxy (like AWS RDS Proxy), maintaining a steady pool to the database while multiplexing thousands of ephemeral function connections dynamically. For VPC IP exhaustion, I adopt an architecture that isolates functions from VPCs unless they absolutely require private resources within them; otherwise, I leverage newer provider capabilities (like AWS Hyperplane ENIs) that drastically reduce per-function IP requirements. For cold starts at scale, I pre-warm specific critical path functions using Provisioned Concurrency models, ensuring compute environments are initialized and waiting ahead of telemetry spikes.
I engineer a custom Kubernetes Operator using the Operator framework or Kubebuilder. By writing Custom Resource Definitions (CRDs), I extend the Kubernetes API allowing users to declare desired states for my specific application (e.g., kind: MyDatabaseCluster). I write a custom Controller in Go that continuously runs a reconciliation loop. This controller contains my domain-specific human operational knowledge encoded into software—it watches the CRD, calculates the drift, and autonomously executes the complex, ordered clustering API calls necessary to achieve and maintain that custom stateful configuration.
I would architect a Secure Access Service Edge (SASE) model paired with a global proxy layer (Cloudflare or Fastly). This puts the security perimeter not around the VPC, but at the globally distributed edge nodes. The CDN edge performs L3/L4 DDoS mitigation. It subsequently passes through Edge WAF rulesets for L7 protection and Bot Management challenges. Crucially, I enforce strict mutual TLS (mTLS) between the Edge network and my origin ingress controllers, ensuring the origins accept traffic *only* if cryptographically validated as originating from the sanitized edge, completely dropping requests originating from anywhere else on the internet.
Kafka's 'Exactly-Once' semantics handle stream-processing guarantees internally, but end-to-end idempotency requires architectural rigor at the consumer. I design Service B and C to be inherently idempotent. When consuming the event, the service begins a localized ACID transaction. It first checks for the existence of the event's unique 'MessageID' (or Idempotency Key) in a specialized 'processed_events' database table. If present, it gracefully acknowledges processing and drops the duplicate message. If absent, it processes the business logic and inserts the MessageID in the same atomic transaction, ensuring massive replay spikes cause zero side-effects to the downstream business state.
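The dedupe-then-process step can be sketched as follows. Here a set and a list stand in for the 'processed_events' table and the downstream business state; in production, the existence check and the business write must share one ACID transaction:

```python
# Idempotent-consumer sketch: check the idempotency key, process only
# if unseen, and record the key in the same (simulated) atomic step.
processed_ids = set()   # stand-in for the 'processed_events' table
ledger = []             # stand-in for downstream business state

def consume(event):
    if event["message_id"] in processed_ids:
        return "duplicate-acked"            # replay: acknowledge and drop
    ledger.append(event["payload"])         # business logic...
    processed_ids.add(event["message_id"])  # ...plus the idempotency record
    return "processed"

evt = {"message_id": "m-1", "payload": "charge $10"}
print(consume(evt))   # processed
print(consume(evt))   # duplicate-acked -- replay causes zero side effects
print(ledger)         # ['charge $10'] -- the charge happened exactly once
```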
We re-architect by shifting from static centralized scraping to a scalable push/filtering topology at the edge. I deploy highly optimized agents (like OpenTelemetry Collector) as DaemonSets. These edge collectors perform aggressive aggregation, down-sampling, and dynamic filtering based on real-time rules before exporting data, dropping 90% of low-value 'debug' verbosity at the source. For metrics, I shard the ingestion layer utilizing horizontally scaling remote-write storage clusters (like Thanos or Cortex), enabling long-term storage of high-cardinality metrics across cheaper object storage tiers, rather than expensive block storage attached to single Prometheus instances.