AWS Interview Questions

Nail your next AWS interview with our extensive collection of questions on EC2, S3, VPC, IAM, Lambda, Serverless architectures, and AWS best practices.



Master your concepts with 54 hand-picked questions


On-Demand Instances let you pay for compute capacity by the hour or second with no long-term commitments. They are ideal for applications with short-term, spiky, or unpredictable workloads that cannot be interrupted.

Reserved Instances provide significant discounts (up to 75%) compared to On-Demand pricing. They require a commitment of 1 or 3 years and are best for steady-state workloads with predictable usage.

Spot Instances let you use spare Amazon EC2 capacity at up to a 90% discount compared to On-Demand pricing (you simply pay the current Spot price; the old bidding model has been retired). However, AWS can reclaim the instance with a two-minute warning when it needs that capacity back.

Best used for: batch processing jobs, big data analytics, CI/CD build agents, stateless web servers, and any workload that is fault-tolerant and can be interrupted.
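As a rough illustration of why the pricing model matters, here is a back-of-the-envelope comparison. The $0.096/hour figure and the discount percentages are assumptions for the sketch; real prices vary by instance type, Region, and commitment terms.

```python
# Hypothetical monthly cost for one instance under each pricing model.
# Rates below are ILLUSTRATIVE ONLY -- check the AWS pricing pages for real numbers.
HOURS_PER_MONTH = 730

on_demand_rate = 0.096                   # $/hour, hypothetical On-Demand price
reserved_rate = on_demand_rate * 0.40    # ~60% discount for a long-term commitment
spot_rate = on_demand_rate * 0.10        # ~90% discount, but interruptible

for name, rate in [("On-Demand", on_demand_rate),
                   ("Reserved", reserved_rate),
                   ("Spot", spot_rate)]:
    print(f"{name:>10}: ${rate * HOURS_PER_MONTH:,.2f}/month")
```

The gap compounds quickly across a fleet, which is why interruption-tolerant workloads belong on Spot and steady-state workloads on Reserved Instances or Savings Plans.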

You can secure an S3 bucket using multiple methods:

1. Use IAM Policies to grant specific users/roles access.

2. Apply Bucket Policies to restrict access across the entire bucket.

3. Enable Block Public Access to prevent accidental exposure.

4. Use S3 Access Control Lists (ACLs) for granular object-level permissions, though AWS now recommends keeping ACLs disabled and relying on policies instead.

5. Enable Server-Side Encryption (SSE) using AWS KMS or S3-managed keys to encrypt data at rest.

6. Enforce HTTPS via Bucket Policies to encrypt data in transit.
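Method 6 is typically enforced with a deny statement on insecure transport. A minimal sketch, assuming a placeholder bucket name `example-bucket`:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
```

Because an explicit Deny overrides any Allow, no identity can reach the bucket over plain HTTP even if its IAM policy would otherwise permit it.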

S3 Versioning is a feature that keeps multiple variants of an object in the same S3 bucket. When enabled, every time you overwrite or delete an object, S3 keeps the previous version(s).

You would enable it to:

1. Protect against accidental deletion or overwrites (act as a trash can).

2. Recover from unintended user actions or application failures.

3. Satisfy compliance requirements that mandate data retention.

Note: Once enabled, versioning cannot be disabled, only suspended.
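In CloudFormation, enabling versioning is a one-property change. A minimal sketch (the logical ID is a placeholder):

```yaml
Resources:
  ExampleBucket:
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled   # the only other valid value is Suspended
```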

Security Groups operate at the instance level (EC2). They are stateful, meaning if you allow traffic in, the return traffic is automatically allowed out. They only support 'allow' rules.

Network ACLs operate at the subnet level. They are stateless, meaning you must explicitly allow both inbound and outbound traffic. They support both 'allow' and 'deny' rules, making them useful for blocking specific rogue IP addresses.

VPC Peering is a networking connection between two VPCs that enables routing traffic between them using private IPv4 or IPv6 addresses. Instances in either VPC can communicate with each other as if they are within the same network.

Key limitations:

1. No transitive peering — if VPC A is peered with B, and B is peered with C, A cannot communicate with C through B.

2. No overlapping CIDR blocks are allowed between the peered VPCs.

3. VPC Peering is region-specific by default (though cross-region peering is supported).

4. Does not support edge-to-edge routing (e.g., through a VPN or Direct Connect).

An IAM User is a permanent identity for a person or application that needs long-term access to AWS resources. It has permanent credentials (password or access key).

An IAM Role is a temporary identity meant to be assumed by trusted entities (EC2 instances, Lambda functions, other AWS services, or users from another account). Roles do not have permanent credentials — they issue temporary security tokens via AWS STS (Security Token Service).

Best practice: Always prefer Roles for granting access to AWS services or cross-account access, and avoid using long-lived access keys for IAM Users.
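What makes a Role assumable is its trust policy. A minimal sketch of a trust policy allowing the EC2 service to assume the role (the permissions the role grants are attached separately as a permissions policy):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```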

The Principle of Least Privilege (PoLP) dictates that any user, application, or service should only be granted the minimum permissions required to perform its work — nothing more.

In AWS IAM, you implement this by:

1. Starting with no permissions (deny all by default) and granting only what is needed.

2. Using AWS Managed Policies as a baseline and creating Customer Managed Policies for fine-grained control.

3. Regularly reviewing and removing unused permissions using IAM Access Analyzer and IAM Credentials Report.

4. Using Service Control Policies (SCPs) in AWS Organizations to set permission guardrails.
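A least-privilege customer managed policy names exact actions and exact resources. A minimal sketch, assuming a placeholder bucket and prefix:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyReportsPrefix",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-bucket/reports/*"
    }
  ]
}
```

Note what is absent: no `s3:*`, no `"Resource": "*"`, and no write access. The identity can read objects under one prefix and nothing else.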

A 'cold start' occurs when AWS Lambda spins up a new execution environment to handle an invocation. This takes time, causing latency. It happens mostly when the function hasn't been invoked recently, or when scaling out to handle concurrent requests.

You can mitigate it by:

1. Using Provisioned Concurrency, which keeps a set number of execution environments initialized and ready to respond in double-digit milliseconds.

2. Optimizing package size and minimizing dependencies.

3. Using lightweight runtimes like Node.js or Python instead of JVM-based languages like Java; for Java specifically, Lambda SnapStart can significantly reduce cold-start latency.
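A further mitigation is structural: do expensive initialization at module scope so it runs once per cold start, not on every invocation. A runnable sketch of the pattern; the "connection" is simulated here so the snippet executes anywhere.

```python
# Standard Lambda pattern: expensive setup happens at module load
# (i.e. once per cold start), not inside the handler.
INIT_COUNT = 0

def expensive_init():
    """Stands in for creating SDK clients, opening DB connections, loading config."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"connected": True}

# Module scope: executed once per execution environment (per cold start).
connection = expensive_init()

def handler(event, context):
    # Warm invocations reuse `connection` without paying the init cost again.
    return {"statusCode": 200, "reused": connection["connected"]}

# Two invocations in the same environment still trigger only one init.
handler({}, None)
handler({}, None)
```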

Amazon RDS (Relational Database Service) is a managed service for relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.). It is ideal for structured data with complex queries and relationships. It supports SQL and ACID transactions.

Amazon DynamoDB is a fully managed NoSQL key-value and document database. It is ideal for applications requiring single-digit millisecond performance at any scale, such as gaming leaderboards, shopping carts, and IoT data.

Choose RDS for structured, relational data with complex queries. Choose DynamoDB for high-throughput, low-latency access patterns with a simple data model.

RTO (Recovery Time Objective) is the maximum acceptable time that a system can be offline after a disaster. It answers: 'How quickly must we recover?'

RPO (Recovery Point Objective) is the maximum acceptable amount of data loss measured in time. It answers: 'How much data can we afford to lose?'

For example, an RPO of 1 hour means your system must be backed up frequently enough that you never lose more than 1 hour's worth of data. Together, RTO and RPO drive your choice of disaster recovery strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-Site Active/Active).
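The relationship between backup timing and RPO can be sketched in a few lines; the dates and the 1-hour RPO target below are hypothetical.

```python
from datetime import datetime, timedelta

def data_at_risk(last_backup: datetime, failure: datetime) -> timedelta:
    """Data written after the last backup is lost; this window must stay <= RPO."""
    return failure - last_backup

rpo = timedelta(hours=1)                  # hypothetical 1-hour RPO target
last_backup = datetime(2024, 1, 1, 12, 0)

# Failure 40 minutes after the last backup: within the 1-hour RPO.
ok_40min = data_at_risk(last_backup, datetime(2024, 1, 1, 12, 40)) <= rpo

# Failure 3 hours after the last backup: RPO violated -- back up more often.
ok_3h = data_at_risk(last_backup, datetime(2024, 1, 1, 15, 0)) <= rpo
```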

AWS CloudFormation is an Infrastructure as Code (IaC) service that allows you to define your AWS infrastructure in a declarative template file (YAML or JSON). CloudFormation reads the template and provisions the resources in the correct order, handling dependencies automatically.

A Stack is a single unit of related AWS resources, all managed together as one template. If you need to delete your environment, you can delete the stack and all associated resources are cleaned up automatically. Stacks can be nested for complex architectures.
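A minimal illustrative template; the logical ID is a placeholder, and real templates would add properties, parameters, and more resources:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Minimal illustrative stack -- one bucket and one output.
Resources:
  AssetsBucket:
    Type: AWS::S3::Bucket
Outputs:
  BucketName:
    Value: !Ref AssetsBucket   # resolves to the generated bucket name
```

Deleting the stack deletes the bucket with it, which is exactly the "managed as one unit" behavior described above.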

ALB (Layer 7): Operates at the HTTP/HTTPS level. It can route traffic based on URL path, hostname, HTTP headers, and query strings. Best for microservices, container-based apps, and HTTP/HTTPS traffic.

NLB (Layer 4): Operates at the TCP/UDP/TLS level. It passes traffic through very quickly with ultra-low latency (handles millions of requests per second). Best for extreme performance requirements, gaming applications, and non-HTTP TCP protocols.

A systematic cost optimization approach:

1. Use AWS Cost Explorer and AWS Trusted Advisor to identify idle/underutilized resources.

2. Find unattached EBS volumes and Elastic IP addresses (they still cost money when idle).

3. Right-size EC2 instances using CloudWatch metrics (CPU, memory utilization).

4. Convert On-Demand instances for predictable workloads to Reserved Instances or Savings Plans.

5. Use S3 Intelligent-Tiering or Lifecycle Policies to move infrequently accessed data to cheaper storage tiers.

6. Delete unused snapshots, old AMIs, and unused load balancers.

A Region is a separate geographic area (e.g., us-east-1) containing multiple, isolated Availability Zones. An Availability Zone (AZ) is one or more discrete data centers within a Region. Regions provide global distribution, while AZs provide high availability within that geographic area.

An IAM User is an entity that represents a person or application with long-term credentials (password or access keys). A Role has short-term, temporary credentials and is designed to be assumed by trusted entities, such as an EC2 instance needing access to S3, without hardcoding keys.

Amazon S3 is object storage used for unstructured data, backups, and static website hosting because it is highly scalable and accessible via API over the internet. EBS is block storage representing a virtual hard drive that must be attached to a running EC2 instance to store its operating system and active file systems.

An Elastic Load Balancer (ELB) automatically distributes incoming application traffic across multiple targets, such as EC2 instances, containers, or Lambda functions. It ensures no single instance is overwhelmed, providing fault tolerance.

A Security Group acts as a stateful, virtual firewall for your EC2 instances to control incoming and outgoing traffic. You must explicitly allow inbound traffic (e.g., allow port 443 for HTTPS); all inbound traffic is blocked by default.

Route 53 is AWS's highly available and scalable cloud Domain Name System (DNS) web service. It translates human-readable domain names (like www.example.com) into IP addresses and handles global traffic routing.

EC2 Auto Scaling monitors your applications and automatically adds or removes EC2 instances dynamically based on predefined conditions (like CPU utilization) to maintain steady, predictable performance at the lowest possible cost.

The AWS Free Tier lets customers explore and try out AWS services free of charge up to specified limits, either for 12 months, as short-term trials, or indefinitely (the Always Free tier).

Amazon RDS is a managed service that makes it easy to set up, operate, and scale a relational database (like MySQL, PostgreSQL) in the cloud by automating patching, backups, and hardware provisioning.

AWS Lambda is a serverless compute service that lets you run back-end code without provisioning or managing any underlying servers. You only pay for the exact compute time consumed while your code is running.

First, verify the RDS Security Group allows inbound traffic on the database port (e.g., 3306) specifically from the EC2 instance's Security Group ID. Second, verify the EC2 instance is deployed in a subnet that has routing access to the database subnet. Third, ensure NACLs are not blocking the traffic.

I would attach an IAM Role to the EC2 instance granting `s3:GetObject` permissions to that specific bucket's ARN. Then, to strictly enforce it, I would add a Bucket Policy to the S3 bucket explicitly denying all access unless the request originates from that specific IAM Role's ARN.

I would use Amazon EventBridge (formerly CloudWatch Events) to schedule a cron expression that triggers an AWS Lambda function. The Lambda function would contain the script, executing instantly and scaling transparently.
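A sketch of that wiring in CloudFormation, assuming a hypothetical `CleanupFunction` Lambda defined elsewhere in the same template (an `AWS::Lambda::Permission` granting `events.amazonaws.com` invoke rights is also needed):

```yaml
Resources:
  NightlyCleanupRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: cron(0 2 * * ? *)   # every day at 02:00 UTC
      State: ENABLED
      Targets:
        - Arn: !GetAtt CleanupFunction.Arn    # hypothetical Lambda resource
          Id: nightly-cleanup-target
```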

An ALB operates at Layer 7, making routing decisions based on HTTP/HTTPS headers, paths, or query strings, ideal for microservices. An NLB operates at Layer 4 (TCP/UDP), handling millions of requests per second with ultra-low latency, ideal for pure connection-based routing.

I would use AWS Systems Manager (SSM) Session Manager. It allows secure, auditable, browser-based or CLI interactive shell access to the instance without opening inbound port 22 or managing SSH key pairs.

Standard is designed for frequently accessed data. Intelligent-Tiering automatically monitors access patterns and moves objects between frequent, infrequent, and archive access tiers without operational overhead or retrieval fees, saving money on data with unknown access patterns.

Immediately deactivate or delete the compromised IAM access keys in the AWS Console. Rotate the keys, have the developer update their local configurations, review AWS CloudTrail for API calls made with the compromised keys to identify the blast radius, and purge the secret from the GitHub history.

An IGW is attached to a Public Subnet allowing two-way internet routing, enabling the instance to receive inbound public traffic. A NAT Gateway is placed IN the Public Subnet but used by instances in a Private Subnet strictly to initiate outbound internet requests (like downloading patches) without exposing themselves inbound.

I would use Amazon Data Lifecycle Manager (DLM) or AWS Backup to create automated snapshot policies. These policies would define the daily schedule and retention rules (e.g., keep the last 7 days of snapshots), ensuring compliance without manual cron jobs.

I would upload the static assets to an S3 bucket configured for web hosting. Then, I would provision an Amazon CloudFront distribution, point the origin to the S3 bucket, configure an Origin Access Control (OAC) to secure the bucket, and attach a free ACM SSL certificate to CloudFront.

First, I use RDS Performance Insights and enable Enhanced Monitoring to identify the exact SQL query causing the CPU spike. Often, it's an unindexed table scan. If the query cannot be optimized or indexed, and it is overwhelmingly read-heavy, I would provision an RDS Read Replica and update the application code to route SELECT queries to the replica, offloading the Primary.

I would implement a VPC Gateway Endpoint for Amazon S3 and update the VPC route tables to send all S3 traffic through it. Finally, I would implement a strict S3 Bucket Policy using the `aws:SourceVpce` condition key, denying any access that does not flow through that specific VPC Endpoint ID.
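A sketch of that bucket policy, with placeholder bucket and endpoint IDs. Be careful: once applied, even console access from outside the endpoint is denied.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideVpcEndpoint",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": { "StringNotEquals": { "aws:SourceVpce": "vpce-0123456789abcdef0" } }
    }
  ]
}
```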

I would use AWS Cloud Map for service discovery. As Fargate tasks spin up dynamically, they register their ephemeral IPs with Cloud Map. Other services can then resolve them via internal DNS namespaces (e.g., `backend.local.net`). To secure communication, I would employ AWS App Mesh to handle mTLS encryption and intelligent retries between the containers.

Lambda functions can scale concurrently to thousands of instances, directly exhausting Postgres/MySQL connection limits. I would implement Amazon RDS Proxy. The proxy sits between Lambda and RDS, pooling and multiplexing the connections securely, preserving the database memory and preventing connection exhaustion.

I would use the Elastic Volumes feature. Via the AWS Console or CLI, I can dynamically modify the EBS volume from `gp2` to `gp3` and provision higher IOPS/throughput explicitly, or move to `io2` for the most demanding requirements. This modification works while the volume is in use, without impacting the running OS.

I would route all IoT messages directly into an Amazon Kinesis Data Stream, which natively buffers massive throughput. A fleet of AWS Lambda functions would consume batches of messages from the Kinesis stream. The Lambdas would format the data and perform batch inserts into an Amazon DynamoDB table, preventing database throttling while guaranteeing data durability.
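One detail worth knowing here: DynamoDB's `BatchWriteItem` accepts at most 25 items per request, so the Lambda must re-chunk large Kinesis batches before writing. A pure-Python sketch of that chunking (the real consumer would hand each batch to boto3, which is omitted so the snippet runs anywhere):

```python
from typing import Iterable, Iterator, List

# DynamoDB's BatchWriteItem limit is 25 items per request.
DYNAMODB_BATCH_LIMIT = 25

def chunk(items: Iterable[dict], size: int = DYNAMODB_BATCH_LIMIT) -> Iterator[List[dict]]:
    """Re-chunk an arbitrarily large Kinesis batch into DynamoDB-sized writes."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

records = [{"device_id": i, "reading": i * 0.5} for i in range(60)]  # 60 fake IoT messages
batches = list(chunk(records))  # 25 + 25 + 10
```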

I would enable VPC Flow Logs to trace the specific IP addresses generating cross-AZ traffic. Usually, it's chattiness between a web tier in AZ-A and a log aggregation service or database in AZ-B. I would mitigate this by ensuring services heavily utilize intra-AZ routing logically, enabling ALB cross-zone load balancing optimizations, or deploying caching layers localized strictly within the same AZ.

I should have applied a `DeletionPolicy: Retain` or `Snapshot` directly on the RDS resource within the CloudFormation template. This ensures that even if the stack is deleted, the database instance is left intact or a final snapshot is natively taken before deletion.
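An abbreviated sketch of where the attribute lives. Note that `DeletionPolicy` sits at the resource level, beside `Properties`, not inside it; the properties shown are placeholders, not a complete instance definition.

```yaml
Resources:
  AppDatabase:
    Type: AWS::RDS::DBInstance
    DeletionPolicy: Snapshot   # take a final snapshot if the stack is deleted
    # DeletionPolicy: Retain   # alternative: leave the instance running
    Properties:
      Engine: postgres
      DBInstanceClass: db.t3.micro
      AllocatedStorage: "20"
```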

I would upload the file to S3, triggering an S3 Event Notification to AWS Step Functions or an AWS Batch job. AWS Batch would dynamically spin up an EC2 Spot instance with sufficient RAM, stream the file process block-by-block, write the results to a target datastore, and terminate the instance.

I would use AWS Organizations Service Control Policies (SCPs) to explicitly deny the `ec2:RunInstances` or `ec2:CreateVolume` actions if the `ec2:Encrypted` condition boolean is false. Alternatively, I would simply enable the 'EBS Encryption by Default' toggle globally at the regional account level.
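A minimal sketch of such an SCP; the statement can be extended to cover `ec2:RunInstances` with the volume-related condition keys as needed:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnencryptedEBS",
      "Effect": "Deny",
      "Action": "ec2:CreateVolume",
      "Resource": "*",
      "Condition": { "Bool": { "ec2:Encrypted": "false" } }
    }
  ]
}
```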

I would use an Amazon Aurora Global Database, which replicates asynchronously at the storage layer with typically sub-second lag, as the core state. The compute tier would run stateless microservices in EKS clusters across us-east-1 and eu-west-1. Route 53 latency-based routing combined with AWS Global Accelerator anycast IPs would route users to the healthiest, nearest region. DynamoDB Global Tables could handle fast user-session caching.

I would implement AWS PrivateLink. In the destination VPC, I would create an internal Network Load Balancer (NLB) fronting the API and expose it as a VPC Endpoint Service. In the source VPC, I would instantiate an Interface VPC Endpoint. This securely maps the destination API to private IP addresses entirely valid within the source's CIDR, completely neutralizing the IP overlap issue without complex NAT masquerading.

I would leverage AWS Systems Manager (SSM) across the fleet via AWS Organizations. First, I would use AWS Config to identify the exact non-compliant, drifted AMIs. Then, I would use SSM Run Command or SSM Automation documents to target the fleet concurrently via Resource Groups, executing the shell script to upgrade the package without distributing SSH keys or opening firewall ports.

Multitenancy on EKS requires isolation at several layers. I would assign dedicated Kubernetes Namespaces per tenant. For compute isolation, I would use Taints/Tolerations to pin tenant Pods to specific Node Groups. For network isolation, I would enforce strict NetworkPolicies (e.g., via Calico) denying cross-namespace traffic. For IAM, I would use IAM Roles for Service Accounts (IRSA), mapping each tenant's Kubernetes Service Account to an AWS IAM Role with least-privilege KMS and S3 policies scoped to that tenant's data.

DynamoDB Global Tables are replicated multi-master across regions, fundamentally prioritizing Availability and Partition tolerance (AP). Under global partitions, they rely on eventual consistency and 'last-writer-wins' conflict resolution. Aurora Multi-Master operates synchronously within a single region, leaning towards Consistency and Partition tolerance (CP), providing stronger transactional guarantees but sacrificing availability if the synchronous quorum layer experiences extreme latency.

I would rapidly deploy AWS WAF (Web Application Firewall) attached to the ALB and enable AWS Managed Rules to catch known botnets. For the zero-day HTTP flood, I would configure rate-based rules in WAF that trigger when requests from a single IP exceed a strict threshold. I would also engage AWS Shield Advanced to gain proactive traffic engineering from the AWS Shield Response Team.

Standard SQS guarantees at-least-once delivery, which can duplicate events. To guarantee ordering, I must utilize an SQS FIFO queue utilizing strict MessageGroupId routing. To guarantee Exactly-Once processing, AWS SQS FIFO handles deduplication based on a 5-minute deduplication ID window. However, because Lambdas can internally fail post-processing but pre-deletion, the true guarantee relies entirely on writing the Lambda logic to execute idempotently against a transactional datastore like DynamoDB.
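The idempotency requirement boils down to: record the message ID atomically before acting, and treat a duplicate as a no-op. A runnable sketch where a dict stands in for the DynamoDB table that a real consumer would write with a conditional put (`attribute_not_exists`):

```python
# Idempotent consumer sketch: a redelivered message becomes a no-op.
processed: dict = {}     # stands in for a DynamoDB dedup table
side_effects: list = []  # records what the business logic actually did

def handle_message(message_id: str, body: str) -> bool:
    """Return True if processed, False if recognized as a duplicate."""
    if message_id in processed:   # the conditional-write check in DynamoDB
        return False
    processed[message_id] = True
    side_effects.append(body)     # the actual business logic, run exactly once
    return True

handle_message("m-1", "charge card")
handle_message("m-1", "charge card")  # SQS redelivery: skipped
```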

Transferring 500 TB over a 1 Gbps link would theoretically take over 45 days, so the only workable solution is physical offline transfer. I would order multiple AWS Snowball Edge Storage Optimized devices, encrypt the data (with keys managed by on-premises HSMs where required) while loading it onto the appliances via the native S3 adapter, and ship them to AWS, where the data is ingested into S3 securely, meeting the stringent 14-day timeline.
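The 45-day figure is easy to verify with back-of-the-envelope math, assuming decimal terabytes and a perfectly utilized link (which real transfers never achieve):

```python
# Why 500 TB cannot go over a 1 Gbps wire in time.
TB = 10**12                # 1 terabyte in bytes (decimal)
data_bits = 500 * TB * 8   # 500 TB expressed in bits
link_bps = 10**9           # 1 Gbps

seconds = data_bits / link_bps
days = seconds / 86_400    # ~46.3 days even on a perfect link
```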

A hot partition occurs when a poor partition key concentrates read/write operations on a single physical partition (e.g., querying strictly by a rapidly updating 'Status' attribute). I would re-architect the data model with partition-key strategies such as adding artificial suffixes (write sharding) to distribute heavily accessed items. If the workload is overwhelmingly read-heavy on specific hot keys, I would place DynamoDB Accelerator (DAX) in front of the table to serve cached reads at microsecond latency.
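A sketch of the write-sharding idea: derive a deterministic suffix from the item ID so writes spread over N partition keys, while readers fan out over all N suffixes and merge results. The shard count of 10 is an arbitrary tuning choice for the example.

```python
import hashlib

SHARD_COUNT = 10  # tuning knob: more shards = wider writes, wider read fan-out

def sharded_key(base_key: str, item_id: str) -> str:
    """Deterministic suffix so the same item always maps to the same shard."""
    shard = int(hashlib.sha256(item_id.encode()).hexdigest(), 16) % SHARD_COUNT
    return f"{base_key}#{shard}"

def all_shards(base_key: str) -> list:
    """Readers must query every suffix and merge the results."""
    return [f"{base_key}#{i}" for i in range(SHARD_COUNT)]

key = sharded_key("STATUS#PENDING", "order-1234")  # e.g. "STATUS#PENDING#<n>"
```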

I would establish a centralized AWS Organizations CloudTrail logging to a dedicated, highly restricted 'Security Tooling' AWS account. The target S3 bucket would enforce S3 Object Lock in Compliance Mode with a 7-year retention period. Compliance Mode prevents any user, including the root user of the security account, from altering, deleting, or overwriting the CloudTrail log files for the duration of the retention period.