I PASSED!
ELB + ASG
Why use a load balancer
- spread load across multiple downstream instances
- expose a single point of access (DNS) to your application
- seamlessly handle failures of downstream instances
- do regular health checks to your instances
- provide SSL termination (HTTPS) for your websites
- enforce stickiness with cookies
- high availability across zones
- separate public traffic from private traffic
ELB integrated with many AWS services
- EC2, ASG, ECS
- ACM, CloudWatch
- Route 53, WAF, Global Accelerator
ALB
- Layer 7 (HTTP)
- load balancing to multiple HTTP applications across machines (target groups)
- load balancing to multiple applications on the same machine (containers)
- support for HTTP and WebSockets
- support redirects (from HTTP to HTTPS)
- routing tables to different target groups
- based on path in URL
- based on hostname in URL
- routing based on query string, headers
- ALBs are a great fit for microservices and container-based applications
- has a port mapping feature to redirect to a dynamic port in ECS
- in comparison, with CLB we would need one load balancer per application
- the application doesn’t see the client IP directly
Target groups
- EC2 instances
- ECS tasks
- lambda functions (HTTP request is translated into a JSON event)
- IP addresses - must be private IPs
- ALB can route to multiple target groups, each target group could have multiple instances
- health checks are at the target group level
- you can set rules to decide which target groups to redirect the traffic
NLB
- layer 4 (TCP and UDP)
- handles millions of requests per second, with lower latency
- NLB has one static IP per AZ, and supports assigning Elastic IP (helpful for whitelisting specific IP)
- NLB are used for extreme performance, TCP or UDP traffic
Sticky Sessions
- it is possible to implement stickiness so that the same client is always redirected to the same instance behind a load balancer
- this works for CLB and ALB
- the cookie used for stickiness has an expiration date you control
- use case: make sure the user doesn’t lose his session data
- enabling stickiness may bring imbalance to the load over the backend EC2 instances
SSL - server name indication
- SNI solves the problem of loading multiple SSL certificates onto one web server (to serve multiple websites)
- it is a newer protocol, and requires the client to indicate the hostname of the target server in the initial SSL handshake
- the server will then find the correct certificate, or return the default one
- only works for ALB, NLB, and CloudFront
Connection Draining (De-registration delay)
- time to complete the in-flight requests while the instance is de-registering or unhealthy
- stops sending new requests to the EC2 instance which is de-registering
- between 1 to 3600 seconds (default 300 seconds)
- can be disabled (set value to 0)
- set to a low value if your requests are short
- instances will be terminated after the draining time is over
X-Forwarded-For and X-Forwarded-Proto
- X-Forwarded-For
- The X-Forwarded-For (XFF) header is a de-facto standard header for identifying the originating IP address of a client connecting to a web server through an HTTP proxy or a load balancer. When traffic is intercepted between clients and servers, server access logs contain the IP address of the proxy or load balancer only. To see the original IP address of the client, the X-Forwarded-For request header is used.
- X-Forwarded-Proto
- The X-Forwarded-Proto (XFP) header is a de-facto standard header for identifying the protocol (HTTP or HTTPS) that a client used to connect to your proxy or load balancer. Your server access logs contain the protocol used between the server and the load balancer, but not the protocol used between the client and the load balancer. To determine the protocol used between the client and the load balancer, the X-Forwarded-Proto request header can be used.
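As a quick illustration of how an application behind an ALB can use these headers, here is a minimal Python sketch; the header names are the standard ones, but the client_info helper and the headers dict are hypothetical stand-ins for whatever your web framework exposes.

```python
# Minimal sketch: recover the original client IP / protocol behind an ALB.
# The header names are standard; `headers` stands in for your framework's request headers.

def client_info(headers: dict) -> tuple:
    # X-Forwarded-For may contain a chain "client, proxy1, proxy2";
    # the left-most entry is the originating client.
    xff = headers.get("X-Forwarded-For", "")
    client_ip = xff.split(",")[0].strip() if xff else "unknown"
    # X-Forwarded-Proto tells us whether the client used HTTP or HTTPS.
    proto = headers.get("X-Forwarded-Proto", "http")
    return client_ip, proto

print(client_info({"X-Forwarded-For": "203.0.113.7, 10.0.0.12",
                   "X-Forwarded-Proto": "https"}))
```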
ASG
- A launch configuration (launch template is the newer version)
- AMI + instance type
- EC2 user data
- EBS volume
- security groups
- SSH key pair
- min size, max size, initial capacity
- network + subnets information
- load balancer information (so the ASG knows which target group to launch the instance in), ASG and ELB can be linked
- scaling policies
- it is possible to scale an ASG based on CloudWatch alarms
- an alarm monitors a metric (such as average CPU)
- metrics are computed for the overall ASG instances
- to update an ASG, you must provide a new launch configuration or new launch template
- IAM roles attached to an ASG will get assigned to EC2 instances launched
- ASG is free, you need to pay for the underlying resources launched
Scaling Policies
- Target tracking scaling
- simplest and easiest to set up
- example: I want the average ASG CPU to stay at around 40% (see the boto3 sketch after this list)
- Simple / Step scaling
- when a CloudWatch alarm is triggered, then add 2 units
- when a CloudWatch alarm is triggered, then remove 1 unit
- the difference between simple and step scaling policies is: with a step policy, you can create step adjustments, and the ASG will change the number of instances based on the size of the alarm breach.
- scheduled actions
- anticipate a scaling based on known usage patterns
- example: increase the min capacity to 10 at 5pm on Fridays
- predictive scaling
- continuously forecast load and schedule scaling ahead
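A minimal boto3 sketch of the target tracking example above (keep average CPU around 40%); the ASG name and policy name are placeholders, not anything defined elsewhere in these notes.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Target tracking: keep the ASG's average CPU utilization around 40%.
# "my-asg" is a hypothetical Auto Scaling group name.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="keep-cpu-at-40",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 40.0,
    },
)
```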
Good metrics to scale on
- CPU utilization
- request count per target
- average network in / out
Scaling cooldowns
- after a scaling policy happens, you are in the cooldown period (default is 300 seconds)
- during the cooldown period, the ASG will not launch or terminate additional instances (to allow metrics to stabilize)
- advice: use a ready-to-use AMI to reduce configuration time in order to serve requests faster, and reduce the cooldown period
RDS
- managed DB service for databases that use SQL as a query language
- it allows you to create databases in the cloud that are managed by AWS
- Postgres
- MySQL
- MariaDB
- Oracle
- SQL server
- Aurora
- RDS is a managed service
- automated provisioning, OS patching
- continuous backups and restore to specific timestamp (point in time restore)
- monitoring dashboards
- read replicas for improved read performance
- Multi AZ setup for DR
- maintenance windows for upgrades
- scaling capability
- storage backup by EBS
- RDS DB will be launched in a VPC in an AZ
- you can't SSH into your instances
RDS backups
- Automated backups
- daily full backup of the database
- transaction logs are backed up by RDS every 5 mins
- ability to restore to any point in time (from oldest to 5 mins ago)
- 7 days retention (can be increased to 35 days)
- DB snapshots
- manually triggered by the user
- retention of backup for as long as you want
RDS storage auto scaling
- helps you increase storage on your RDS DB instance dynamically
- when RDS detects you are running out of free database storage, it scales automatically
- avoid manually scaling your database storage
- you have to set Maximum storage threshold
- automatically modify storage if
- free storage is less than 10% of allocated storage
- low storage lasts at least 5 mins
- 6 hours have passed since last modification
- useful for applications with unpredictable workloads
RDS read replicas for read scalability
- up to 5 read replicas
- within AZ, cross AZ or cross region
- replication is async, so reads are eventually consistent
- replicas can be promoted to their own DB
- applications must update the connection string to leverage read replicas
RDS read replicas - use case
- you have a production database, that is taking on normal load
- you want to run a reporting application to run some analytics
- you create a read replica to run the new workload there
- the production application is unaffected
- read replicas are used for SELECT only kind of statements
RDS read replicas - network cost
- in AWS there is a network cost when data goes from one AZ to another
- for RDS read replicas, within the same region, you don’t pay that fee, but you do need to pay if data goes to another region
RDS multi AZ (DR)
- SYNC replication
- one DNS name - automatic app failover to standby
- increase availability
- failover in case of loss of AZ, loss of network, instance or storage failure
- no manual intervention in apps
- not used for scaling (the standby instance cannot serve reads or writes)
- NOTE: the read replicas can be setup as Multi AZ for DR
RDS from single AZ to Multi AZ
- zero downtime operation (no need to stop the DB)
- just click on modify for the database
- the following happens internally
- a snapshot is taken
- a new DB is restored from the snapshot in a new AZ
- synchronization is established between the two databases
RDS security - encryption
- at rest encryption
- possibility to encrypt the master and read replicas with AWS KMS - AES-256 encryption
- encryption has to be defined at launch time
- if the master is not encrypted, the read replicas cannot be encrypted
- in flight encryption
- SSL certificates to encrypt data to RDS in flight
- provide SSL options with trust certificate when connecting to database
RDS encryption operations
- encrypting RDS backups
- snapshots of unencrypted RDS databases are un-encrypted
- snapshots of encrypted RDS databases are encrypted
- can copy a snapshot into an encrypted one
- to encrypt an un-encrypted RDS database
- create a snapshot of the un-encrypted database
- copy the snapshot and enable encryption for the snapshot
- restore the database from the encrypted snapshot
- migrate applications to the new database, and delete the old database
RDS security - Network and IAM
- network security
- RDS databases are usually deployed within a private subnet, not in a public one
- RDS security works by leveraging security groups (the same concept as for EC2 instances) - it controls which IP / security group can communicate with RDS
- Access management
- IAM policies help control who can manage AWS RDS
- traditional username and password can be used to login into the database
- IAM based authentication can be used to login into RDS MySQL and postgreSQL
Aurora
- Aurora is a proprietary technology from AWS
- postgres and MySQL are both supported as Aurora DB
- Aurora is AWS cloud optimized and claims 5x performance improvement over MySQL on RDS, over 3x performance of Postgres on RDS
- Aurora storage automatically grows in increments of 10GB, up to 64TB
- Aurora can have 15 replicas while MySQL has 5, and the replication process is faster
- failover in Aurora is instantaneous, it is HA native
- users connect to a writer endpoint or a reader endpoint; these endpoints redirect traffic to the correct instances
- Security is the same as RDS
- Aurora has 4 features
- one writer, multiple reader
- one writer, multiple readers - parallel query
- multiple writers
- serverless
ElastiCache
- the same way RDS gives you managed relational databases, ElastiCache gives you managed Redis and Memcached
- Caches are in memory databases with really high performance, low latency
- helps reduce load off of databases for read intensive workloads
- helps make your application stateless
- AWS takes care of OS maintenance and patching, optimizations, setup, configuration, monitoring, failure recovery and backups
- using ElastiCache involves heavy application code changes
DB cache
- the application queries ElastiCache; if the data is not available, it gets it from RDS and stores it in ElastiCache
- helps relieve load in RDS
- cache must have an invalidation strategy to make sure only the most current data is used in there
User session store
- the user logs into any instance of the application
- the application writes the session data into ElastiCache
- the user hits another instance of our application
- the instance retrieves the data and the user is already logged in
Redis vs Memcached
| Redis | Memcached |
|---|---|
| Multi AZ with auto failover | multi node for partitioning of data (sharding) |
| read replicas to scale reads and have HA | no HA |
| data durability using AOF persistence | non persistent |
| backup and restore features | no backup and restore |
| - | multi threaded architecture |
- Redis Auth
- if you enable encryption in transit, you can enable Redis AUTH; you need to set up a token for your application to connect to Redis.
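A minimal redis-py sketch, assuming encryption in transit and Redis AUTH are enabled on the cluster; the endpoint and token values are placeholders.

```python
import redis  # redis-py

# Hypothetical ElastiCache Redis endpoint with in-transit encryption + AUTH token.
r = redis.Redis(
    host="my-redis.xxxxxx.0001.euw1.cache.amazonaws.com",  # placeholder endpoint
    port=6379,
    password="my-auth-token",  # the Redis AUTH token
    ssl=True,                  # required when encryption in transit is enabled
)
r.set("greeting", "hello")
print(r.get("greeting"))
```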
Caching implementation considerations
- is it safe to cache data
- data may be out of date, eventually consistent
- is caching effective for that data
- pattern: data changing slowly, few keys are frequently needed, good to use caching
- anti pattern: data changing rapidly, a large key space frequently needed, not good to use caching
- is data structured well for caching?
- key value caching, or caching of aggregations results?
- caching is good for well structured data
lazy loading / cache aside / lazy population
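A minimal lazy loading / cache-aside sketch in Python, assuming hypothetical cache and db clients (for example redis-py and an RDS connection wrapper).

```python
# Lazy loading / cache-aside sketch.
# `cache` and `db` are hypothetical clients (e.g. redis-py and an RDS connection wrapper).

def get_user(user_id, cache, db, ttl=300):
    key = f"user:{user_id}"
    record = cache.get(key)          # 1. try the cache first
    if record is not None:
        return record                # cache hit
    record = db.query_user(user_id)  # 2. cache miss: read from the database
    if record is not None:
        cache.set(key, record, ttl)  # 3. populate the cache for next time
    return record
```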
write through - add or update cache when database is updated
- when there is a write call
- write to DB
- write to cache
- pros
- data in cache is never stale, reads are quick
- write penalty instead of read penalty (each write requires 2 calls)
- cons
- missing data until it is added / updated in the DB; mitigation is to also implement the lazy loading strategy, combining the 2 strategies
- cache churn - a lot of the data will never be read
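A minimal write-through sketch in Python, reusing the same hypothetical cache and db clients as the lazy loading example above.

```python
# Write-through sketch: every write updates the DB and the cache.
# `cache` and `db` are the same hypothetical clients as in the lazy loading sketch.

def save_user(user_id, data, cache, db, ttl=300):
    db.update_user(user_id, data)            # 1. write to the database
    cache.set(f"user:{user_id}", data, ttl)  # 2. write to the cache (never stale)
    return data
```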
cache evictions and TTL (time to live)
- cache eviction can occur in 3 ways
- you delete the item explicitly in the cache
- item is evicted because the memory is full and it is not recently used (LRU)
- you set an item TTL
- TTL are helpful for any kind of data
- leaderboard
- comments
- activity streams
- TTL can range from a few seconds to hours or days
- if too many evictions happen due to memory, you should scale up or out
Final words
- lazy loading is easy to implement and works for many situations as a foundation, especially on the read side
- write through is usually combined with lazy loading as targeted for the queries or workloads that benefit from this optimization
- setting a TTL is usually not a bad idea, except when you are using write through, set it to a sensible value for your application
- only cache data that makes sense
ElastiCache replication - cluster mode disabled
- one primary node, up to 5 replicas
- asynchronous replication
- the primary node is used for read and write
- the other nodes are read only
- we have only one shard, and all nodes are in the shard, each node has all the data
- guards against data loss in case of node failure
- multi AZ enabled by default for failover
- helpful to scale read performance
ElastiCache replication - cluster mode enabled
- data is partitioned across shards (helpful to scale writes)
- each shard has a primary and up to 5 replica nodes, each shard has part of the data
- multi AZ capability
- up to 500 nodes per cluster
Route 53
- DNS
- Domain Name System which translates the human friendly hostnames into the machine IP addresses
- Route 53
- A highly available, scalable, fully managed and authoritative DNS
- route 53 is also a Domain Registrar
- ability to check the health of your resources
- the only AWS service which provides 100% availability SLA
Hosted zones
- a container for records that define how to route traffic to a domain and its subdomains
- public hosted zones
- contains records that specify how to route traffic on the internet (public domain names)
- private hosted zones
- contain records that specify how you route traffic within one or more VPCs (private domain names)
CNAME vs alias
- CNAME
- points a hostname to any other hostname
- only for non root domain
- Alias
- points a hostname to an AWS resource
- works for root domain and non root domain
- free of charge
- native health check
- automatically recognizes changes in the resource’s IP addresses
- Alias targets
- ELB
- CloudFront distributions
- API gateway
- Elastic Beanstalk environments
- S3 websites
- VPC interface endpoints
- Global accelerator
- route 53 record in the same hosted zone
- You cannot set an Alias record for an EC2 DNS name
Route 53 - routing policies
- define how route 53 responds to DNS queries
- route 53 supports the following routing policies
- simple
- weighted
- failover
- latency based
- geolocation
- multi value answer
- geoproximity (using route 53 traffic flow feature)
Simple
- typically, route traffic to a single resource
- can specify multiple values in the same record
- if multiple values are returned, a random one is chosen by the client
- when Alias is enabled, specify only one AWS resource
- can’t be associated with health checks
Weighted
- control the percentage of the requests that go to each specific resource
- assign each record a relative weight
- DNS records must have the same name and type
- use cases: load balancing between regions, testing new application versions…
- assign a weight of 0 to a record to stop sending traffic to a resource
- if all records have weight of 0, then all records will be returned equally
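A minimal boto3 sketch of two weighted records, assuming a hypothetical hosted zone and record values; roughly 70% / 30% of DNS answers go to each.

```python
import boto3

route53 = boto3.client("route53")

# Two weighted A records with the same name and type.
# Hosted zone ID, record name, and IPs are placeholders.
for identifier, ip, weight in [("blue", "203.0.113.10", 70), ("green", "203.0.113.20", 30)]:
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": identifier,  # distinguishes the weighted records
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": ip}],
                },
            }]
        },
    )
```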
latency
- redirects to the resource that gives the lowest latency for the user
- super helpful when latency for users is a priority
- latency is based on traffic between users and AWS regions
- Germany users may be directed to the US (if that gives the lowest latency)
- can be associated with health checks (has a failover capability)
health checks
- HTTP health checks are only for public resources
- health check => automated DNS failover
- health checks that monitor an endpoint
- health checks that monitor other health checks (calculated health checks)
- health checks that monitor cloudwatch alarms
- health checks are integrated with CW metrics
monitor an endpoint
- about 15 global health checkers will check the endpoint health
- healthy / unhealthy threshold = 3
- interval - 30 seconds
- supported protocol: HTTP, HTTPS, TCP
- if > 18% of health checkers report the endpoint is healthy, Route 53 considers it healthy; otherwise, it is unhealthy
- ability to choose which locations you want route 53 to use
- health checks pass only when the endpoint responds with the 2xx and 3xx status codes
- health checks can be setup to pass or fail based on the text in the first 5120 bytes of the response
- your ELB must allow the incoming requests from the route 53 health checkers IP address range
calculated health checks
- combine the results of multiple health checks into a single health check
- you can use OR, AND, or NOT
- can monitor up to 256 child health checks
- specify how many of the health checks need to pass to make the parent pass
- usage: perform maintenance to your website without causing all health checks to fail
private hosted zones
- route 53 health checks are outside the VPC
- they can’t access private endpoint
- you can create a CloudWatch metric and associate a CloudWatch alarm, then create a health check that checks the alarm itself; if the CloudWatch alarm status becomes ALARM, the health check becomes unhealthy
failover
- create two records associated with 2 resources
- primary and secondary records
- the primary record must be associated with a health check
- if the primary record is unhealthy, DNS will return IP address of the secondary resource
Geolocation
- different from latency based
- this routing is based on user location
- specify location by Continent, Country, or by US state
- should create a default record (in case there is no match on location)
- use cases: website localization, restrict content distribution, load balancing…
- can be associated with health checks
Geoproximity
- route traffic to your resources based on the geographic location of users and resources
- ability to shift more to resources based on the defined bias
- to change the size of the geographic region, specify bias values
- to expand (1 to 99), more traffic to the resource
- to shrink (-1 to -99), less traffic to the resource
- resource can be
- AWS resources (AWS region)
- non AWS resources (latitude and longitude)
- you must use route 53 traffic flow (advanced) to use this feature
Traffic flow
- simplifies the process of creating and maintaining records in large and complex configurations
- visual editor to manage complex routing decision trees
- configurations can be saved as traffic flow policy
- can be applied to different route 53 hosted zones
- supports versioning
multi value
- use when routing traffic to multiple resources
- route 53 returns multiple values / resources
- can be associated with health checks (return only values for healthy resources)
- up to 8 healthy records are returned for each multi value query
- multi value is not a substitute for having an ELB (it is more like client-side load balancing)
VPC
- VPC: private network to deploy your resource
- subnet: allow you to partition your network inside your VPC (AZ resource)
- a public subnet is a subnet that is accessible from the internet
- a private subnet is a subnet that is not accessible from the internet
- to define access to the internet and between subnets, we use route tables
Internet gateway and NAT gateways
- internet gateways helps our VPC instances connect with the internet
- public subnets have a route to the internet gateway
- NAT gateways (AWS managed) and NAT instances (self managed) allow your instances in your private subnets to access the internet while remaining private
NACL and security groups
- NACL (network ACL)
- a firewall which controls traffic from and to subnet
- can have ALLOW and DENY rules
- are attached at the subnet level
- rules only include IP addresses
- security groups
- a firewall that controls traffic to and from an ENI / an EC2 instance
- can have only ALLOW rules
- rules include IP addresses and other security groups
| Security group | Network ACL |
|---|---|
| operates at the instance level | operates at the subnet level |
| supports allow rules only | supports allow rules and deny rules |
| is stateful: return traffic is automatically allowed, regardless of any rules | is stateless: return traffic must be explicitly allowed by rules |
| we evaluate all rules before deciding whether to allow traffic | we process rules in number order when deciding whether to allow traffic |
| applies to an instance only if someone specifies the security group when launching the instance, or associates the security group with the instance later on | automatically applies to all instances in the subnets it's associated with (therefore, you don't have to rely on users to specify the security group) |
VPC Flow logs
- capture information about IP traffic going into your interfaces
- VPC flow logs
- subnet flow logs
- ENI (elastic network interface) flow logs
- helps to monitor and troubleshoot connectivity issues
- subnets to internet
- subnets to subnets
- internet to subnets
- captures network information from AWS managed interfaces too: Elastic load balancers, ElastiCache, RDS, Aurora, etc…
- VPC flow logs data can go to S3 / CloudWatch logs
VPC Peering
- connect two VPCs, privately using AWS network
- make them behave as if they were in the same network
- must not have overlapping CIDR
- VPC peering connection is not transitive (must be established for each VPC that need to communicate with one another)
VPC endpoints
- endpoints allow you to connect to AWS services using a private network instead of the public www network
- this gives you enhanced security and lower latency to access AWS services
- VPC endpoint gateway: S3 and DynamoDB
- VPC endpoint interface: the rest of the AWS services
- only used within your VPC
Site to site VPN and Direct connect
- site to site VPN
- connect an on premises VPN to AWS
- the connection is automatically encrypted
- goes over the public internet
- direct connect
- establish a physical connection between on premises and AWS
- the connection is private, secure, and fast
- goes over a private network
- takes at least a month to establish
- NOTE: site to site VPN and direct connect cannot access VPC endpoints
S3
buckets
- S3 allows people to store objects in buckets
- buckets must have a globally unique name
- buckets are defined at the region level
objects
- objects have a key
- the key is the FULL path
- the key is composed of prefix + object name
- there is no concept of directories within buckets
- just keys with very long names that contain slashes
- object values are the content of the body
- max object size is 5TB
- if uploading more than 5GB, must use multi-part upload
- metadata (list of text key / value pairs - system or user metadata)
- tags (unicode key / value pair, up to 10) - useful for security / lifecycle
- version ID (if versioning is enabled)
versioning
- you can version your files in S3
- it is enabled at the bucket level
- same key overwrite will increment the version: 1,2,3…
- it is best practice to version your buckets
- protect against unintended deletes
- easy roll back to previous version
- note:
- any file that is not versioned prior to enabling versioning will have version null
- suspending versioning does not delete the previous versions
Encryption for objects
SSE-S3
- encryption using keys handled and managed by S3
- object is encrypted server side
- AES-256 encryption type
SSE-KMS
- encryption using keys handled and managed by KMS
- KMS advantages: user control + audit trail
- object is encrypted server side
SSE-C
- server side encryption using data keys fully managed by the customer outside of AWS
- S3 does not store the encryption key you provide
- HTTPS must be used, because you need to send the encryption key in the header
- encryption key must be provided in HTTP headers, for every HTTP request made
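A minimal boto3 sketch of SSE-C, assuming a hypothetical bucket; note the same customer key must be supplied again on every request.

```python
import os
import boto3

s3 = boto3.client("s3")        # boto3 uses HTTPS endpoints by default, which SSE-C requires
customer_key = os.urandom(32)  # 256-bit key you manage yourself outside of AWS

# SSE-C upload: the key travels in headers, S3 does not store it.
s3.put_object(
    Bucket="my-bucket",        # placeholder bucket
    Key="secret.txt",
    Body=b"hello",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)

# Every subsequent request must provide the same key again.
obj = s3.get_object(
    Bucket="my-bucket",
    Key="secret.txt",
    SSECustomerAlgorithm="AES256",
    SSECustomerKey=customer_key,
)
print(obj["Body"].read())
```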
client side encryption
- client library such as the Amazon S3 encryption client
- clients must encrypt data themselves before sending to S3
- clients must decrypt the data themselves when retrieving from S3
- customer fully manages the keys and encryption cycle
Encryption in transit (SSL/TLS)
- Amazon S3 exposes
- HTTP endpoint: non encrypted
- HTTPS endpoint: encryption in flight
- you are free to use the endpoint you want, but HTTPS is recommended
- most clients would use the HTTPS endpoint by default
- HTTPS is mandatory for SSE-C
security
- user based
- IAM policies - which API calls should be allowed for a specific user from IAM console
- resource based
- bucket policies - bucket wide rules from the S3 console - allows cross account
- object ACL - finer grain
- bucket ACL - less common
- NOTE: an IAM principal can access an S3 object if
- the user's IAM permissions allow it OR the resource policy allows it
- AND there is no explicit DENY
bucket settings for block public access
- block public access to buckets and objects granted through
- new access control lists
- any access control lists
- new public bucket or access point policies
- block public and cross account access to buckets and objects through any public bucket or access point policies
- these settings were created to prevent company data leaks
- if you know your bucket should never be public, leave these on
- can be set at the account level
others
- networking
- supports VPC endpoints (for instances in VPC without www internet)
- logging and audit
- S3 access logs can be stored in other S3 buckets
- API calls can be logged in AWS cloudtrail
- user security
- MFA delete: MFA can be required in versioned buckets to delete objects
- pre-signed URLs: URLs that are valid for a limited time (premium videos service for logged in users)
CORS
- an origin is a scheme, host, and port
- CORS means cross origin resource sharing
- web browser based mechanism to allow requests to other origins while visiting the main origin
- the requests won’t be fulfilled unless the other origin allows for the requests, using CORS headers
- if a client does a cross origin request on our S3 bucket, we need to enable the correct CORS headers
- you can allow for a specific origin or for * (for all origins)
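A minimal boto3 sketch of setting a CORS rule on a bucket; the bucket name and allowed origin are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Allow GETs from one specific origin (bucket name and origin are placeholders).
s3.put_bucket_cors(
    Bucket="my-assets-bucket",
    CORSConfiguration={
        "CORSRules": [{
            "AllowedOrigins": ["https://www.example.com"],  # or ["*"] for all origins
            "AllowedMethods": ["GET"],
            "AllowedHeaders": ["*"],
            "MaxAgeSeconds": 3000,
        }]
    },
)
```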
consistency model
- strong consistency as of Dec 2020
AWS CLI, SDK, IAM Roles and policies
- AWS CLI Dry run
- tells you if your command would have succeeded or not, without actually executing it
- AWS CLI STS decode errors
- decode API error messages using the STS command line
AWS EC2 instance metadata
- it allows AWS EC2 instances to learn about themselves without using an IAM role for that purpose
- the URL is http://169.254.169.254/latest/meta-data/
- you can retrieve the IAM role name from the metadata, but you cannot retrieve the IAM policy
- metadata = info about the EC2 instance
- user data = launch script of the EC2 instance
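A minimal Python sketch of querying the metadata service from inside an EC2 instance (IMDSv1-style; instances that enforce IMDSv2 need a session token first, which is not shown here).

```python
# Only works from inside an EC2 instance (IMDSv1-style, unauthenticated access).
import urllib.request

BASE = "http://169.254.169.254/latest/meta-data/"

def metadata(path=""):
    with urllib.request.urlopen(BASE + path, timeout=2) as resp:
        return resp.read().decode()

print(metadata())                             # list of available metadata keys
print(metadata("iam/security-credentials/"))  # attached IAM role name (if any)
```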
MFA with CLI
- to use MFA with the CLI, you must create a temporary session
- to do so, you must run the STS GetSessionToken API call
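A minimal boto3 sketch of GetSessionToken with MFA; the MFA device ARN and token code are placeholders.

```python
import boto3

sts = boto3.client("sts")

# GetSessionToken with MFA: serial number and token code are placeholders.
resp = sts.get_session_token(
    DurationSeconds=3600,
    SerialNumber="arn:aws:iam::123456789012:mfa/my-user",  # your MFA device ARN
    TokenCode="123456",                                    # current code from the device
)
creds = resp["Credentials"]
print(creds["AccessKeyId"], creds["SessionToken"][:20] + "...")
```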
AWS SDK
- what if you want to perform actions on AWS directly from your applications code?
- you can use an SDK
- we have to use the AWS SDK when coding against AWS services such as DynamoDB
- if you don’t specify or configure a default region, then us-east-1 will be chosen by default
AWS limit
- API rate limits
- DescribeInstances API for EC2 has a limit of 100 calls per second
- GetObject on S3 has a limit of 5500 GET per second per prefix
- for intermittent errors: implement exponential backoff
- for consistent errors: request an API throttling limit increase
- service quotas
- running on-demand standard instances: 1152 vCPU
- you can request a service limit increase by opening a ticket
- you can request a service quota increase by using the service quotas API
Exponential Backoff
- if you get ThrottlingException intermittently, use exponential backoff
- retry mechanism already included in AWS SDK API calls
- must implement yourself if using the AWS API as-is or in specific cases
- must only implement the retries on 5xx server errors and throttling
- do not implement on the 4xx client errors
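A minimal Python sketch of exponential backoff with jitter, as you would implement it when calling the AWS API directly (the SDKs already do this for you).

```python
import random
import time

def call_with_backoff(fn, max_retries=5):
    """Retry fn() with exponential backoff + jitter.

    In real code, only retry on throttling errors and 5xx server errors,
    never on 4xx client errors.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # 1s, 2s, 4s, 8s... plus jitter so clients don't retry in lockstep
            time.sleep((2 ** attempt) + random.random())
```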
AWS CLI credentials provider chain
- the CLI will look for credentials in this order
- command line options
- environment variables
- CLI credentials file
- CLI configuration file
- container credentials
- instance profile credentials
AWS SDK default credentials provider chain
- the java SDK will look for credentials in this order
- java system properties
- environment variables
- the default credential profiles file
- Amazon ECS container credentials
- instance profile credentials
Credentials Scenario
- an application deployed on an EC2 instance is using environment variables with credentials from an IAM user to call the Amazon S3 API
- The IAM user has S3FullAccess permissions
- the application only uses one S3 bucket, so according to best practices
- an IAM role and EC2 instance profile was created for the EC2 instance
- the role was assigned the minimum permissions to access that one S3 bucket
- the IAM instance profile was assigned to the EC2 instance, but it still had access to all S3 buckets, why?
- the credentials provider chain is still giving priority to the environment variables
credentials best practice
- never store AWS credentials in your code
- best practice is for credentials to be inherited from the credentials chain
- if working within AWS, use IAM roles
- EC2 instance roles for EC2 instances
- ECS roles for ECS tasks
- lambda roles for lambda functions
- if working outside AWS, use environment variables / named profiles
signing AWS API requests
- when you call the AWS HTTP API, you sign the request so that AWS can identify you, using your AWS credentials (access key and secret key)
- note: some requests to Amazon S3 don’t need to be signed
- if you use the SDK or CLI, the HTTP requests are signed for you
- you should sign an AWS HTTP request using Signature v4 (SigV4)
sigV4 options
- HTTP header
- query string in URL
S3 and Athena Advanced
S3 MFA delete
- MFA forces users to generate a code on a device before doing important operations on S3
- to use MFA delete, we need to enable versioning on the S3 bucket
- you will need MFA to
- permanently delete an object version
- suspend versioning on the bucket
- you won’t need MFA for
- enabling versioning
- listing deleted versions
- only the bucket owner can enable / disable MFA delete
- MFA delete can only be enabled using the CLI
S3 default encryption vs bucket policies
- one way to force encryption is to use a bucket policy and refuse any API call to PUT an S3 object without encryption headers
- another way is to use the default encryption option in S3
- note: bucket policies are evaluated before default encryption
S3 access logs
- for audit purpose, you may want to log all access to S3 buckets
- any request made to S3, from any account, authorized or denied, will be logged into another S3 bucket
- that data can be analyzed using data analysis tools…
- or Amazon Athena
- do not set your logging bucket to be the monitored bucket
- it will create a logging loop, and your bucket will grow in size exponentially
S3 replication (CRR or SRR)
- must enable versioning in source and destination
- cross region replication - CRR
- same region replication - SRR
- buckets can be in different accounts
- copying is asynchronous
- must give proper IAM permissions to S3
- CRR use cases: compliance, lower latency access for users in another region, replication across accounts
- SRR use cases: log aggregation, live replication between production and test accounts
- after activating, only new objects are replicated (not retroactive), existing objects will not be replicated
- for DELETE operations
- can replicate delete markers from source to target (optional setting)
- deletions with a version ID are not replicated (to avoid malicious deletes); it means if you delete an object using its version ID, this operation will not be replicated
- there is no chaining of replication
- if bucket 1 has replication into bucket 2, which has replication into bucket 3
- then objects created in bucket 1 are not replicated to bucket 3
S3 pre signed URLs
- can generate pre-signed URLs using the SDK or CLI
- for downloads (easy, can use the CLI)
- for uploads (harder, must use the SDK)
- valid for a default of 3600 seconds, can change the timeout with the --expires-in [TIME_BY_SECONDS] argument
- users given a pre-signed URL inherit the permissions of the person who generated the URL for GET / PUT
- examples
- allow only logged-in users to download a premium video from your S3 buckets
- allow an ever changing list of users to download files by generating URLs dynamically
- allow temporarily a user to upload a file to a precise location in our bucket
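A minimal boto3 sketch of generating a pre-signed GET URL; bucket and key are placeholders. For downloads, the CLI equivalent is aws s3 presign with the --expires-in argument.

```python
import boto3

s3 = boto3.client("s3")

# Pre-signed GET URL, valid for 1 hour (bucket and key are placeholders).
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "premium/video.mp4"},
    ExpiresIn=3600,
)
print(url)
```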
Amazon Glacier and Glacier Deep Archive
- Amazon Glacier - 3 retrieval options
- expedited - 1 to 5 mins
- standard - 3 to 5 hours
- bulk - 5 to 12 hours
- minimum storage duration of 90 days
- Amazon Glacier Deep Archive - for long term storage - cheaper
- standard - 12 hours
- bulk - 48 hours
- minimum storage duration of 180 days
S3 lifecycle rules
- transition actions: it defines when objects are transitioned to another storage class
- move objects to standard IA class 60 days after creation
- move to Glacier for archiving after 6 months
- expiration actions: configure objects to expire (delete) after some time
- access log files can be set to delete after a year
- can be used to delete old versions of files (if versioning is enabled)
- can be used to delete incomplete multi part uploads
- rules can be created for a certain prefix
- rules can be created for certain object tags
S3 performance
- Amazon S3 automatically scales to high request rates, latency 100-200ms
- your application can achieve at least 3500 PUT/COPY/POST/DELETE and 5500 GET/HEAD requests per second per prefix in a bucket
- there are no limits to the number of prefixes (prefix is a folder in S3) in a bucket
KMS limitation
- if you use SSE-KMS, S3 performance may be impacted by the KMS limits
- when you upload, it calls the GenerateDataKey KMS API
- when you download it, it calls the Decrypt KMS API
- count towards the KMS quota per second (5500, 10000, 30000 based on region)
- you can request a quota increase using the service quotas console
multi part upload
- recommended for files > 100MB, must be used for files > 5GB
- can help parallelize uploads (speed up transfers)
S3 transfer acceleration
- increase transfer speed by transferring file to an AWS edge location which will forward the data to the S3 bucket in the target region
- compatible with multi part upload
S3 byte range fetches
- parallelize GETs by requesting specific byte ranges
- better resilience in case of failures
- can be used to speed up downloads
- can be used to retrieve only partial data (for example the head of a file)
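A minimal boto3 sketch of a byte-range fetch, retrieving only the first kilobyte of an object (bucket and key are placeholders).

```python
import boto3

s3 = boto3.client("s3")

# Fetch only the first 1 KB of an object, e.g. to inspect a file header.
head = s3.get_object(
    Bucket="my-bucket",    # placeholder
    Key="big-file.bin",
    Range="bytes=0-1023",  # standard HTTP Range header syntax
)
print(len(head["Body"].read()))
```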
S3 select and Glacier select
- retrieve less data using SQL by performing server-side filtering
- can filter by rows and columns (complex query not supported)
- less network transfer, less CPU cost client side
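A minimal boto3 sketch of S3 Select on a CSV file; bucket, key, and the SQL expression are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Server-side filtering of a CSV file: only matching rows leave S3.
resp = s3.select_object_content(
    Bucket="my-bucket",      # placeholder
    Key="data/users.csv",
    ExpressionType="SQL",
    Expression="SELECT s.name FROM S3Object s WHERE s.country = 'FR'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```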
S3 event notifications
- use case: generate thumbnails of images uploaded to S3
- can create as many S3 events as desired
- S3 event notifications typically deliver events in seconds but can sometimes take a minute or longer
- if two writes are made to a single non-versioned object at the same time, it is possible that only a single event notification will be sent
- if you want to ensure that an event notification is sent for every successful write, you can enable versioning on your bucket
- compared to CloudWatch event or EventBridge, S3 event notifications have lower latency and lower costs, it works better with S3
AWS Athena
- serverless service to perform analytics directly against S3 files
- uses SQL language to query the files
- has a JDBC / ODBC driver
- charged per query and amount of data scanned
- supports CSV, JSON, ORC, Avro and Parquet (built on Presto)
- use cases: business intelligence, analytics, reporting, analyze and query, VPC flow logs, ELB logs, CloudTrail trails, etc…
- exam tip: analyze data directly on S3 => use Athena
AWS CloudFront
- CDN
- improves read performance, content is cached at the edge
- 216 points of presence globally (edge locations)
- DDoS protection, integration with AWS Shield and AWS WAF
- can expose external HTTPS and can talk to internal HTTPS backends
Origins
- S3 bucket
- for distributing files and caching them at the edge
- enhanced security with CloudFront OAI
- CloudFront can be used as ingress (to upload files to S3)
- Custom Origin (HTTP)
- ALB
- EC2 instance
- S3 website (must first enable the bucket as a static S3 website)
- any HTTP backend you want
CloudFront Geo Restriction
- you can restrict who can access your distribution
- whitelist: allow your users to access your content only if they are in one of the countries on a list of approved countries
- blacklist: prevent your users from accessing your content if they are in one of the countries on a blacklist of banned countries
- the country is determined using a third party geo IP database
- use case: copyright laws to control access to content
CloudFront vs S3 CRR
- CloudFront
- global edge network
- files are cached for a TTL
- great for static content that must be available everywhere
- S3 CRR
- must be set up for each region where you want replication to happen
- files are updated in near real time
- read only
- great for dynamic content that needs to be available at low latency in few regions
CloudFront caching
- cache based on
- headers
- session cookies
- query string parameters
- the cache lives at each CloudFront edge location
- you want to maximize the cache hit rate to minimize request on the origin
- control the TTL, which can be set by the origin using the Cache-Control header, Expires header…
- you can invalidate part of the cache using the CreateInvalidation API
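A minimal boto3 sketch of invalidating part of the cache with CreateInvalidation; the distribution ID and path are placeholders.

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Invalidate everything under /images/* (distribution ID is a placeholder).
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/images/*"]},
        "CallerReference": str(time.time()),  # must be unique per request
    },
)
```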
CloudFront signed URL / signed cookies
- you want to distribute paid shared content to premium users around the world
- we can use CloudFront signed URL / signed cookie, we attach a policy with
- includes URL expiration
- includes IP ranges to access the data from
- trusted signers (which AWS accounts can create signed URLs)
- how long should the URL be valid for
- shared content (movie, music): make it short (a few minutes)
- private content (private to the user): you can make it for years
- signed URL: access to individual files (one signed URL per file)
- signed cookies: access to multiple files (one signed cookie for many files)
CloudFront signed URL vs S3 pre signed URL
Signed URL
- allow access to a path, no matter the origin
- account wide key pair, only the root can manage it
- can filter by IP, path, date, expiration
- can leverage caching features
S3 pre signed URL
- issue a request as the person who pre signed the URL
- uses the IAM key of the signing IAM principal (has the same access as the IAM user who created the URL)
- limited lifetime
CloudFront signed URL process
- two types of signers
- either a trusted key group (recommended)
- can leverage APIs to create and rotate keys (and IAM for API security)
- an AWS account that contains a CloudFront key pair
- need to manage keys using the root account and the AWS console
- not recommended because you shouldn’t use the root account for this
- either a trusted key group (recommended)
- in your CloudFront distribution, create one or more trusted key groups
- you generate your own public / private key
- the private key is used by your applications to sign URLs
- the public key is used by cloudfront to verify URLs
Price classes
- you can reduce the number of edge locations for cost reduction
- three price classes
- price class All: all regions - best performance
- price class 200: most regions, but excludes the most expensive regions
- price class 100: only the least expensive regions
Multiple origin
- to route to different kinds of origins based on the content type
- based on path pattern
- /images/*
- /api/*
- /*
origin groups
- to increase high availability and do failover
- origin group: one primary and one secondary origin
- if the primary origin fails, the second one is used (works for both EC2 instance and S3 buckets)
field level encryption
- protects sensitive user information through the application stack
- adds an additional layer of security along with HTTPS
- sensitive information is encrypted at the edge, close to the user
- uses asymmetric encryption
- usage:
- specify the set of fields in POST requests that you want to be encrypted (up to 10 fields)
- specify the public key to encrypt them
- fields will be encrypted using the public key at edge locations and decrypted when the request reaches the web servers
ECS
Docker
- Docker is a software development platform to deploy apps
- apps are packaged in containers that can be run on any OS
- apps run the same, regardless of where they are run
- any machine
- no compatibility issues
- predictable behavior
- less work
- easier to maintain and deploy
- works with any language, any OS, any technology
- Docker images are stored in Docker repositories
- public: Docker Hub
- private: Amazon ECR
- Docker vs VM
- docker is sort of a virtualization technology, but not exactly
- resources are shared with the host => many containers on one server
- Docker containers management
- to manage containers, we need a container management platform
- 3 choices
- ECS: Amazon’s own platform
- Fargate: Amazon’s own serverless platform
- EKS: Amazon's managed Kubernetes (open source)
ECS clusters overview
- ECS clusters are logical grouping of EC2 instances
- EC2 instances run the ECS agent (Docker container)
- the ECS agent registers the instance to the ECS cluster
- the EC2 instances run a special AMI, made specifically for ECS
ECS task definitions
- task definitions are metadata in JSON form to tell ECS how to run a Docker container
- it contains crucial information around
- Image name
- port binding for container and host (80 -> 8080)
- memory and CPU required
- environment variables
- networking information
ECS service
- ECS services help define how many tasks should run and how they should be run
- they ensure that the number of tasks desired is running across our fleet of EC2 instances
- they can be linked to ELB / NLB / ALB if needed
ECS service with load balancer
- ALB has the dynamic port forwarding feature
- when you create ECS tasks, random host port numbers are assigned to the tasks
- multiple ECS tasks can run on a single EC2 instance with different port numbers
- ALB can use the dynamic port forwarding feature to route traffic to these tasks based on their port number
ECR
- ECR is a private Docker image repository
- access is controlled through IAM (if you have permission errors, check the policy)
how to login to ECR using AWS CLI
- if you have AWS CLI version 1
$(aws ecr get-login --no-include-email --region eu-west-1)
- then you need to execute the output of the above command
- if you have AWS CLI version 2
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin 12334556790.dkr.ecr.eu-west-1.amazonaws.com
- you could just execute the above command which is using the pipe feature
- Docker push and pull
Fargate
- when launching an ECS cluster, we have to create our EC2 instances
- if we need to scale, we need to add EC2 instances
- so we need to manage infrastructure…
- with Fargate, it is all serverless
- we don’t provision EC2 instances
- we just create task definitions, and AWS will run our containers for us
ECS IAM roles deep dive
- EC2 instance profile
- for an EC2 instance to run ECS tasks, we need to install the ECS agent on the EC2 instance
- the ECS agent will do these things
- make API calls to ECS service
- send container logs to CloudWatch logs
- pull docker image from ECR
- so ECS agent will use the EC2 instance profile role to do these things
- ECS task role
- when we run ECS tasks on EC2 instance, each task will have its own role
- we use different roles for the different ECS services run
- the task role is defined in the task definition
ECS tasks placement
- when a task of type EC2 is launched, ECS must determine where to place it, with the constraints of CPU, memory, and available port
- similarly, when a service scales in, ECS needs to determine which task to terminate
- to assist with this, you can define a task placement strategy and task placement constraints
- NOTE: this is only for ECS with EC2, not for Fargate
ECS task placement process
- task placement strategies are a best effort
- when Amazon ECS places tasks, it uses the following process to select container instances
- identify the instances that satisfy the CPU, memory, and port requirements in the task definition
- identify the instances that satisfy the task placement constraints
- identify the instances that satisfy the placement strategies
ECS task placement strategies
- Binpack
- place tasks based on the least available amount of CPU or memory
- this minimizes the number of instances in use (cost savings)
- random
- place the task randomly
- spread
- place the task evenly based on the specified value
- example: instanceID, availability zone
- you can also mix the placement strategies together, e.g. use Spread for the AZ and Binpack for memory
ECS task placement constraints
- distinctInstance: place each task on a different container instance
- memberOf: places task on instances that satisfy an expression
- uses the Cluster Query language
- e.g. place tasks only on t2 instances
ECS service auto scaling
- CPU and RAM are tracked in CloudWatch at the ECS service level
- target tracking: target a specific average CloudWatch metric
- step scaling: scale based on CloudWatch alarms
- scheduled scaling: based on predictable changes
- ECS service scaling (task level) != EC2 auto scaling (instance level)
- Fargate auto scaling is much easier to setup (because of serverless)
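A minimal boto3 sketch of target tracking scaling for an ECS service through Application Auto Scaling; the cluster and service names are placeholders.

```python
import boto3

aas = boto3.client("application-autoscaling")

# ECS service scaling uses Application Auto Scaling, not EC2 Auto Scaling.
# Cluster / service names are placeholders.
resource_id = "service/my-cluster/my-service"

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=10,
)

aas.put_scaling_policy(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```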
ECS cluster capacity provider
- a capacity provider is used in association with a cluster to determine the infrastructure that a task runs on
- for ECS and Fargate users, the FARGATE and FARGATE_SPOT capacity providers are added automatically
- for Amazon ECS on EC2, you need to associate the capacity provider with an auto scaling group
- when you run a task or a service, you define a capacity provider strategy, to prioritize in which provider to run
- this allows the capacity provider to automatically provision infrastructure for you
- if you set the average CPU to be at most 70%, then the cluster capacity provider will create a new EC2 instance for you when you create a new task to run
ECS data volumes
EC2 task strategies
- the EBS volume is already mounted onto the EC2 instances
- this allows your Docker containers to mount the EBS volume and extend the storage capacity of your task
- Problem: if your task moves from one EC2 instance to another, it won't have the same EBS volume and data, because the EBS volume remains attached to the old EC2 instance
- use cases:
- mount a data volume between different containers on the same instance
- extend the temporary storage of a task
EFS file systems
- works for both EC2 tasks and Fargate tasks
- ability to mount EFS volumes onto tasks
- tasks launched in any AZ will be able to share the same data in the EFS volume
- Fargate + EFS = serverless + data storage without managing servers
- use case: persistent multi AZ shared storage for your containers
Bind Mounts sharing data between containers
- works for both EC2 tasks (using local EC2 instance storage) and Fargate tasks (get 4GB for volume mounts)
- useful to share an ephemeral storage between multiple containers part of the same ECS task
- great for sidecar container pattern where the sidecar can be used to send metrics / logs to other destinations
Beanstalk
developer problems on AWS
- managing infrastructure
- deploying code
- configuring all the databases, load balancers, etc…
- scaling concerns
- most web apps have the same architecture (ALB + ASG)
- all the developers want is for their code to run
- possibly, consistently across different applications and environments
Elastic Beanstalk overview
- a developer centric view of deploying an application on AWS
- it uses all the components we have seen before: EC2, ASG, ELB, RDS…
- managed service
- automatically handles capacity provisioning, load balancing, scaling, application health monitoring, instance configuration…
- just the application code is the responsibility of the developer
- we still have full control over the configuration
- Beanstalk is free but you pay for the underlying instances
Components
- application: collection of Elastic Beanstalk components (environments, versions, configurations)
- application version: an iteration of your application code
- environment
- collection of AWS resources running an application version (only one application version at a time)
- tiers: web server environment tier and worker environment tier
- you can create multiple environments (dev, test, prod…)
- web environment
- uses an ELB with multiple EC2 instances running in different AZs and an ASG to scale
- worker environment
- uses an SQS queue with multiple EC2 instances running in different AZs; the ASG scales based on the SQS queue length
Beanstalk deployment options
All at once
- fastest deployment
- application has downtime
- great for quick iterations in development environment
- no additional cost
Rolling
- application is running below capacity
- can set the bucket size, the number of instances updated at a time
- application is running both versions simultaneously
- no additional cost
- long deployment
Rolling with additional batches
- application is running at capacity
- can set the bucket size
- application is running both versions simultaneously
- small additional cost
- additional batch is removed at the end of the deployment
- longer deployment
- good for production environment
Immutable
- zero downtime
- new code is deployed to new instances on a temporary ASG
- high cost, double capacity
- longest deployment
- quick rollback in case of failures (just terminate new ASG)
- great for production
Blue / Green
- not a direct feature of Elastic Beanstalk
- zero downtime and release facility
- create a new stage environment and deploy version 2 there
- the new environment (green) can be validated independently and rolled back if issues arise
- route 53 can be setup using weighted policies to redirect a little bit of traffic to the stage environment
- using Beanstalk, swap URLs when done with the environment test
Traffic splitting (Canary Testing)
- new application version is deployed to a temporary ASG with the same capacity
- a small percentage of traffic is sent to the temporary ASG for a configurable amount of time
- deployment health is monitored
- if there is a deployment failure, this triggers an automated roll back (very quick)
- no application downtime
- new instances are migrated from the temporary to the original ASG
- old application version is then terminated
Deploy using CLI
- describe dependencies
- package code as zip, and describe dependencies
- console: upload zip file (creates new app version), and then deploy
- CLI: create new app version using CLI (uploads zip), and then deploy
- Elastic Beanstalk will deploy the zip on each EC2 instance, resolve dependencies and start the application
Beanstalk lifecycle policy
- Elastic Beanstalk can store at most 1000 application versions
- if you don’t remove old versions, you won’t be able to deploy anymore
- to phase out old application versions, use a lifecycle policy
- based on time (old versions are removed)
- based on space (when you have too many versions)
- versions that are currently used won’t be deleted
- option not to delete the source bundle in S3 to prevent data loss
Beanstalk extensions
- a zip file containing our code must be deployed to Elastic Beanstalk
- all the parameters set in the UI can be configured with code using files
- requirements
- in the .ebextensions/ directory in the root of the source code
- YAML / JSON format
- .config extension (example: logging.config)
- able to modify some default settings using option_settings
- ability to add resources such as RDS, ElastiCache, DynamoDB, etc…
- resources managed by .ebextensions get deleted if the environment goes away
Beanstalk vs CloudFormation
- under the hood, Elastic Beanstalk relies on CloudFormation
- CloudFormation is used to provision other AWS services
- use case: you can define CloudFormation resources in your .ebextensions to provision ElastiCache, S3 buckets, or anything you want
Elastic Beanstalk cloning
- clone an environment with the exact same configuration
- useful for deploying a test version of your application
- all resources and configuration are preserved
- load balancer type and configuration
- RDS database type (but data is not preserved)
- environment variables
- after cloning an environment, you can change settings
Beanstalk migration
load balancer
- after creating an Elastic Beanstalk environment, you cannot change the ELB type
- to migrate to a different ELB
- create a new env with the same configuration except LB, create your new LB here
- deploy your application onto the new env
- perform a CNAME swap or Route 53 update so all your traffic can be directed to the new env
RDS
- RDS can be provisioned with Beanstalk, which is great for dev / test
- this is not great for production, as the database lifecycle is tied to the Beanstalk environment lifecycle
- the best for prod is to separately create an RDS database and provide our Beanstalk application with the connection string
- but what if you have already created a Beanstalk application with RDS in production? How do you migrate it to a new environment without RDS?
- create a snapshot of RDS DB (as a safeguard)
- go to the RDS console and protect the RDS database from deletion
- create a new environment, without RDS, point your application to the existing RDS in the old env
- perform a CNAME swap or Route 53 update, confirm it is working
- terminate the old env (RDS will not be deleted because you protected it from deletion in the console)
- delete the CloudFormation stack manually (it will be in DELETE_FAILED state because it can’t delete RDS)
single Docker
- run your application as a single Docker container
- either provide
- Dockerfile: Elastic Beanstalk will build and run the Docker container
- Dockerrun.aws.json (v1): describe where the Docker image is (already built)
- Beanstalk in single Docker container does not use ECS
Multi Docker containers
- multi docker helps run multiple containers per EC2 instance in EB
- this will create for you
- ECS cluster
- EC2 instances, configured to use the ECS cluster
- load balancer (in HA mode)
- task definitions and execution
- requires a config file Dockerrun.aws.json (v2) at the root of the source code
- Dockerrun.aws.json is used to generate the ECS task definition
Elastic Beanstalk and HTTPS
- Beanstalk with HTTPS
- idea: load the SSL certificate onto the load balancer
- can be done from the console (EB console, load balancer configuration)
- can be done from the code: .ebextensions/securelistener-alb.config
- SSL certificate can be provisioned using ACM or CLI
- must configure a security group rule to allow incoming port 443 (HTTPS port)
- Beanstalk redirect HTTP to HTTPS
- configure your instances to redirect HTTP to HTTPS
- configure the application load balancer with a rule
- make sure health checks are not redirected (so they keep giving 200 OK, otherwise they will receive 301 and 302…)
Web server vs worker environment
- if your application performs tasks that are long to complete, offload these tasks to a dedicated worker environment
- decoupling your application into two tiers is common
- example: processing a video, generating a zip file, etc…
- you can define periodic tasks in a cron.yaml file (see the sketch below)
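A minimal cron.yaml sketch for a worker environment; the task name, URL and schedule are assumptions:

```yaml
version: 1
cron:
  - name: "nightly-cleanup"     # assumed task name
    url: "/tasks/cleanup"       # path the worker daemon POSTs the scheduled message to
    schedule: "0 3 * * *"       # every day at 03:00 UTC
```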
custom platform (advanced)
- custom platforms are very advanced; they allow you to define from scratch
- the OS
- additional software
- scripts that Beanstalk runs on these platforms
- use case: app language is incompatible with Beanstalk and doesn’t use Docker
- to create your own platform
- define an AMI using Platform.yaml file
- build that platform using the Packer software (open source tool to create AMIs)
- custom platform vs Custom image
- custom image is to tweak an existing Beanstalk platform
- custom platform is to create an entirely new Beanstalk platform
CICD
Introduction
- we now know how to create resources in AWS manually
- we know how to interact with AWS CLI
- we have seen how to deploy code to AWS using Elastic Beanstalk
- all these manual steps make it very likely for us to make mistakes
- what we would like is to push our code to a repository and have it deployed onto AWS
- automatically
- the right way
- making sure it is tested before deploying
- with possibility to go into different stages
- with manual approval where needed
- to be a proper AWS developer, we need to learn AWS CICD
Continuous integration
- developers push the code to a code repository often (github, codecommit, bitbucket, etc…)
- a testing / build server checks the code as soon as it is pushed (codebuild, Jenkins CI, etc…)
- the developer gets feedback about the tests and checks that have passed / failed
- find bugs early, fix bugs
- deliver faster as the code is tested
- deploy often
Continuous delivery
- ensure that the software can be released reliably whenever needed
- ensures deployments happen often and are quick
- automated deployment
CodeCommit
- version control is the ability to understand the various changes that happened to the code over time
- all this is enabled by using a version control system such as git
- a git repository can live on one's machine, but it usually lives on a central online repository
- benefits are
- collaborate with other developers
- make sure the code is backed up somewhere
- make sure it is fully viewable and auditable
- hosted git repositories can be expensive
- the industry includes
- github
- bitbucket
- AWS CodeCommit
- private git repositories
- no size limit on repositories
- fully managed, HA
- code only in AWS, increased security and compliance
- secure
- integrated with Jenkins / CodeBuild / other CI tools
security
- interactions are done using git
- authentication in git
- SSH keys: AWS users can configure SSH keys in their IAM console
- HTTPS: done through the AWS CLI Authentication helper or by generating HTTPS credentials
- MFA can be enabled for extra security
- Authorization in git
- IAM policies manage user / roles rights to repositories
- encryption
- repositories are automatically encrypted at rest using KMS
- encrypted in transit (can only use HTTPS and SSH - both secure)
- cross account access
- do not share your SSH keys
- do not share your AWS credentials
- use IAM role in your AWS account and use AWS STS (with AssumeRole API)
CodeCommit vs Github
- Similarities
- both are git repositories
- both support code review
- github and CodeCommit can be integrated with AWS CodeBuild
- both support HTTPS and SSH method of authentication
- differences
- security
- github: github users
- codecommit: AWS IAM users / roles
- hosted:
- github: hosted by github
- github enterprise: self hosted on your servers
- codecommit: managed and hosted by AWS
- UI
- github UI is fully featured
notifications
- you can trigger notifications in CodeCommit using AWS SNS or AWS lambda or CloudWatch event rules
- use cases for notifications SNS / lambda
- deletion of branches
- trigger for pushes that happen in the master branch
- notify external build system
- trigger AWS lambda function to perform codebase analysis
- use cases for CloudWatch event rules
- trigger for pull request updates
- commit comment event
- CloudWatch event rules go into an SNS topic
CodePipeline
- Continuous delivery
- visual workflow
- source: github / codecommit / S3
- build: codebuild / Jenkins
- load testing: third party tools
- deploy: AWS code deploy / Beanstalk / CloudFormation / ECS
- made of stages
- each stage can have sequential actions or parallel actions
- stages examples: build / test / deploy / load test
- manual approval can be defined at any stage
artifacts
- each pipeline stage can create artifacts
- artifacts are stored in S3 and passed on to the next stage
troubleshooting
- codepipeline state changes happen in CloudWatch events, which can in turn create SNS notifications
- you can create events for failed pipelines
- you can create events for cancelled stages
- if codepipeline fails a stage, your pipeline stops and you can get information in the console
- CloudTrail can be used to audit AWS API calls
- if pipeline can’t perform an action, make sure the IAM service role attached does have enough permissions
CodeBuild
- fully managed build service
- alternative to other build tools such as Jenkins
- continuous scaling (no servers to manage or provision - no build queue)
- pay for usage: the time it takes to complete the builds
- leverages Docker under the hood for reproducible builds
- possibility to extend capabilities leveraging our own base Docker images
- secure: integration with KMS for encryption of build artifacts, IAM for build permissions, VPC for network security, CloudTrail for API calls logging
- source code from github / codecommit / codepipeline / S3
- build instructions can be defined in code (buildspec.yml file)
- output logs to S3 and AWS cloudwatch logs
- metrics to monitor codebuild statistics
- use cloudwatch alarms to detect failed builds and trigger notifications
- cloudwatch events / lambda as glue
- SNS notifications
- ability to reproduce codebuild locally to troubleshoot in case of errors
- builds can be defined within CodePipeline or CodeBuild itself
BuildSpec
- buildspec.yml file must be at the root of your code
- define environment variables
- plaintext variables
- secure secrets: use SSM parameter store
- phases
- install: install dependencies you may need for your build
- pre build: final commands to execute before build
- build: actual build commands
- post build: finishing touches (zip output)
- artifacts: what to upload to S3
- cache: files to cache to S3 for future build speedup
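A minimal buildspec.yml sketch following the phases above; the commands, variable names and SSM parameter path are assumptions:

```yaml
version: 0.2
env:
  variables:
    BUILD_ENV: "dev"                  # plaintext variable (assumed)
  parameter-store:
    DB_PASSWORD: "/myapp/db/password" # secret pulled from SSM Parameter Store (assumed path)
phases:
  install:
    commands:
      - npm install                   # install dependencies you may need for the build
  pre_build:
    commands:
      - npm test                      # final commands before the build
  build:
    commands:
      - npm run build                 # actual build commands
  post_build:
    commands:
      - zip -r app.zip dist/          # finishing touches (zip output)
artifacts:
  files:
    - app.zip                         # what to upload to S3
cache:
  paths:
    - node_modules/**/*               # cached to S3 to speed up future builds
```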
local build
- in case of need of deep troubleshooting beyond logs
- you can run CodeBuild on your laptop (after installing Docker)
- for this, leverage CodeBuild agent
CodeBuild in VPC
- by default, your Codebuild containers are launched outside your VPC
- therefore, by default, it cannot access resources in a VPC
- you can specify a VPC configuration
- VPC ID
- subnet ID
- security group ID
- then your build can access resources in your VPC
- use case: integration tests, data query, internal load balancers
CodeDeploy
- we want to deploy our application automatically to many EC2 instances
- these instances are not managed by Elastic Beanstalk
- there are several ways to handle deployments using open source tools (Ansible, Terraform, Chef, Puppet, etc…)
Steps
- Each EC2 machine (or on premises machine) must be running the CodeDeploy agent
- the agent is continuously polling AWS codeDeploy for work to do
- CodeDeploy sends appspec.yml file
- application is pulled from github or S3
- EC2 will run the deployment instructions
- CodeDeploy agent will report success / failure of the deployment on the instance
other information
- EC2 instances are grouped by deployment group (dev / test / prod)
- lots of flexibility to define any kind of deployments
- CodeDeploy can be chained into CodePipeline and use artifacts from there
- CodeDeploy can reuse existing setup tools, works with any application, and integrates with auto scaling
- Note: Blue / Green only works with EC2 instances (not on premises)
- support for AWS lambda deployments
- CodeDeploy does not provision resources
primary components
- application: unique name
- compute platform: EC2 or on premises or lambda
- deployment configuration: deployment rules for success / failures
- EC2 or on premises: you can specify the minimum number of healthy instances for the deployment
- lambda: specify how traffic is routed to your updated lambda function versions
- deployment group: group of tagged instances (allows to deploy gradually)
- deployment type: in place deployment or Blue/Green deployment
- IAM instance profile: need to give EC2 the permissions to pull from S3 / github
- application revision: application code + appspec.yml file
- service role: role for CodeDeploy to perform what it needs
- target revision: target deployment application version
Appspec
- files section: how to source and copy files from S3 / github to the filesystem
- hooks: set of instructions to run to deploy the new version (hooks can have timeouts)
- ApplicationStop
- DownloadBundle
- BeforeInstall
- AfterInstall
- ApplicationStart
- ValidateService: really important
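A minimal appspec.yml sketch for an EC2 / on-premises deployment; the destination path and script names are assumptions:

```yaml
version: 0.0
os: linux
files:
  - source: /                                # copy the whole revision bundle
    destination: /var/www/myapp              # assumed destination path
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh       # assumed script names
      timeout: 60
  AfterInstall:
    - location: scripts/configure.sh
      timeout: 60
  ApplicationStart:
    - location: scripts/start_server.sh
      timeout: 60
  ValidateService:
    - location: scripts/health_check.sh      # deployment fails if this script fails
      timeout: 120
```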
Deployment config
- Configs:
- one at a time: one instance at a time, if one instance fails => deployment stops
- half at a time: 50%
- all at once: quick but no healthy host, downtime, good for dev
- custom: min healthy host = 75%
- failures:
- instances stay in failed state
- new deployments will first be deployed to failed state instances
- to rollback: re-deploy old deployment or enable automated rollback for failures
- deployment targets
- set of EC2 instances with tags
- directly to an ASG
- mix of ASG / tags so you can build deployment segments
- customization in scripts with DEPLOYMENT_GROUP_NAME environment variables
CodeDeploy for EC2 and ASG
- CodeDeploy to EC2
- define how to deploy the application using appspec.yml + deployment strategy
- will do in place updates to your fleet of EC2 instances
- can use hooks to verify the deployment after each deployment phase
- CodeDeploy to ASG
- in place updates
- updates currently existing EC2 instances
- instances newly created by an ASG will also get automated deployments
- Blue / Green deployment
- a new auto scaling group is created (settings are copied)
- choose how long to keep the old instances
- must be using an ELB (for directing traffic to the new ASG)
CodeStar
- CodeStar is an integrated solution that regroups: github, codecommit, codebuild, codeDeploy, CloudFormation, codepipeline, cloudwatch
- helps quickly create CICD ready projects for EC2, lambda, Beanstalk
- supported languages: C#, Go, HTML5, Java, Node.js, PHP, Python, Ruby
- issue tracking integration with JIRA, Github issues
- ability to integrate with Cloud9 to obtain a web IDE
- one dashboard to view all your components
- free services, pay only for the underlying usage of other services
- limited customization
CloudFormation
infrastructure as code
- manual work will be very tough to reproduce
- in another region
- in another AWS account
- within the same region if everything was deleted
- CloudFormation would be the code to create / update / delete our infrastructure
- CloudFormation is a declarative way of outlining your AWS infrastructure, for any resources
- CloudFormation creates the resources for you in the right order, with the exact configuration that you specify
benefits
- infrastructure as code
- no resources are manually created, which is excellent for control
- the code can be version controlled, for example using git
- changes to the infrastructure are reviewed through code
- cost
- each resource within the stack is tagged with an identifier so you can easily see how much a stack costs you
- you can estimate the costs of your resources using the CloudFormation template
- savings strategy: in dev, you could automate deletion of templates at 5pm and recreation at 8am safely
- productivity
- ability to destroy and recreate an infrastructure on the cloud on the fly
- automated generation of diagrams for your templates
- declarative programming (no need to figure out ordering and orchestration)
- separation of concerns: create many stacks for many apps, and many layers
- don't reinvent the wheel
- leverage existing templates on the web
- leverage the documentation
how cloudformation works
- templates have to be uploaded to S3 and then referenced in CloudFormation
- to update a template, we can’t edit previous ones, we have to reupload a new version of the template to AWS
- stacks are identified by a name
- deleting a stack deletes every single artifact that was created by CloudFormation
deploying cloudformation template
- manual way
- editing templates in the CloudFormation designer
- using the console to input parameters
- automated way
- editing templates in a YAML file
- using the AWS CLI to deploy the templates
- recommended way when you fully want to automate your flow
building blocks
- templates components
- resources: your AWS resources declared in the template (mandatory)
- parameters: the dynamic inputs for your template
- mappings: the static variables for your templates
- outputs: references to what has been created
- conditionals: list of conditions to perform resource creation
- metadata
- template helpers
- references
- functions
resources
- resources are the core of your CloudFormation template
- they represent the different AWS components that will be created and configured
- resources are declared and can reference each other
- AWS figures out creation, updates and deletion of resources for us
- can I create a dynamic number of resources?
- no, you can't
- is every AWS service supported?
- almost, only a few are not
parameters
- parameters are a way to provide inputs to your AWS CloudFormation template
- they are important to know about if
- you want to reuse your templates across the company
- some inputs cannot be determined ahead of time
- parameters are extremely powerful, controlled, and can prevent errors from happening in your templates thanks to types
how to reference a parameter
- the Fn::Ref function can be leveraged to reference parameters
- parameters can be used anywhere in a template
- the shorthand for this in YAML is !Ref
- the function can also reference other elements within the template
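A short sketch of a parameter being referenced; the parameter name, allowed values and AMI ID are assumptions:

```yaml
Parameters:
  InstanceTypeParam:
    Type: String
    Default: t2.micro
    AllowedValues: [t2.micro, t2.small]      # type + allowed values help prevent errors
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: !Ref InstanceTypeParam   # !Ref is the shorthand for Fn::Ref
      ImageId: ami-12345678                  # placeholder AMI ID
```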
Pseudo parameters
- AWS offers us pseudo parameters in any CloudFormation template
- these can be used at any time and are enabled by default
mappings
- mappings are fixed variables within your CloudFormation template
- they are very handy to differentiate between different environments (dev vs prod), regions, AMI types, etc…
- all the values are hardcoded within the template
when would you use mappings vs parameters
- mappings are great when you know in advance all the values that can be taken and that they can be deduced from variables such as
- region
- AZ
- AWS account
- environment
- they allow safer control over the template
- use parameters when the values are really user specific
accessing mapping values
- we use Fn::FindInMap to return a named value from a specific key
- !FindInMap [MapName, TopLevelKey, SecondLevelKey]
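A sketch of a mapping resolved with !FindInMap; the map name and AMI IDs are placeholders:

```yaml
Mappings:
  RegionMap:
    us-east-1:
      AMI: ami-11111111        # placeholder
    eu-west-1:
      AMI: ami-22222222        # placeholder
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: !FindInMap [RegionMap, !Ref "AWS::Region", AMI]   # AWS::Region is a pseudo parameter
```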
outputs
- the outputs section declares optional outputs values that we can import into other stacks (if you export them first)
- you can also view the outputs in the AWS console or using the AWS CLI
- they are very useful for example if you define a network CloudFormation, and output the variables such as VPC ID, and your subnet IDs
- it is the best way to perform cross stack collaboration, as you let each stack handle its own part of the infrastructure
- you can’t delete a CloudFormation stack if its outputs are being referenced by another CloudFormation stack
outputs example
- create an SSH security group as part of one template
- we create an output that references that security group
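A possible shape of that output, assuming a security group resource named SSHSecurityGroup and an export name of your choice:

```yaml
Outputs:
  SSHSecurityGroupId:
    Description: Id of the SSH security group
    Value: !Ref SSHSecurityGroup     # assumed resource logical ID in this template
    Export:
      Name: SSHSecurityGroupId       # export name used by other stacks
```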
Cross stack reference
- we then create a second template that leverages that security group
- for this, we use the Fn::ImportValue function
- you can't delete the underlying stack until all the references are deleted too
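A sketch of the second template importing that export (same assumed export name as above; AMI ID is a placeholder):

```yaml
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-12345678                  # placeholder
      SecurityGroupIds:
        - !ImportValue SSHSecurityGroupId    # Fn::ImportValue of the exported name
```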
conditions
- conditions are used to control the creation of resources or outputs based on a condition
- conditions can be whatever you want them to be, but common ones are
- environment
- region
- parameter value
- each condition can reference another condition, parameter value or mapping
define a condition
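A sketch of a condition definition; the parameter name EnvType and condition name are assumptions:

```yaml
Parameters:
  EnvType:
    Type: String
    Default: dev
Conditions:
  CreateProdResources: !Equals [!Ref EnvType, prod]   # the logical ID (condition name) is your choice
```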
- the logical ID is for you to choose, it is how you name the condition
- the intrinsic function can be any of the following
Fn::And
Fn::Equals
Fn::If
Fn::Not
Fn::Or
using a condition
- conditions can be applied to resources / outputs
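Applying the condition above to a resource (same assumed CreateProdResources condition; MyInstance and MyVolume are assumed logical IDs):

```yaml
Resources:
  MountPoint:
    Type: AWS::EC2::VolumeAttachment
    Condition: CreateProdResources   # resource is only created when the condition is true
    Properties:
      InstanceId: !Ref MyInstance
      VolumeId: !Ref MyVolume
      Device: /dev/sdh
```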
Intrinsic functions
Fn::Ref
- can be leveraged to reference
- parameters
- resources
Fn::GetAtt
- attributes are attached to any resources you create
- to know the attributes of your resources, the best place to look at is the documentation
- example: AZ of an EC2 machine
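A sketch using Fn::GetAtt to read the AZ of an instance; the logical IDs and AMI ID are assumptions:

```yaml
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-12345678                                      # placeholder
  MyVolume:
    Type: AWS::EC2::Volume
    Properties:
      Size: 8
      AvailabilityZone: !GetAtt MyInstance.AvailabilityZone      # attributes are documented per resource type
```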
Fn::FindInMap
- we use Fn::FindInMap to return a named value from a specific key
- !FindInMap [MapName, TopLevelKey, SecondLevelKey]
Fn::ImportValue
- import values that are exported in other templates
Fn::Join
- join values with a delimiter
- !Join [delimiter, [a, b, c]]
Fn::Sub
- used to substitute variables from a text, it is a very handy function that will allow you to fully customize your templates
- for example: you can combine Fn::Sub with references or AWS Pseudo variables
- String must contain ${VariableName} and will substitute them
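A short sketch of Fn::Sub combining a resource reference and a pseudo parameter; MyBucket is an assumed S3 bucket resource in the same template:

```yaml
Outputs:
  BucketUrl:
    # substitutes the bucket name (ref) and the region (pseudo parameter) into the string
    Value: !Sub "https://${MyBucket}.s3.${AWS::Region}.amazonaws.com"
```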
Condition Functions
- the logical ID is for you to choose, it is how you name the condition
CloudFormation rollbacks
- stack creation fails
- default: everything rolls back, we can look at the log
- option to disable rollback and troubleshoot what happened
- stack update fails:
- the stack automatically rolls back to the previous known working state
- ability to see in the log what happened and error messages
ChangeSets
- when you update a stack, you need to know what changes before it happens for greater confidence
- ChangeSets won’t say if the update will be successful
Nested Stacks
- stacks as part of other stacks
- they allow you to isolate repeated patterns / common components in separate stacks and call them from other stacks
- Example:
- load balancer configuration that is re used
- security group that is re used
- nested stacks are considered best practice
- to update a nested stack, always update the parent (root stack)
Cross stack vs nested stack
-
Cross stacks
- helpful when stacks have different lifecycles
- use outputs export and Fn::ImportValue
- when you need to pass export values to many stacks
-
nested stacks
- helpful when components must be re used
- example: re use how to properly configure an application load balancer
- the nested stack is only important to the higher level stack
StackSets
- create, update, or delete stacks across multiple accounts and regions with a single operation
- administrator account to create StackSets
- trusted accounts to create, update, delete stack instances from StackSets
- when you update a stack set, all associated stack instances are updated throughout all accounts and regions
CloudFormation drift
- CloudFormation allows you to create infrastructure
- but it doesn't protect you against manual configuration changes
- how do we know if our resources have drifted?
- we can use CloudFormation drift detection
Monitoring
- AWS CloudWatch
- metrics: collect and track key metrics
- log: collect, monitor, analyze, and store log files
- events: send notifications when certain events happen in your AWS
- alarms: react in real-time to metrics / events
- X-Ray
- troubleshooting application performance and errors
- distributed tracing of microservices
- CloudTrail
- internal monitoring of API calls being made
- audit changes to AWS resources by your users
CloudWatch metrics
- CloudWatch provides metrics for every service in AWS
- a metric is a variable to monitor
- metrics belong to namespaces
- a dimension is an attribute of a metric
- up to 10 dimensions per metric
- metrics have timestamps
- can create CloudWatch dashboards of metrics
Detailed monitoring
- EC2 instances have metrics every 5 minutes
- with detailed monitoring, you get data every 1 minute
- use detailed monitoring if you want to scale faster for your ASG
- the AWS free tier allows us to have 10 detailed monitoring metrics
- NOTE: EC2 memory usage is by default not pushed (must be pushed from inside the instance as a custom metric)
CloudWatch Custom metrics
- possibility to define and send your own custom metrics to CloudWatch
- example: RAM usage, disk space, number of logged in users
- use API call PutMetricData
- ability to use dimensions (attribute) to segment metrics
- instance id
- environment name
- metric resolution
- standard: 60 seconds
- high resolution: 1 / 5 / 10 / 30 seconds - higher cost
- important: the API accepts metric data points two weeks in the past and two hours in the future (make sure to configure your EC2 instance time correctly)
CloudWatch logs
- applications can send logs to CloudWatch using the SDK
- CloudWatch can collect logs from
- Elastic Beanstalk: collection of logs from the application
- ECS: collection from containers
- AWS lambda: collection from function logs
- VPC flow logs: VPC specific logs
- API gateway
- CloudTrail based on filter
- CloudWatch log agents: for example on EC2 machines
- route 53: log DNS queries
- CloudWatch logs can go to
- batch exporter to S3 for archival
- stream to ElasticSearch cluster for further analysis
- CloudWatch logs can use filter expressions
- logs storage architecture
- log groups: arbitrary name, usually representing an application
- log stream: instances within application / log files / containers
- can define log expiration policies (never, 30 days, etc…)
- using the AWS CLI we can tail CloudWatch logs
- to send logs to CloudWatch, make sure IAM permissions are correct
- security: encryption of logs using KMS at the group level
CloudWatch logs for EC2
- by default, no logs from EC2 machine will go to CloudWatch
- you need to run a CloudWatch agent on EC2 to push the log files you want
- make sure IAM permissions are correct
- the CloudWatch log agent can be set up on premises too
CloudWatch logs agent vs Unified agent
- CloudWatch logs agent
- old version of the agent
- can only send to CloudWatch logs
- CloudWatch Unified agent
- collect additional system level metrics such as RAM, processes, etc…
- collect logs to send to CloudWatch logs
- centralized configuration using SSM parameters store
CloudWatch logs metric filter
- CloudWatch logs can use filter expressions
- for example: find a specific IP inside a log
- or count occurrences of ERROR in your logs
- metric filters can be used to trigger alarms
- filters do not retroactively filter data (they do not filter historical data; only data published after the filter was created is counted)
- filters only publish the metric data points for events that happen after the filter was created
- can be integrated with CloudWatch alarms, SNS, etc…
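As a sketch, a metric filter counting ERROR occurrences could be declared like this in CloudFormation; the log group name, metric name and namespace are assumptions:

```yaml
Resources:
  ErrorFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /myapp/production     # assumed log group
      FilterPattern: "ERROR"              # count occurrences of ERROR
      MetricTransformations:
        - MetricName: ErrorCount
          MetricNamespace: MyApp          # assumed namespace
          MetricValue: "1"                # each matching log event counts as 1
```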
CloudWatch alarms
- alarms are used to trigger notifications for any metric
- various options
- alarm states
- OK
- INSUFFICIENT_DATA
- ALARM
- period
- length of time in seconds to evaluate the metric
- high resolution custom metrics: 10 sec, 30 sec or multiple of 60 sec
CloudWatch alarm targets
- stop, terminate, reboot or recover EC2 instances
- trigger auto scaling action
- send notification to SNS
good to know
- alarms can be created based on CloudWatch logs metrics filters
- to test alarms and notifications, you could set the alarm state using CLI
CloudWatch Events
- event pattern: intercept events from AWS service
- EC2 instance state change, code build failure, S3…
- can intercept any API call with CloudTrail integration
- schedule or Cron
- A json payload is created from the event and passed to a target
CloudWatch event bridge
- EventBridge is the next evolution of CloudWatch events
- default event bus: generated by AWS service
- partner event bus: receive events from SaaS service or applications
- custom event bus: for your own applications
- event buses can be accessed by other AWS accounts
- rules: how to process the events (similar to CloudWatch event rules)
Schema registry
- eventBridge can analyze the events in your bus and infer the schema
- the schema registry allows you to generate code for your application that will know in advance how data is structured in the event bus
- schema can be versioned
Amazon EventBridge vs CloudWatch events
- EventBridge builds upon and extends CloudWatch events
- it uses the same service API and endpoint, and the same underlying service infrastructure
- EventBridge allows extension to add event buses for your custom applications and third party SaaS apps
- EventBridge has the schema registry capability
- EventBridge has a different name to mark the new capabilities
- over time, the CloudWatch events name will be replaced with EventBridge
X-Ray
- debugging in Production, the old way
- test locally
- add log statements everywhere
- re deploy in production
- log formats differ across applications, so using CloudWatch for analytics is hard
X-ray advantages
- troubleshooting performance
- understand dependencies in a microservices architecture
- pinpoint service issues
- review request behavior
- find errors and exceptions
tracing
- tracing is an end to end way of following a request
- each component dealing with the request adds its own trace
- tracing is made of segments
- annotations can be added to traces to provide extra information
- ability to trace
- every request
- sample request (as a percentage for example or a rate per minute)
- X-Ray security
- IAM for authorization
How to enable?
- Your code must import the AWS X-Ray SDK
- very little code modification needed
- the application SDK will then capture
- calls to AWS service
- HTTP / HTTPS requests
- database calls
- queue calls
- install X-Ray daemon or enable X-Ray AWS integration
- X-Ray daemon works as a low level UDP packet interceptor
- lambda / other AWS services already run the X-Ray daemon for you
- each application must have the IAM rights to write data to X-Ray
X-Ray magic
- X-Ray service collects data from all the different services
- service map is computed from all the segments and traces
- X-Ray is graphical, so even non technical people can help troubleshoot
X-Ray troubleshooting
- if X-Ray is not working on EC2
- ensure the EC2 IAM role has the proper permissions
- ensure the EC2 instance is running the X-Ray daemon
- to enable on AWS lambda
- ensure it has an IAM execution role with proper policy
- ensure that X-Ray is imported in the code
X-Ray instrumentation and concepts
- instrumentation means measuring the product's performance, diagnosing errors, and writing trace information
- to instrument your application code, you use the X-Ray SDK
- many SDK require only configuration changes
- you can modify your application code to customize and annotate the data that the SDK sends to X-Ray, using interceptors, filters, handlers, middleware…
X-Ray concepts
- segments: each application / service will send them
- subsegments: if you need more details in your segment
- trace: segments collected together to form an end to end trace
- sampling: decrease the amount of requests sent to X-Ray, reduce cost
- annotations: key value pairs used to index traces and use with filters
- metadata: key value pairs, not indexed, not used for searching
- the X-Ray daemon / agent has a config to send traces cross account
- make sure the IAM permissions are correct - the agent will assume the role
- this allows to have a central account for all your application tracing
X-Ray sampling rules
- with sampling rules, you control the amount of data that you record
- you can modify sampling rules without changing your code
- by default, the X-Ray SDK records the first request each second, and five percent of any additional requests
- one request per second is the reservoir, which ensures that at least one trace is recorded each second as long as the service is serving requests
- five percent is the rate at which additional requests beyond the reservoir size are sampled
X-Ray with Beanstalk
- Elastic Beanstalk platforms include the X-Ray daemon
- you can run the daemon by setting an option in the Elastic Beanstalk console or with a configuration file (in .ebextensions/xray-daemon.config)
- make sure to give your instance profile the correct IAM permissions so that the X-Ray daemon can function correctly
- then make sure your application code is instrumented with the X-Ray SDK
- note: the X-Ray daemon is not provided for multicontainer Docker
CloudTrail
- provides governance, compliance, and audit for your AWS account
- CloudTrail is enabled by default
- get a history of events / API calls made within your AWS account by
- console
- SDK
- CLI
- AWS services
- can put logs from CloudTrail into CloudWatch logs or S3
- a trail can be applied to All regions (default), or a single region
- if a resource is deleted in AWS, investigate CloudTrail first
CloudTrail events
- management events
- operations that are performed on resources in your account
- examples
- configuring security
- configuring rules for routing data
- setting up logging
- by default, trails are configured to log management events
- can separate read events (that don’t modify resources) and write events (that may modify resources)
- data events
- by default, data events are not logged (because they are high volume operations)
- S3 object-level activity
- can separate read and write events
- lambda function execution activity
CloudTrail insights
- enable CloudTrail insights to detect unusual activity in your account
- inaccurate resource provisioning
- hitting service limits
- bursts of IAM actions
- gaps in periodic maintenance activity
- CloudTrail insights analyzes normal management events to create a baseline
- and then continuously analyzes write events to detect unusual patterns
- anomalies appear in the CloudTrail console
- event is sent to S3
- an EventBridge event is generated
CloudTrail events retention
- events are stored for 90 days in CloudTrail
- to keep events beyond this period, log them to S3 and use Athena
CloudTrail vs CloudWatch vs X-Ray
- CloudTrail
- audit API calls made by users / services / AWS console
- useful to detect unauthorized calls or root cause of changes
- CloudWatch
- metrics over time for monitoring
- logs for storing application log
- alarms to send notifications in case of unexpected metrics
- X-Ray
- automated trace analysis and central service map visualization
- latency, errors and fault analysis
- request tracking across distributed systems
SQS
Communications between applications
- Synchronous
- synchronous communication between applications can be problematic if there are sudden spikes of traffic
- what if you need to suddenly encode 1000 videos but usually it is 10?
- asynchronous
- it is better to decouple your applications
- SQS: queue model
- SNS: pub/sub model
- Kinesis: real time streaming model
- these services can scale independently from our application
SQS - standard queue
- fully managed service, used to decouple applications
- attributes
- unlimited throughput, unlimited number of messages in queue
- default retention of messages: 4 days, maximum of 14 days
- low latency
- limitation of 256 KB per message sent
- can have duplicate messages (at least once delivery, occasionally)
- can have out of order messages (best effort ordering)
Producing messages
- produced to SQS using the SDK (SendMessage API)
- the message is persisted in SQS until a consumer deletes it
consuming messages
- consumers (running on EC2 instances, servers, or lambda)
- poll SQS for messages (receive up to 10 messages at a time)
- process the messages (example: insert the message into an RDS database)
- delete the messages using the DeleteMessage API
multiple EC2 instances consumers
- consumers receive and process messages in parallel
- at least once delivery
- best effort message ordering
- consumers delete messages after processing them
- we can scale consumers horizontally to improve throughput of processing (using ASG)
security
- encryption
- in flight encryption using HTTPS API
- at rest encryption using KMS keys
- client side encryption if the client wants to perform encryption / decryption itself
- access controls: IAM policies to regulate access to SQS API
- SQS access policies: (similar to S3 bucket policies)
- useful for cross account access to SQS queues
- useful for allowing other services (SNS, S3…) to write to an SQS queue
message visibility timeout
- after a message is polled by a consumer, it becomes invisible to other consumers
- by default, the message visibility timeout is 30 seconds
- that means the message has 30 seconds to be processed
- after the message visibility timeout is over, the message is visible again in SQS
- if a message is not processed within the visibility timeout, it will be received by a consumer again, so it may be processed twice
- a consumer could call the ChangeMessageVisibility API to get more time
- if the visibility timeout is high and the consumer crashes, it will take a long time for the message to become visible again in the queue and be consumed by others
- if the visibility timeout is low, we may get duplicates
Dead letter queue
- if a consumer fails to process a message within the visibility timeout, the message goes back to the queue
- we can set a threshold of how many times a message can go back to the queue
- after the MaximumReceives threshold is exceeded, the message goes into a dead letter queue
- useful for debugging
- make sure to process the messages in the DLQ before they expire
- good to set a retention of 14 days in the DLQ
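A sketch of a queue with a DLQ and a max receive count threshold in CloudFormation; the logical IDs and values are assumptions:

```yaml
Resources:
  MyDLQ:
    Type: AWS::SQS::Queue
    Properties:
      MessageRetentionPeriod: 1209600        # 14 days, as recommended above
  MyQueue:
    Type: AWS::SQS::Queue
    Properties:
      VisibilityTimeout: 30
      RedrivePolicy:
        deadLetterTargetArn: !GetAtt MyDLQ.Arn
        maxReceiveCount: 3                   # after 3 failed receives, the message goes to the DLQ
```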
delay queue
- delay a message (consumers don’t see it immediately) up to 15 minutes
- default is 0 seconds (message is available right away)
- can set a default at queue level
- can override the default on send using the DelaySeconds parameter
long polling
- when a consumer requests messages from the queue, it can optionally wait for messages to arrive if there are none in the queue
- this is called long polling
- long polling decreases the number of API calls made to SQS while increasing the efficiency and latency of your application
- the wait time can be between 1 to 20 seconds
- long polling is preferable to short polling
- long polling can be enabled at the queue level or at the API level using WaitTimeSeconds
SQS extended client
- message size limit is 256 KB, how to send large messages?
- using the SQS extended client (Java Library)
- the pattern can be implemented in any language: first upload the large object to S3
- then send the metadata of that object to SQS; once the consumer receives the metadata, it fetches the real object from S3
FIFO queue
- First in first out
- limited throughput: 300 messages / second without batching, 3000 messages / second with batching
- exactly once send capability (by removing duplicates)
- messages are processed in order by the consumer
FIFO deduplication
- deduplication interval is 5 minutes
- two deduplication methods
- content based deduplication: will do a SHA-256 hash of the message body
- explicitly provide a message deduplication ID
- if the queue receives messages with the same hash key or the same deduplication ID, it will refuse to receive the message
message grouping
- if you specify the same value of MessageGroupID in an SQS FIFO queue, you can only have one consumer, and all the messages are in order
- to get ordering at the level of a subset of messages, specify different values for MessageGroupID
- messages that share a common message group ID will be in order within the group
- each group ID can have a different consumer (parallel processing)
- ordering across groups is not guaranteed
SNS
- what if you want to send one message to many receivers?
- the event producer only sends the message to one SNS topic
- as many event receivers as we want can listen to the SNS topic notifications
- each subscriber to the topic will get all the messages (note: new feature to filter messages)
- up to 10 million subscriptions per topic
- 100k topics limit
- subscribers can be
subscribers can be
- SQS
- HTTP / HTTPS
- lambda
- emails
- SMS messages
- mobile notifications
- SNS integrates with a lot of AWS services
- many AWS services can send data directly to SNS for notifications
- CloudWatch alarms
- ASG notifications
- S3
- CloudFormation (upon state changes => failed to build etc…)
How to publish
- topic publish (using the SDK)
- create a topic
- create a subscription
- publish to the topic
- direct publish (for mobile apps SDK)
- create a platform application
- create a platform endpoint
- publish to the platform endpoint
- works with Google GCM, Apple APNS, Amazon ADM…
security
- encryption
- in flight encryption using HTTPS API
- at rest encryption using KMS keys
- client side encryption if the client wants to perform encryption / decryption itself
- access controls: IAM policies to regulate access to the SNS API
- SNS access policies (similar to S3 bucket policies)
- useful for cross account access to SNS topics
- useful for allowing other services to write to an SNS topic
SNS + SQS: Fan out
- push once in SNS, receive in all SQS queues that are subscribers
- fully decoupled, no data loss
- SQS allows for: data persistence, delayed processing and retries of work
- ability to add more SQS subscribers over time
- make sure your SQS queue access policy allows for SNS to write
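A sketch of the queue access policy that lets the SNS topic write to the queue; MyQueue and MyTopic are assumed resources in the same template:

```yaml
Resources:
  FanoutQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues:
        - !Ref MyQueue                       # assumed AWS::SQS::Queue resource
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal:
              Service: sns.amazonaws.com
            Action: sqs:SendMessage
            Resource: !GetAtt MyQueue.Arn
            Condition:
              ArnEquals:
                aws:SourceArn: !Ref MyTopic  # assumed AWS::SNS::Topic resource
```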
S3 events to multiple queues
- for the same combination of: event type and prefix, you can only have one S3 event rule
- if you want to send the same S3 event to many SQS queues, use fanout (SNS + SQS)
SNS - FIFO
- similar features as SQS FIFO
- ordering by message group ID
- deduplication using a deduplication ID or Content based deduplication
- can only have SQS FIFO queues as subscribers
- limited throughput (same throughput as SQS FIFO)
message filtering
- JSON policy used to filter messages sent to SNS topic’s subscriptions
- if a subscription doesn’t have a filter policy, it receives every message
Kinesis
Kinesis data streams
- billing is per shard provisioned, can have as many shards as you want
- retention between 1 to 365 days
- ability to reprocess data (because data will not be deleted by consumer, it stays in Kinesis data streams until retention period is over)
- once data is inserted in Kinesis, it can’t be deleted (immutability)
- data that shares the same partition goes to the same shard (shard level ordering)
- producers: AWS SDK, Kinesis Producer Library (KPL), Kinesis agent
- consumers
- write your own: Kinesis Client Library (KCL), AWS SDK
- managed: AWS lambda, Kinesis data firehose, Kinesis data analytics
Kinesis data streams security
- control access / authorization using IAM policies
- encryption in flight using HTTPS
- encryption at rest using KMS
- you can implement encryption / decryption of data on client side
- VPC endpoints available for Kinesis to access within VPC (e.g. EC2 instance in private subnet access Kinesis data stream using VPC endpoint)
- monitor API calls using CloudTrail
Kinesis consumers
Kinesis consumer types
| Shared fan-out consumer - pull | Enhanced fan-out consumer - push |
|---|---|
| low number of consuming applications | multiple consuming applications for the same stream |
| read throughput: 2 MB / second per shard across all consumers | 2 MB / second per consumer per shard |
| max 5 GetRecords API calls / sec | - |
| latency ~200 ms | latency ~70 ms |
| minimize cost | higher cost |
| consumers poll data from Kinesis using the GetRecords API call | Kinesis pushes data to consumers over HTTP/2 |
| returns up to 10 MB or up to 10000 records | soft limit of 5 consumer applications per data stream |
Kinesis Client library (KCL)
- a Java library that helps read records from a Kinesis Data Stream with distributed applications sharing the read workload
- each shard is to be read by only one KCL instance
- e.g. 4 shards => max 4 KCL instances
- progress is checkpointed into DynamoDB (needs IAM access from the KCL instance to DynamoDB); this means if one KCL instance is down, DynamoDB keeps the checkpoint and knows where to resume when the KCL instance comes back up
- track other workers and share the work amongst shards using DynamoDB
- KCL can run on EC2, elastic Beanstalk and on premises
- records are read in order at the shard level
- versions
- KCL 1.x (supports shared consumer)
- KCL 2.x (supports shared and enhanced fanout consumer)
Kinesis operations
Shard splitting
- used to increase the Stream capacity
- used to divide a hot shard
- the old shard is closed and will be deleted once the data is expired (until the retention period is over)
- no automatic scaling (manually increase / decrease capacity)
- can’t split into more than two shards in a single operation
merging shards
- decrease the Stream capacity and save costs
- can be used to group two shards with low traffic
- old shards are closed and will be deleted once the data is expired
- can’t merge more than two shards in a single operation
Kinesis data firehose
- fully managed service, no administration, automatic scaling, serverless
- target: redshift, S3, ElasticSearch
- third party
- custom HTTP endpoint
- pay for data going through firehose
- near real time
- 60 seconds latency minimum for non full batches
- or minimum 32 MB of data at a time
- it is not real time because it will batch the data into 60 seconds of data or 32MB of data
- supports many data formats, conversions, transformations, compression
- supports custom data transformations using AWS lambda
- can send failed or all data to a backup S3 bucket
Kinesis data streams vs Firehose
| Kinesis data streams | Kinesis data firehose |
|---|---|
| streaming service for ingest at scale | load streaming data into S3 / redshift / ElasticSearch / third party / custom HTTP |
| write custom code (producer / consumer) | fully managed |
| real time (~200 ms) | near real time (60 seconds or 32 MB) |
| manage scaling (shard splitting / shard merging) | automatic scaling |
| data storage for 1 to 365 days | no data storage |
| supports replay capability | doesn't support replay capability |
Kinesis data analytics (SQL application)
- perform real time analytics on Kinesis streams using SQL
- fully managed, no server to provision
- automatic scaling
- real time analytics
- pay for actual consumption rate
- can create streams out of the real time queries
- use cases
- time series analytics
- real time dashboards
- real time metrics
SQS vs SNS vs Kinesis
SQS
- consumer pull data
- data is deleted after being consumed
- can have as many workers as we want
- no need to provision throughput
- ordering guarantees only on FIFO queues
- individual message delay capability
SNS
- push data to many subscribers
- data is not persisted (lost if not delivered)
- pub/sub
- no need to provision throughput
- integrates with SQS for fanout architecture pattern
- FIFO capability for SQS FIFO
Kinesis
- standard: pull data, 2 MB per shard
- enhanced fanout: push data, 2 MB per shard per consumer
- possibility to replay data
- meant for real time big data, analytics and ETL
- ordering at the shard level
- data expires after X days
- must provision throughput
Kinesis vs SQS ordering
- let’s assume 100 trucks, 5 kinesis shards, 1 SQS FIFO
- Kinesis data streams
- on average you will have 20 trucks per shard
- trucks will have their data ordered within each shard
- the maximum number of consumers we can have in parallel is 5
- SQS FIFO
- you only have one SQS FIFO queue
- you will have 100 group IDs
- you can have up to 100 consumers (due to the 100 group IDs)
- you have up to 300 messages per second (or 3000 if using batching, since a batch can contain up to 10 messages)
Lambda
what is serverless
- serverless is a new paradigm in which the developers don’t have to manage servers anymore
- they just deploy code
- serverless was pioneered by AWS lambda but now also includes anything that is managed: databases, messaging, storage, etc…
- serverless does not mean there are no servers, it means you just don’t manage / provision / see them
serverless in AWS
- lambda
- DynamoDB
- Cognito
- API Gateway
- S3
- SNS and SQS
- Kinesis data firehose
- Aurora serverless
- Step functions
- Fargate
Lambda synchronous invocations
- synchronous: CLI, SDK, API Gateway, ALB
- results is returned right away
- error handling must happen client side
lambda integration with ALB
- to expose a lambda function as an HTTP endpoint
- you can use the ALB or an API gateway
- the lambda function must be registered in a target group
- ALB will convert the HTTP request to a JSON event and convert the JSON response back to HTTP
ALB multi header values
- ALB can support multi value headers
- when you enable multi value headers, HTTP headers and query string parameters that are sent with multiple values are shown as arrays within the AWS lambda event and response objects
lambda@Edge
- you have deployed a CDN using CloudFront
- what if you wanted to run a global lambda alongside?
- or how to implement request filtering before reaching your application?
- for this, you can use Lambda@Edge: deploy lambda functions alongside your CloudFront CDN
- build more responsive applications
- you don't manage servers, lambda is deployed globally
- customize the CDN content
- pay only for what you use
- you can use lambda to change CloudFront requests and responses
- after CloudFront receives a request from a viewer
- before CloudFront forwards the request to the origin
- after CloudFront receives the response from the origin
- before CloudFront forwards the response to the viewer
- you can also generate responses to viewers without ever sending the request to the origin
lambda - asynchronous invocations
- S3, SNS, CloudWatch events
- the events are placed in an event queue
- lambda attempts to retry on errors
- 3 tries total
- 1 minute after first, then 2 minutes wait
- make sure the processing is idempotent (result is the same after retry)
- if the function is retried, you will see duplicate log entries in CloudWatch logs
- can define a DLQ - SNS or SQS - for failed processing (need correct IAM permissions for lambda to write to SQS)
- asynchronous invocations allow you to speed up the processing if you don’t need to wait for the result
lambda event source mapping
- Kinesis data Streams and DynamoDB Streams
- SQS and SQS FIFO queue
- common denominator: records need to be pulled from the source
- your lambda function is invoked synchronously
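A sketch of an event source mapping declared in CloudFormation, here for an SQS queue; MyFunction and MyQueue are assumed resources:

```yaml
Resources:
  MyEventSourceMapping:
    Type: AWS::Lambda::EventSourceMapping
    Properties:
      FunctionName: !Ref MyFunction        # assumed AWS::Lambda::Function resource
      EventSourceArn: !GetAtt MyQueue.Arn  # assumed AWS::SQS::Queue resource
      BatchSize: 10                        # up to 10 messages per batch for SQS
      Enabled: true
```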
Streams and lambda (Kinesis and DynamoDB)
- an event source mapping creates an iterator for each shard, processes items in order
- start with new items, from the beginning or from timestamp
- processed items aren’t removed from the stream (other consumers can read them again)
- if traffic is low, we can use batch window to accumulate records before processing
- you can process multiple batches in parallel
- up to 10 batches per shard
- in order processing is still guaranteed for each partition key
Streams and lambda - error handling
- by default, if your function returns an error, the entire batch is reprocessed until the function succeeds, or the items in the batch expire
- to ensure in order processing, processing for the affected shard is paused until the error is resolved
- you can configure the event source mapping to
- discard old events
- restrict the number of retries
- split the batch on error (to work around lambda timeout issue, maybe there is not enough time to process the whole batch, so we split the batch to make it small and faster to process)
- discarded events can go to a Destination
SQS and SQS FIFO with lambda
- the event source mapping will poll SQS (long polling)
- specify batch size (1 to 10 messages)
- recommended: set the queue visibility timeout to 6x the timeout of your lambda function
- to use a DLQ
- set it up on the SQS queue, not on lambda (the lambda DLQ is only for async invocations)
- or use a lambda Destination for failures
- lambda also supports in order processing for FIFO queues, scaling up to the number of active message groups
- for standard queues, items aren't necessarily processed in order
- lambda scales up to process a standard queue as quickly as possible
- when an error occurs, batches are returned to the queue as individual items and might be processed in a different grouping than the original batch
- occasionally, the event source mapping might receive the same item from the queue twice, even if no function error occurred
- lambda deletes items from the queue after they are processed successfully
- you can configure the source queue to send items to a DLQ if they can't be processed
lambda event mapper scaling
- Kinesis data streams and DynamoDB streams
- one lambda invocation per stream shard
- if you use parallelization, up to 10 batches processed per shard simultaneously
- SQS standard
- lambda adds 60 more instances per minute to scale up
- up to 1000 batches of messages processed simultaneously
- SQS FIFO
- messages with the same group ID will be processed in order
- the lambda function scales to the number of active message groups
lambda - Destinations
- for asynchronous invocations, we can define destinations for successful and failed events
- SQS
- SNS
- lambda
- EventBridge bus
- note: AWS recommends you use Destinations instead of DLQ now (but both can be used at the same time)
lambda permissions - IAM roles and resource policies
- lambda execution role
- grants the lambda function permissions to AWS services / resources
- when you use an event source mapping to invoke your function, lambda uses the execution role to read event data (e.g. lambda need permission to pull messages from SQS)
- lambda resource based policies
- use resource based policies to give other accounts and AWS services permission to use your lambda resources
- similar to S3 bucket policies for S3 bucket
- an IAM principal can access lambda
- if the IAM policy attached to the principal authorizes it (user access)
- or if the resource based policy authorizes (service access)
- when an AWS service like S3 calls your lambda function, the resource based policy gives it access
lambda environment variables
- environment variable = key / value pair in string form
- adjust the function behavior without updating code
- the environment variables are available to your code
- lambda service adds its own system environment variables as well
- helpful to store secrets (encrypted by KMS)
- secrets can be encrypted by the lambda service key, or your own CMK
lambda logging and monitoring
- CloudWatch logs
- lambda execution logs are stored in AWS CloudWatch logs
- make sure your AWS lambda function has an execution role with an IAM policy that authorizes writes to CloudWatch logs
- CloudWatch metrics
- lambda metrics are displayed in AWS CloudWatch metrics
- invocations, Durations, concurrent executions
- error count, success rates, throttles
- async delivery failures
- iterator age (lagging for Kinesis and DynamoDB streams)
lambda tracing with X-Ray
- enable in lambda configuration (active tracing)
- runs the X-Ray daemon for you
- use AWS X-Ray SDK in code
- ensure lambda function has a correct IAM execution role to write to X-Ray
- the managed policy is called AWSXRayDaemonWriteAccess
lambda in VPC
lambda by default
- by default, your lambda function is launched outside your own VPC (in an AWS owned VPC)
- therefore it cannot access resources in your VPC
lambda in VPC
- you must define the VPC ID, the subnets and the security groups
- lambda will create an ENI in your subnets
- lambda needs the AWSLambdaVPCAccessExecutionRole managed policy
internet access
- a lambda function in your VPC does not have internet access
- deploying a lambda function in a public subnet does not give it internet access or a public IP
- deploying a lambda function in a private subnet gives it internet access if you have a NAT gateway / NAT instance
- you can use VPC endpoints to privately access AWS services without a NAT
lambda function performance
configuration
- RAM
- from 128MB to 3008MB in 64MB increments
- the more RAM you add, the more vCPU credits you get
- at 1792MB, a function has the equivalent of one full vCPU
- after 1792MB, you get more than one CPU, and need to use multi threading in your code to benefit from it
- if your application is CPU-bound (computation heavy), increase RAM
- timeout: default 3 seconds, maximum is 900 seconds
lambda execution context
- the execution context is a temporary runtime environment that initializes any external dependencies of your lambda code
- great for database connections, HTTP clients, SDK clients…
- the execution context is maintained for some time in anticipation of another lambda function invocation
- the next function invocation can reuse the context and save time initializing connection objects (e.g. establish the database connection outside of the function handler)
- the execution context includes the /tmp directory
lambda function /tmp space
- if your lambda function needs to download a big file to work
- if your lambda function needs disk space to perform operations
- you can use the /tmp directory
- max size is 512 MB
- the directory content remains when the execution context is frozen, providing a transient cache that can be used across multiple invocations (helpful to checkpoint your work)
- for permanent persistence of objects, use S3
lambda concurrency
- concurrency limit: up to 1000 concurrent executions across the entire account; if one of your lambda functions takes up all the concurrency (and you didn't set a reserved concurrency limit), the other lambda functions will be throttled
- can set a reserved concurrency at the function level
- each invocation over the concurrency limit will trigger a throttle
- throttle behavior
- if synchronous invocation = return throttle error 429
- if asynchronous invocation = retry automatically and then go to DLQ
- if you need a higher limit, open a support ticket
lambda concurrency and asynchronous invocations
- if the function doesn’t have enough concurrency available to process all events, additional requests are throttled
- for throttling errors and system errors, lambda returns the event to the queue and attempts to run the function again for up to 6 hours
- the retry interval increases exponentially from 1 second after the first attempt to a maximum of 5 minutes
Cold start and provisioned concurrency
- cold start
- new instance => code is loaded and code outside the handler is run (init)
- if the init is large, this process can take some time
- first request served by new instances has higher latency than the rest
- provisioned concurrency
- concurrency is allocated before the function is invoked (in advance)
- so the cold start never happens and all invocations have low latency
- application auto scaling can manage concurrency
lambda external dependencies
- if your lambda function depends on external libraries
- for example AWS X-Ray SDK, database client, etc…
- you need to install the packages alongside your code and zip it together
- upload the zip straight to lambda if less than 50MB, else to S3 first and reference from S3
- native libraries work: they need to be compiled on Amazon Linux
- AWS SDK comes by default with every lambda function
lambda and CloudFormation
inline
- inline functions are very simple
- use the Code.ZipFile property (see the sketch below)
- you cannot include function dependencies with inline functions
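A sketch of an inline function defined with Code.ZipFile; the runtime, role resource and handler body are placeholders:

```yaml
Resources:
  InlineFunction:
    Type: AWS::Lambda::Function
    Properties:
      Runtime: python3.9
      Handler: index.handler
      Role: !GetAtt MyLambdaRole.Arn         # assumed IAM role resource
      Code:
        ZipFile: |
          def handler(event, context):
              # trivial placeholder code, no external dependencies allowed inline
              return {"statusCode": 200}
```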
through S3
- you must store the lambda zip in S3
- you must refer the S3 zip location in the CloudFormation code
- S3 bucket
- S3 key: full path to zip
- S3 object version: if versioned bucket
- if you update the code in S3, but don’t update S3 bucket, S3 key or S3 object version, CloudFormation won’t update your function because it will not detect the change
lambda layers
- externalize dependencies to re use them
lambda container images
- deploy lambda function as container images of up to 10GB from ECR
- pack complex dependencies, large dependencies in a container
- base images are available
- can create your own image as long as it implements the lambda runtime API
- test the containers locally using the lambda runtime interface emulator
- unified workflow to build apps
lambda versions and aliases
lambda versions
- when we work on a lambda function, we work on $LATEST, which is an unpublished, mutable version
- when we are ready to publish a lambda function, we create a version
- versions are immutable
- versions have increasing version numbers
- versions get their own ARN
- version = code + configuration
- each version of the lambda function can be accessed
lambda aliases
- aliases are pointers to lambda function versions
- we can define dev, test and prod aliases and have them point at different lambda versions
- aliases are mutable
- aliases enable Blue / Green deployment by assigning weights to lambda functions
- aliases enable stable configuration of our event triggers / destinations
- aliases have their own ARNs
- aliases cannot reference other aliases
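A minimal boto3 sketch of publishing a version and weighting an alias for blue/green traffic shifting (function name, alias name and version numbers are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Publish the current $LATEST code + configuration as an immutable version.
new_version = lambda_client.publish_version(FunctionName="my-function")["Version"]

# Keep the "prod" alias on the stable version, but send 10% of traffic
# to the newly published version (weighted alias = blue/green shifting).
lambda_client.update_alias(
    FunctionName="my-function",
    Name="prod",
    FunctionVersion="1",  # placeholder: current stable version
    RoutingConfig={"AdditionalVersionWeights": {new_version: 0.1}},
)
```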
lambda and CodeDeploy
- CodeDeploy can help you automate traffic shift for lambda aliases
- feature is integrated within the SAM framework
- linear
- grow traffic every N minutes until 100%
- canary
- try X percent then 100%
- AllAtOnce
- immediate
- can create pre and post traffic hooks to check the health of the lambda function
lambda limits good to know - per region
- memory allocation: 128MB - 10 GB
- maximum execution time: 15 minutes
- environment variables: 4KB
- disk capacity in the function container (/tmp): 512 MB
- concurrent executions: 1000
- lambda function deployment size (zipped): 50 MB
- size of uncompressed deployment (code + dependencies): 250 MB
- can use the /tmp directory to load other files at startup
lambda best practices
- perform heavy duty work outside of your function handler
- connect to databases
- initialize the SDK
- pull in dependencies
- use environment variables for
- database connection strings, S3 buckets, etc…
- passwords, sensitive values
- minimize your deployment package size to its runtime necessities
- break down the function
- remember lambda limits
- use Layers where necessary
- avoid using recursive code; never have a lambda function call itself
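As a sketch of the first two best practices (heavy initialization outside the handler, configuration via environment variables), assuming a hypothetical TABLE_NAME variable:

```python
import os
import boto3

# Done once per execution context (outside the handler), so the SDK client
# and table reference are reused across warm invocations.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(os.environ["TABLE_NAME"])  # configuration via env var

def handler(event, context):
    # The handler only does the per-request work.
    table.put_item(Item={"id": event["id"]})
    return {"status": "ok"}
```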
DynamoDB
NoSQL database
- non-relational, distributed databases
- include MongoDB, DynamoDB…
- do not support query joins (or just limited support)
- all the data that is needed for a query is present in one row
- don’t perform aggregations such as SUM, AVG…
- scale horizontally
- there is no right or wrong for NoSQL or SQL, they just require to model the data differently and think about user queries differently
Amazon DynamoDB
- fully managed, highly available with replication across multiple AZ
- NoSQL database
- scales to massive workloads, distributed database
- millions of requests per second, trillions of rows, 100s of TB of storage
- fast and consistent in performance (low latency on retrieval)
- integrated with IAM for security, authorization and administration
- enables event driven programming with DynamoDB streams
- low cost and auto scaling capabilities
basics
- DynamoDB is made of Tables
- each table has a Primary Key (must be decided at creation time)
- each table can have an infinite number of items
- each item has attributes (can be added over time - can be null)
- maximum size of an item is 400KB
- data types supported are:
- scalar types: String, Number, Binary, Boolean, Null
- Document types: List, Map
- Set Types: String Set, Number Set, Binary Set
- Primary keys
- Partition Key (HASH)
- partition key must be unique for each item
- partition key must be diverse so that the data is distributed
- Partition Key + Sort Key (HASH + RANGE)
- the combination must be unique for each item
- data is grouped by partition key
Read / Write capacity modes
- control how you manage your table’s capacity
- provisioned mode (default)
- you specify the number of reads/ writes per second
- you need to plan capacity beforehand
- pay for provisioned read / write capacity units
- on demand mode
- read / writes automatically scale up / down with your workloads
- no capacity planning needed
- pay for what you use, more expensive
- you can switch between different modes once every 24 hours
R/W capacity modes - provisioned
- table must have provisioned read and write capacity units
- read capacity units (RCU)
- write capacity units
- option to setup auto scaling of throughput to meet demand
- throughput can be exceeded temporarily using burst capacity
- if burst capacity has been consumed, you will get a ProvisionedThroughputExceededException
- it is then advised to do an exponential backoff retry
Write Capacity units (WCU)
- one WCU represents one write per second for an item up to 1KB in size
- if the items are larger than 1 KB, more WCUs are consumed
Strongly consistent read vs Eventually consistent read
- Eventually consistent read (default)
- if we read just after a write, it is possible we will get some stale data because of replication
- Strongly consistent read
- if we read just after a write, we will get the correct data
- set ConsistentRead parameter to True in API calls
- consumes twice the RCU
Read capacity units (RCU)
- one RCU represents one Strongly Consistent Read per second, or two Eventually consistent reads per second, for an item up to 4KB
- if the items are larger than 4KB, more RCUs are consumed
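A small worked example of the WCU / RCU arithmetic above (sizes are always rounded up to the next 1 KB for writes and 4 KB for reads):

```python
import math

def wcu(item_size_kb: float, writes_per_second: int) -> int:
    # 1 WCU = one write per second of an item up to 1 KB.
    return math.ceil(item_size_kb) * writes_per_second

def rcu(item_size_kb: float, reads_per_second: int, strong: bool) -> int:
    # 1 RCU = one strongly consistent read per second (or two eventually
    # consistent reads per second) of an item up to 4 KB.
    units = math.ceil(item_size_kb / 4) * reads_per_second
    return math.ceil(units if strong else units / 2)

print(wcu(2.5, 10))              # 30 WCUs: ceil(2.5 KB) = 3, x 10 writes/s
print(rcu(6, 10, strong=True))   # 20 RCUs: ceil(6/4) = 2, x 10 reads/s
print(rcu(6, 10, strong=False))  # 10 RCUs: eventually consistent = half
```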
Partitions Internal
- data is stored in partitions
- partition keys go through a hashing algorithm to know to which partition they go to
- WCUs and RCUs are spread evenly across partitions
Throttling
- if we exceed provisioned RCUs or WCUs, we get a ProvisionedThroughputExceededException
- reasons
- hot keys: one partition key is being read too many times (popular item)
- hot partitions
- very large items, remember RCU and WCU depends on size of items
- solutions
- exponential backoff when exception is encountered
- distribute partition keys as much as possible
- if RCU issue, we can use DynamoDB Accelerator (DAX)
on demand
- Read and writes automatically scale up and down with your workloads
- no capacity planning needed
- unlimited WCU and RCU, no throttle, more expensive
- you are charged for reads and writes that you use in terms of RRU and WRU
- read request units (RRU) - throughput for reads (same as RCU)
- write request units (WRU) - throughput for writes (same as WCU)
- 2.5x more expensive than provisioned capacity
- use cases: unknown workloads, unpredictable application traffic…
writing data
- PutItem
- creates a new item or fully replace an old item
- consumes WCUs
- UpdateItem
- edits an existing item’s attributes or adds a new item if it doesn’t exist
- can be used to implement Atomic Counters - a numeric attribute that is unconditionally incremented
- conditional writes
- accept a write / update / delete only if conditions are met, otherwise returns an error
- helps with concurrent access to items
- no performance impact
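A boto3 sketch of an atomic counter and a conditional write (table, key and attribute names are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Products")  # placeholder table

# UpdateItem as an atomic counter: unconditionally increment a number.
table.update_item(
    Key={"product_id": "p-123"},
    UpdateExpression="ADD view_count :inc",
    ExpressionAttributeValues={":inc": 1},
)

# Conditional write: only create the item if it does not exist yet.
try:
    table.put_item(
        Item={"product_id": "p-123", "price": 10},
        ConditionExpression="attribute_not_exists(product_id)",
    )
except ClientError as e:
    if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
        raise  # a failed condition is expected when the item already exists
```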
reading data
- GetItem
- read based on primary key
- primary key can be HASH or HASH + RANGE
- eventually consistent read
- option to use strongly consistent reads (more RCU - might take longer)
- ProjectionExpression can be specified to retrieve only certain attributes
reading data - query
- query returns items based on
- KeyConditionExpression
- partition key value - required
- sort key value - optional
- FilterExpression
- additional filtering after the query operation (before data returned to you)
- use only with non key attributes
- returns
- the number of items specified in limit
- or up to 1 MB of data
- ability to do pagination on the results
- can query table, a local secondary index, or a global secondary index
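A boto3 sketch of a Query with a key condition, a filter on a non-key attribute and pagination (table, key and attribute names are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource("dynamodb").Table("Orders")  # placeholder table

kwargs = dict(
    # partition key is required, sort key condition is optional
    KeyConditionExpression=Key("user_id").eq("u-1") & Key("order_date").gt("2023-01-01"),
    # applied on non-key attributes after the query, before data is returned
    FilterExpression=Attr("order_status").eq("SHIPPED"),
    ProjectionExpression="order_id, order_total",
    Limit=25,
)

items = []
response = table.query(**kwargs)
items.extend(response["Items"])
# Pagination: LastEvaluatedKey is present while more data (or the 1 MB cap) remains.
while "LastEvaluatedKey" in response:
    response = table.query(ExclusiveStartKey=response["LastEvaluatedKey"], **kwargs)
    items.extend(response["Items"])
```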
reading data - scan
- scan the entire table and then filter out data (inefficient)
- returns up to 1 MB of data - use pagination to keep on reading
- consumes a lot of RCU
- limit impact using Limit or reduce the size of the result and pause
- for faster performance, use parallel scan
- multiple workers scan multiple data segments at the same time
- increases the throughput and RCU consumed
- limit the impact of parallel scans just like you would for Scans
- can use ProjectionExpression and FilterExpression
- note: FilterExpression filtering is applied after the scan reads the items, so the RCUs for the full scan are still consumed
deleting data
- DeleteItem
- delete an individual item
- ability to perform a conditional delete
- DeleteTable
- delete a whole table and all its items
- much quicker deletion than calling DeleteItem on all items
batch operations
- allows you to save in latency by reducing the number of API calls
- operations are done in parallel for better efficiency
- part of a batch can fail, in which case we need to try again for the failed items
- BatchWriteItem
- up to 25 PutItem and DeleteItem in one call
- up to 16 MB of data written, up to 400KB of data per item
- can’t update items
- BatchGetItem
- return items from one or more tables
- up to 100 items, up to 16 MB of data
- items are retrieved in parallel to minimize latency
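A boto3 sketch of the two batch operations (the batch_writer helper wraps BatchWriteItem and retries unprocessed items for you; table and key names are placeholders):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Orders")  # placeholder table

# BatchWriteItem: up to 25 PutItem/DeleteItem per call, batched automatically.
with table.batch_writer() as batch:
    for i in range(100):
        batch.put_item(Item={"user_id": "u-1", "order_id": f"o-{i}"})

# BatchGetItem: up to 100 items / 16 MB across one or more tables.
resp = dynamodb.batch_get_item(
    RequestItems={
        "Orders": {
            "Keys": [
                {"user_id": "u-1", "order_id": "o-1"},
                {"user_id": "u-1", "order_id": "o-2"},
            ]
        }
    }
)
orders = resp["Responses"]["Orders"]
# resp["UnprocessedKeys"] should be retried with exponential backoff if non-empty.
```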
Local Secondary Index (LSI)
- alternative sort key for your table (use the same partition key)
- the sort key consists of one scalar attribute
- up to 5 local secondary indexes per table
- must be defined at table creation time
- attribute projections - can contain some or all the attributes of the base table
Global secondary index (GSI)
- alternative Primary key (HASH or HASH + RANGE) from the base table
- speed up queries on non key attributes
- the index key consists of scalar attributes
- attribute projections - some or all the attributes of the base table
- must provision RCUs and WCUs for the index
- can be added / modified after table creation
indexes and throttling
- GSI
- if the writes are throttled on the GSI, then the main table will be throttled
- even if the WCU on the main tables are fine
- choose your GSI partition key carefully
- assign your WCU capacity carefully
- LSI
- uses the WCUs and RCUs of the main table
- no special throttling considerations
Optimistic locking
- DynamoDB has a feature called Conditional Writes
- a strategy to ensure an item hasn’t changed before you update / delete it
- each item has an attribute that acts as a version number, and each update / delete request will change the value of the item, and also update the version number
- if two requests are sent at the same time, only the first will succeed; the second will be rejected because the version number has already changed
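A boto3 sketch of optimistic locking with a version-number attribute and a conditional write (table, key and attribute names are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Products")  # placeholder table

def update_price(product_id: str, new_price: int, expected_version: int) -> bool:
    """Update only if nobody changed the item since we read version `expected_version`."""
    try:
        table.update_item(
            Key={"product_id": product_id},
            UpdateExpression="SET #p = :p, #v = :new_v",
            ConditionExpression="#v = :expected_v",
            ExpressionAttributeNames={"#p": "price", "#v": "version"},
            ExpressionAttributeValues={
                ":p": new_price,
                ":new_v": expected_version + 1,
                ":expected_v": expected_version,
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # somebody else updated the item first; re-read and retry
        raise
```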
DynamoDB DAX
- fully managed, highly available, seamless in memory cache for DynamoDB
- microseconds latency for cached reads and queries
- doesn’t require application logic modification
- solves the hot key problem (too many reads)
- 5 minutes TTL for cache (default)
- up to 10 nodes in the cluster
- multi AZ
- secure
DAX vs ElastiCache
- DAX is for individual object cache and simple query and scan
- ElastiCache can store aggregation result and complex intermediate results
DynamoDB Streams
- ordered stream of item level modifications in a table
- stream records can be
- sent to Kinesis Data Streams
- read by AWS lambda
- read by Kinesis Client Library applications
- data retention for up to 24 hours
- use case
- react to changes in real time
- analytics
- insert into derivative tables
- insert into ElasticSearch
- implement cross region replication
- ability to choose the information that will be written to the stream
- KEYS_ONLY - only the key attributes of the modified item
- NEW_IMAGE - the entire item, as it appears after it was modified
- OLD_IMAGE - the entire item, as it appeared before it was modified
- NEW_AND_OLD_IMAGES - both the new and old images of the item
- DynamoDB streams are made of shards, just like Kinesis Data Streams, so the Kinesis KCL can be a consumer for DynamoDB Streams
- you don’t need to provision shards, this is automated by AWS
- records are not retroactively populated in a stream after enabling it
Streams and lambda
- you need to define an Event Source Mapping to read from DynamoDB streams
- you need to ensure the lambda function has the appropriate permissions
- your lambda function is invoked synchronously
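A minimal sketch of a lambda handler invoked by a DynamoDB Streams event source mapping (record fields follow the standard stream event format):

```python
def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]          # always present
        if record["eventName"] == "INSERT":
            # NewImage is present when the stream view type includes new images;
            # attribute values use DynamoDB JSON, e.g. {"S": "some-value"}.
            print("inserted:", keys, record["dynamodb"].get("NewImage"))
        elif record["eventName"] == "REMOVE":
            print("removed:", keys)
    # Raising an exception makes lambda retry the batch, since stream batches
    # are processed synchronously and in order per shard.
    return {"processed": len(event["Records"])}
```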
DynamoDB TTL
- automatically delete items after an expiry timestamp
- doesn’t consume any WCUs
- the TTL attribute must be a number data type with Unix Epoch timestamp value
- expired items deleted within 48 hours of expiration
- expired items that haven’t been deleted yet still appear in reads/queries/scans (if you don’t want them, filter them out)
- expired items are deleted from both LSIs and GSIs
- a delete operation for each expired item enters the DynamoDB streams (can help recover expired items)
- use cases: reduce stored data by keeping only current items, adhere to regulatory obligations, user sessions…
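A boto3 sketch of enabling TTL and writing an item that expires in two hours (table and attribute names are placeholders):

```python
import time
import boto3

client = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table("Sessions")  # placeholder table

# Tell DynamoDB which Number attribute holds the Unix epoch expiry timestamp.
client.update_time_to_live(
    TableName="Sessions",
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

# Session item that becomes eligible for deletion in 2 hours.
table.put_item(Item={
    "session_id": "s-123",
    "user_id": "u-1",
    "expires_at": int(time.time()) + 2 * 3600,
})
```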
DynamoDB CLI
- --projection-expression: one or more attributes to retrieve
- --filter-expression: filter items before they are returned to you
- general CLI pagination options
- --page-size: the CLI still retrieves the full list of items, but with a larger number of API calls instead of one API call
- --max-items: max number of items to show in the CLI (returns NextToken)
- --starting-token: specify the last NextToken to retrieve the next set of items
DynamoDB transactions
- coordinated, all or nothing operations on multiple items across one or more tables
- provides Atomicity, Consistency, Isolation, and Durability (ACID)
- read modes - Eventual consistency, strong consistency, transactional
- write modes - standard, transactional
- consumes 2x WCUs and 2x RCUs
- two operations
- TransactGetItems - one or more GetItem operations
- TransactWriteItems - one or more PutItem, UpdateItem, DeleteItem operations
- use cases: financial transactions, managing orders, multi player games…
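A boto3 sketch of TransactWriteItems: both operations succeed or neither does (table, key and attribute names are placeholders; the low-level client uses DynamoDB JSON):

```python
import boto3

client = boto3.client("dynamodb")

client.transact_write_items(
    TransactItems=[
        {   # record the order...
            "Put": {
                "TableName": "Orders",
                "Item": {"order_id": {"S": "o-1"}, "amount": {"N": "100"}},
            }
        },
        {   # ...and debit the balance only if there are sufficient funds
            "Update": {
                "TableName": "AccountBalance",
                "Key": {"account_id": {"S": "a-1"}},
                "UpdateExpression": "SET #b = #b - :amt",
                "ConditionExpression": "#b >= :amt",
                "ExpressionAttributeNames": {"#b": "balance"},
                "ExpressionAttributeValues": {":amt": {"N": "100"}},
            }
        },
    ]
)
```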
DynamoDB Session State Cache
- it is common to use DynamoDB to store session state
- vs ElastiCache
- ElastiCache is in memory, but DynamoDB is serverless with auto scaling
- both are key value pairs
- vs EFS
- EFS must be attached to EC2 instances as a network drive
- vs EBS and Instance store
- EBS and Instance store can only be used for local caching, not shared caching
- vs S3
- S3 is higher latency, and not meant for small objects
DynamoDB Security and other features
- security
- VPC endpoints available to access DynamoDB without using the internet
- access fully controlled by IAM
- encryption at rest using KMS and in transit using SSL/TLS
- backup and restore feature available
- point in time recovery (PITR) like RDS
- no performance impact
- global tables
- multi region, multi active, fully replicated, high performance, need to enable DynamoDB streams first
- DynamoDB local
- develop and test apps locally without accessing the DynamoDB web service (without internet)
- AWS database migration service can be used to migrate to DynamoDB
Fine-Grained access control
- using web identity federation or cognito identity pools, each user gets AWS credentials
- you can assign an IAM role to these users with a condition to limit their API access to DynamoDB
- Leading Keys - limit row level access for users on the primary key
- Attributes - limit specific attributes the user can see
API Gateway
Integrations high level
- lambda function
- invoke lambda function
- easy way to expose REST API backed by lambda
- HTTP
- expose HTTP endpoints in the backend
- why? add rate limiting, caching, user authentications, API keys, etc…
- AWS service
- expose any API through the API Gateway
- example: Step function workflow, post a message to SQS
- why? add authentication, deploy publicly, rate control…
endpoint types
- Edge-Optimized (default): for global clients
- requests are routed through the CloudFront Edge locations
- the API Gateway still lives in only one region
- regional
- for clients within the same region
- could manually combine with CloudFront (more control over the caching strategies and the distribution)
- private
- can only be accessed from your VPC using an interface VPC endpoint (ENI)
- use a resource policy to define access
Deployment stages
- making changes in the API Gateway does not mean they are effective
- you need to make a deployment for them to be in effect
- changes are deployed to Stages (as many as you want)
- use the naming you like for stages (dev, test, prod)
- each stage has its own configuration parameters
- stages can be rolled back as a history of deployments is kept
stage variables
- stage variables are like environment variables for API Gateway
- use them for frequently changing configuration values
- they can be used in
- lambda function ARN
- HTTP endpoint
- parameter mapping templates
- use cases:
- configure HTTP endpoints your stages talk to
- pass configuration parameters to lambda through mapping templates
- stage variables are passed to the context object in lambda
stage variables with lambda aliases
- we can create a stage variable to indicate the corresponding lambda alias
- our API gateway will automatically invoke the right lambda function
canary deployment
- possibility to enable canary deployments for any stage
- choose the percentage of traffic the canary channel receives
- metrics and logs are separate (for better monitoring)
- possibility to override stage variables for canary
- this is Blue / Green deployment with lambda and API gateway
API Gateway - Integration types
- MOCK
- API Gateway returns a response without sending the request to the backend (for testing and dev purpose)
- HTTP / AWS
- you must configure both the integration request and integration response
- setup data mapping using mapping templates for the request and response
- AWS_PROXY (lambda proxy)
- incoming request from the client is the input to lambda
- the function is responsible for the logic of request / response
- no mapping templates; headers, query string parameters, etc. are passed as part of the request to the lambda function
- HTTP_PROXY
- no mapping template
- the HTTP request is passed to the backend
- the HTTP response from the backend is forwarded by API gateway
Mapping template
- mapping templates can be used to modify request / response
- rename and modify query string parameters
- modify body content
- add headers
- uses Velocity template language
- filter output results
Mapping template: JSON to XML with SOAP
- SOAP API are XML based, whereas REST API are JSON based
- in this case, API gateway should
- extract data from the request: either path, payload or header
- build SOAP message based on request data (mapping template)
- call SOAP service and receive XML response
- transform XML response to desired format and respond to the user
API Gateway Swagger / Open API spec
- common way of defining REST APIs, using API definition as code
- import existing Swagger / OpenAPI 3.0 spec to API Gateway
- method
- method request
- integration request
- method response
- uses extensions for API Gateway to set up every single option
- can export current API as Swagger / OpenAPI spec
- swagger can be written in YAML or JSON
Caching API response
- caching reduces the number of calls made to the backend
- default TTL is 300 seconds
- caches are defined per stage
- possible to override cache settings per method
- cache encryption option
- cache capacity between 0.5 to 237 GB
- cache is expensive, makes sense in production, may not make sense in dev and test
API Gateway cache invalidation
- able to flush the entire cache immediately
- clients can invalidate the cache with the header Cache-Control: max-age=0 (with proper IAM authorization)
- if you don’t impose an InvalidateCache policy or choose the require authorization check box in the console, any client can invalidate the API cache, which is not good
Usage plan and API keys
- if you want to make an API available as an offering to your customers
- usage plan
- who can access one or more deployed API stages and methods
- how much and how fast they can access them
- uses API keys to identify API clients and meter access
- configure throttling limits and quota limits that are enforced on individual client
- API keys
- alphanumeric string values to distribute to your customers
- can use with usage plans to control access
- throttling limits are applied to API keys
- quota limits set the overall maximum number of requests
logging and tracing
- CloudWatch logs
- enable CloudWatch logging at the stage level
- can override settings on a per API basis
- log contains information about request / response body
- X-Ray
- enable tracing to get extra information about requests in API gateway
- X-Ray API Gateway + Lambda gives you the full picture
CloudWatch metrics
- metrics are by stage, with the possibility to enable detailed metrics
- CacheHitCount and CacheMissCount: efficiency of the cache
- Count: the total number of API requests in a given period
- IntegrationLatency: the time between when API Gateway relays a request to the backend and when it receives a response from the backend
- Latency: the time between when API gateway receives a request from a client and when it returns a response to the client, the latency includes the integration latency and other API gateway overhead
- 4xx Error (client side) and 5xx error (server side)
throttling
- account limit
- API Gateway throttles requests at 10,000 requests per second across all APIs
- soft limit that can be increased upon request
- in case of throttling = 429 too many requests
- can set stage limit and method limits to improve performance
- or you can define usage plans to throttle per customer
- just like lambda concurrency, one API that is overloaded, if not limited, can cause the other APIs to be throttled too.
CORS
- CORS must be enabled when you receive API calls from another domain
- the OPTIONS pre flight request must contain the following headers
- Access-Control-Allow-Methods
- Access-Control-Allow-Headers
- Access-Control-Allow-Origin
- CORS can be enabled through the console
Authentication and Authorization
- IAM
- great for users already within your AWS accounts + resource policy for cross account
- Custom Authorizer
- great for third party tokens
- very flexible in terms of what IAM policy is returned
- Cognito User Pool
- you manage your own user pool
- no need to write any custom code
- must implement authorization in the backend
WebSocket API
- what is WebSocket
- two way interactive communication between a user’s browser and a server
- server can push information to the client
- this enables stateful application use cases
- WebSocket APIs are often used in real time applications such as chat applications, collaboration platforms, multiplayer games, and financial trading platforms
- works with AWS services (lambda, DynamoDB) or HTTP endpoints
Routing
- incoming JSON messages are routed to different backend
- if no routes => send to default
- you define a route selection expression to select the field on the JSON to route from
- the result is evaluated against the route keys available in your API gateway
- the route is then connected to the backend you have set up through API gateway
Architecture
- create a single interface for all the microservices in your company
- use API endpoints with various resources
- apply a simple domain name and SSL certificates
- can apply forwarding and transformation rules at the API gateway level
SAM (serverless application model)
- framework for developing and deploying serverless applications
- all the configuration is YAML code
- generate complex CloudFormation from simple SAM YAML file
- supports anything from CloudFormation
- only two commands to deploy to AWS
- SAM can use CodeDeploy to deploy lambda functions
- SAM can help you to run lambda, API gateway, DynamoDB locally
Recipe
- the Transform header indicates it’s a SAM template (Transform: 'AWS::Serverless-2016-10-31')
- write code
- AWS::Serverless::Function
- AWS::Serverless::Api
- AWS::Serverless::SimpleTable
- package and deploy
- aws cloudformation package / sam package
- aws cloudformation deploy / sam deploy
SAM policy templates
- list of templates to apply permissions to your lambda functions
- important examples
- S3ReadPolicy: give read only permissions to objects in S3
- SQSPollerPolicy: allows to poll an SQS queue
- DynamoDBCrudPolicy: CRUD = create read update delete
SAM Summary
- SAM is built on CloudFormation
- SAM requires the Transform and Resources sections
- commands to know
- sam build: fetch dependencies and create local deployment artifacts
- sam package: package and upload to Amazon S3, generate CloudFormation template
- sam deploy: deploy to CloudFormation
- SAM policy templates for easy IAM policy definition
- SAM is integrated with CodeDeploy to deploy to lambda aliases
Serverless Application Repository (SAR)
- managed repository for serverless applications
- the applications are packaged using SAM
- build and publish applications that can be re used by organizations
- can share publicly
- can share with specific accounts
- this prevents duplicate work, and lets you go straight to publishing
- application settings and behavior can be customized using Environment variables
Cloud Development Kit (CDK)
- define your cloud infrastructure using a familiar language
- contains high level components called constructs
- the code is compiled into a CloudFormation template (YAML / JSON)
- you can therefore deploy infrastructure and application runtime code together
- great for lambda functions
- great for Docker Containers in ECS / EKS
CDK vs SAM
- SAM
- serverless focused
- write your template declaratively in JSON or YAML
- great for quickly getting started with lambda
- leverages CloudFormation
- CDK
- all aws services
- write infra in a programming language
- leverages CloudFormation
Cognito
- we want to give our users an identity so that they can interact with our application
- Cognito user pools
- sign in functionality for app users
- integrate with API gateway and ALB
- Cognito Identity Pool (federated identity)
- provide AWS credentials to users so they can access AWS resources directly
- integrate with Cognito user pools as an identity provider
- Cognito Sync
- Synchronize data from device to Cognito
- is deprecated and replaced by AppSync
Cognito User Pools
- create a serverless database of users for your web and mobile apps
- simple login: username and password combination
- password reset
- email and phone number verification
- federated identities: users from Facebook, Google, SAML…
- feature: block users if their credentials are compromised elsewhere
- login sends back a JSON web token (JWT)
- Cognito has a hosted authentication UI that you can add to your app to handle signup and signin workflows
- using the hosted UI, you have a foundation for integration with social logins, OIDC or SAML
- can customize with a custom logo and custom CSS
Cognito Identity Pools
- get identities for users so they obtain temporary AWS credentials
- your identity pool can include
- public providers (login with Amazon, Facebook, Google, Apple)
- users in an Amazon Cognito user pool
- OpenID Connect Providers and SAML identity providers
- developer authenticated identities
- Cognito identity pools allow for unauthenticated (guest) access
- users can then access AWS service directly or through API gateway
- the IAM policies applied to the credentials are defined in Cognito
- they can be customized based on the user_id for fine grained control
IAM roles
- default IAM roles for authenticated and guest users
- define rules to choose the role for each user based on the user’s ID
- you can partition your users’ access using policy variables
- IAM credentials are obtained by Cognito identity pools through STS
- the roles must have a trust policy of Cognito identity pools
Cognito User Pools vs Cognito Identity Pools
- Cognito User Pool
- database of users for your web and mobile application
- allows to federate logins through public social identity provider, OIDC, SAML…
- can customize the hosted UI for authentication
- has triggers with AWS lambda during the authentication flow
- Cognito identity pools
- obtain AWS credentials for your users
- users can login through public social, OIDC, SAML and Cognito User Pools
- users can be unauthenticated
- users are mapped to IAM roles and policies, can leverage policy variables
- CUP + CIP = manage users / password + access AWS services
Cognito Sync
- Deprecated - use AWS AppSync now
- store preferences, configuration, state of app
- cross device synchronization
- offline capability
- store data in datasets
- push sync: silently notify across all devices when identity data changes
- Cognito Stream: stream data from Cognito into Kinesis
- Cognito Events: execute lambda functions in response to events
Step Functions
- model your workflows as state machines (one per workflow)
- order fulfillment, data processing
- web applications, any workflow
- written in JSON
- visualization of the workflow and the execution of the workflow, as well as history
- start workflow with SDK call, API gateway, eventbridge
task states
- do some work in your state machine
- invoke one service
- can invoke a lambda function
- run an AWS Batch job
- run an ECS task and wait for it to complete
- insert an item into DynamoDB
- publish message to SNS, SQS
- launch another step function workflow
- run an activity
- EC2, Amazon ECS, on premises
- activities poll the step functions for work
- activities send result back to step functions
states
- choice state: test for a condition to send to a branch
- fail or succeed state: stop execution with failure or success
- pass state: simply pass its input to its output or inject some fixed data, without performing work
- wait state: provide a delay for a certain amount of time or until a specified time/date
- Map state: dynamically iterate steps
- parallel state: begin parallel branches of execution
Error handling
- any state can encounter runtime errors for various reasons
- state machine definition issues
- task failures
- transient issues
- use Retry and Catch in the state machine to handle the errors instead of inside the application code
- the state may report its own errors
Retry
- evaluated from top to bottom
- ErrorEquals: match a specific kind of error
- IntervalSeconds: initial delay before retrying
- BackoffRate: multiply the delay after each retry
- MaxAttempts: defaults to 3, set to 0 to never retry
- when max attempts are reached, the Catch kicks in
Catch
- evaluated from top to bottom
- ErrorEquals: match a specific kind of error
- Next: state to send to
- ResultPath: a path that determines what input is sent to the state specified in the Next field
ResultPath
- include the error in the input
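A sketch of how Retry, Catch and ResultPath look on a Task state, written as a Python dict that mirrors the Amazon States Language JSON (the resource ARN and state names are placeholders):

```python
# A Task state with Retry and Catch, as it would appear (serialized to JSON)
# inside the "States" section of a state machine definition.
process_order = {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessOrder",  # placeholder
    "Retry": [
        {
            "ErrorEquals": ["States.Timeout", "Lambda.ServiceException"],
            "IntervalSeconds": 2,   # initial delay before the first retry
            "BackoffRate": 2.0,     # delay is multiplied after each retry
            "MaxAttempts": 3,       # 0 would mean never retry
        }
    ],
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],  # anything that survived the retries
            "ResultPath": "$.error",        # append the error to the state input
            "Next": "NotifyFailure",        # placeholder failure-handling state
        }
    ],
    "End": True,
}
```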
AppSync
- AppSync is a managed service that uses GraphQL
- GraphQL makes it easy for applications to get exactly the data they need
- this includes combining data from one or more sources
- retrieve data in real time with WebSocket or MQTT on WebSocket
- for mobile apps: local data access and data Synchronization
- it all starts with uploading one GraphQL schema
Security
- there are four ways you can authorize applications to interact with your AppSync GraphQL API
- API KEY
- IAM
- OPENID_CONNECT
- COGNITO USER POOLS
- for custom domain and HTTPS, use CloudFront in front of AppSync
STS (Security Token Service)
- Allows to grant limited and temporary access to AWS resources
- AssumeRole: assume roles within your account or cross account
- AssumeRoleWithSAML: return credentials for users logged in with SAML
- AssumeRoleWithWebIdentity
- return credentials for users logged in with an IdP
- AWS recommends against using this, and using Cognito User Pools instead
- GetSessionToken: For MFA, from a user or account root user
- GetFederationToken: obtain temporary credentials for a federated user
- GetCallerIdentity: return details about the IAM user or role used in the API call
- DecodeAuthorizationMessage: decode error message when an AWS API is called
using STS to assume a role
- define an IAM role within your account or cross account
- define which principals can access this IAM role
- use STS to retrieve credentials and impersonate the IAM role you have access to
- temporary credentials can be valid between 15 minutes to 1 hour
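A boto3 sketch of assuming a role and using the temporary credentials (the role ARN is a placeholder):

```python
import boto3

sts = boto3.client("sts")

resp = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ReadOnlyRole",  # placeholder
    RoleSessionName="demo-session",
    DurationSeconds=900,  # 15 minutes, the minimum
)
creds = resp["Credentials"]

# Impersonate the role by building a client from the temporary credentials.
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```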
STS with MFA
- use GetSessionToken from STS
- appropriate IAM policy using IAM conditions: aws:MultiFactorAuthPresent:true
- GetSessionToken returns
- access key ID
- secret access key
- session token
- expiration date
Advanced IAM
IAM policies and S3 Bucket policies
- IAM policies are attached to users, roles and groups
- S3 bucket policies are attached to buckets
- when evaluating if an IAM principal can perform an operation X on a bucket, the union of its assigned IAM policies and S3 bucket policies will be evaluated at the same time
Dynamic policies with IAM
- how do you assign each user access to their own folder in an S3 bucket?
- create one dynamic policy with IAM
- leverage the special policy variable ${aws:username}
inline vs managed policies
- AWS managed policy
- maintained by AWS
- good for power users and administrators
- updated in case of new services and new APIs
- customer managed policy
- best practice, reusable, can be applied to many principals
- version controlled + rollback, central change management
- inline
- strict one to one relationship between policy and principal
- policy is deleted if you delete the IAM principal
granting a user permissions to pass a role to an AWS service
- to configure many services, you must pass an IAM role to the service
- the service will later assume the role and perform actions
- for this, you need the IAM permission iam:PassRole
- it often comes with iam:GetRole to view the role being passed
can a role be passed to any service?
- no: roles can only be passed to what their trust allows
- a trust policy for the role that allows the service to assume the role
Directory service - overview
- AWS managed Microsoft AD
- create your own AD in AWS, manage users locally, supports MFA
- establish trust connections with your on-premises AD
- AD connector
- directory gateway to redirect to on premises AD
- users are managed on the on premises AD only
- Simple AD
- AD compatible managed directory on AWS
- cannot be joined with on premises AD
KMS
Encryption
Encryption in flight
- data is encrypted before sending and decrypted after receiving
- SSL certificate help with encryption
- encryption in flight ensures no MITM can happen
server side encryption at rest
- data is encrypted after being received by the server
- data is decrypted before being sent
- it is stored in an encrypted form thanks to a key
- the encryption / decryption keys must be managed somewhere and the server must have access to it
Client side encryption
- data is encrypted by the client and never decrypted by the server
- data will be decrypted by a receiving client
- the server should not be able to decrypt the data
- could leverage Envelope encryption
AWS KMS
- fully integrated with IAM for authorization
- seamlessly integrated into
- EBS
- S3
- RedShift
- RDS
- SSM
- but you can also use CLI / SDK
- the value in KMS is that the CMK used to encrypt data can never be retrieved by the user, and the CMK can be rotated for extra security
- KMS can only help in encrypting up to 4KB of data per call; if data > 4KB, we need to use Envelope encryption
- to give access to KMS to someone
- make sure the key policy allows the user
- make sure the IAM policy allows the API calls
CMK Types
- Symmetric
- first offering of KMS, single encryption key that is used to encrypt and decrypt
- AWS services that are integrated with KMS use Symmetric CMKs
- necessary for envelope encryption
- you never get access to the key unencrypted (must call KMS API to use)
- Asymmetric
- public and private key pair
- used for encrypt and decrypt
- the public key is downloadable, but you can’t access the private key unencrypted
- use case: encryption outside of AWS by users who can’t call the KMS API
KMS key policies
- control access to KMS keys, similar to S3 bucket policies
- difference: you cannot control access without them
- default KMS key policy
- created if you don’t provide a specific key policy
- complete access to the key to the root user, which means all IAM users can access the key
- gives access to the KMS key through IAM policies
- custom KMS key policy
- define users, roles that can access the KMS key
- define who can administer the key
- helpful for cross account access of your KMS key
copying snapshots across accounts
- create a snapshot, encrypted with your own CMK
- attach a KMS key policy to authorize cross account access
- share the encrypted snapshot
- create a copy of the snapshot, encrypt it with a KMS key in your account
- create a volume from the snapshot
Envelope encryption
- the KMS Encrypt API call has a limit of 4 KB
- if you want to encrypt > 4 KB, you need to use envelope encryption
- the main API that will help us is the GenerateDataKey API
- steps
- Encryption
- call the GenerateDataKey API to get the plaintext data key and the encrypted data key (encrypted using your CMK)
- encrypt the big file using the plaintext data key on your local machine (client side)
- create an envelope that includes the encrypted data key and the encrypted big file
- decryption
- call Decrypt API, send the encrypted data key to KMS to decrypt using your own CMK
- plaintext data key will be returned
- use the plaintext data key to decrypt your encrypted big file.
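A sketch of the envelope encryption flow with boto3 (the key alias is a placeholder; the local encryption step uses the third-party cryptography package purely as an example, since the Encryption SDK covered below does this for you):

```python
import base64
import boto3
from cryptography.fernet import Fernet  # example local-encryption library

kms = boto3.client("kms")

# 1) Ask KMS for a data key: plaintext copy + copy encrypted under the CMK.
resp = kms.generate_data_key(KeyId="alias/my-cmk", KeySpec="AES_256")  # placeholder alias
plaintext_key, encrypted_key = resp["Plaintext"], resp["CiphertextBlob"]

# 2) Encrypt the big file locally (client side) with the plaintext data key,
#    then store the ciphertext together with the encrypted data key (the envelope).
fernet = Fernet(base64.urlsafe_b64encode(plaintext_key))
ciphertext = fernet.encrypt(b"a file bigger than 4 KB ...")

# 3) Decryption: KMS decrypts the data key, then we decrypt the file locally.
plaintext_key = kms.decrypt(CiphertextBlob=encrypted_key)["Plaintext"]
data = Fernet(base64.urlsafe_b64encode(plaintext_key)).decrypt(ciphertext)
```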
Encryption SDK
- the Encryption SDK implements envelope encryption for us
- the encryption SDK also exists as a CLI tool we can install
- feature - data key caching
- re use data keys instead of creating new data keys for each encryption
- helps with reducing the number of API calls to KMS with a security trade off
KMS symmetric - API summary
- encrypt: up to 4KB
- GenerateDataKey: generates a unique symmetric data key
- returns a plaintext copy of the data key
- and a copy that is encrypted under the CMK that you specify
- decrypt: decrypt up to 4KB of data (including data encryption keys)
- GenerateRandom: returns a random byte string
Quota limits
- when you exceed a request quota, you get a ThrottlingException
- to respond, use exponential backoff
- cryptographic operations share the same quota
- this includes requests made by AWS on your behalf
- for GenerateDataKey, consider using DEK caching from the encryption SDK
- you can also request a quota increase through AWS support
SSE-KMS deep dive
- SSE-KMS leverages the GenerateDataKey and Decrypt KMS API calls
- these KMS API calls will show up in CloudTrail, helpful for logging
- to perform SSE-KMS, you need
- a KMS key policy that authorize the user / role (so we could use the key)
- an IAM policy that authorizes access to KMS (so we could access the AWS KMS service)
- otherwise you will get an access denied error
- S3 calls to KMS for SSE-KMS count against your KMS limits
- if throttling, try exponential backoff
- or request an increase in KMS limits
S3 bucket policies - force SSL
- to force SSL, create an S3 bucket policy with a DENY on the condition aws:SecureTransport=false
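A sketch of that bucket policy applied with boto3 (the bucket name is a placeholder):

```python
import json
import boto3

bucket = "my-example-bucket"  # placeholder
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyNonSSLRequests",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        # Deny any request that did not come in over HTTPS.
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```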
S3 bucket policy - force encryption of SSE-KMS
- deny incorrect encryption header: make sure it includes aws:kms
- deny no encryption header to ensure objects are not uploaded unencrypted
- we could also use S3 default encryption of SSE-KMS; in this case, we don’t need the second policy
S3 bucket key for SSE-KMS encryption
- we could enable S3 bucket key to reduce the API calls to KMS directly
- the bucket key is used to encrypt S3 objects with new data keys, using envelope encryption
- you will see fewer KMS CloudTrail events
SSM Parameter Store
- secure storage for configuration and secrets
- optional seamless encryption using KMS
- serverless, scalable, durable, easy SDK
- version tracking of configurations / secrets
- configuration management using path and IAM
- notifications with CloudWatch events
- integration with CloudFormation
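A boto3 sketch of reading SecureString parameters, with KMS decryption handled by SSM (parameter paths are placeholders):

```python
import boto3

ssm = boto3.client("ssm")

# Single parameter, decrypted transparently with the associated KMS key.
param = ssm.get_parameter(Name="/my-app/dev/db-password", WithDecryption=True)
print(param["Parameter"]["Value"])

# A whole configuration path at once (works nicely with path-based IAM policies).
resp = ssm.get_parameters_by_path(Path="/my-app/dev/", Recursive=True, WithDecryption=True)
for p in resp["Parameters"]:
    print(p["Name"], "=", p["Value"])
```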
Parameter policies
- allow to assign a TTL to a parameter to force updating or deleting sensitive data
- can assign multiple policies at a time
Secrets Manager
- Newer service, meant for storing secrets
- capability to force rotation of secrets every X days
- automate generation of secrets on rotation using lambda function
- integration with RDS
- secrets are encrypted using KMS
- mostly meant for RDS integration
SSM Parameter store vs secrets manager
- secrets manager
- automatic rotation of secrets with lambda
- lambda function is provided for RDS, Redshift…
- KMS encryption is mandatory
- SSM parameter store
- simple API
- no secret rotation (can be implemented using CloudWatch events and lambda)
- KMS encryption is optional
- can pull Secrets Manager secrets using the SSM Parameter Store API
CloudWatch logs - encryption
- you can encrypt CloudWatch logs with KMS keys
- encryption is enabled at the log group level, by associating a CMK with a log group, either when you create the log group or after it exists
- you cannot associate a CMK with a log group using the CloudWatch console, have to use CLI
- you must use the CloudWatch logs API
- associate-kms-key: if the log group already exists
- create-log-group: if the log group doesn’t exist yet
ACM (AWS certificate manager)
- provision, manage, and deploy SSL / TLS certificates
- used to provide in flight encryption for websites
- supports both public and private TLS certificates
- free of charge for public TLS certificates
- automatic TLS certificate renewal
- integration with
- ELB
- CloudFront
- APIs on API Gateway