Amazon Athena
- Serverless query service to analyze data stored in Amazon S3
- Uses standard SQL to query the files
- Interactive query service that makes it easy to analyze data directly in S3 using standard SQL
- Supports CSV, JSON, ORC, Avro, and Parquet
- Pricing : $5.00 per TB of data scanned
- Commonly used with Amazon QuickSight for reporting/dashboards
- Easy data visualization
- Performance Improvement
- Use columnar data for cost savings (less data scanned)
- Apache Parquet or ORC is recommended
- Huge performance improvement
- Use Glue to convert the data to Parquet or ORC
- Compress data for smaller retrievals
- Partition datasets in S3 for easy querying on virtual columns
- Use larger files to minimize overhead
- Federated Query
- Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources
- Uses Data Source Connectors that run on AWS Lambda to run Federated Queries
- Store the results back in Amazon S3
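As a rough illustration of why columnar formats, compression, and partitioning lower the Athena bill, the sketch below estimates cost from bytes scanned at the $5.00 per TB rate noted above. The function name, the decimal TB size, and the reduction fractions are assumptions made for illustration, not Athena parameters or measured values.

```python
# Rough Athena cost estimate: price is per TB of data scanned (from the notes above).
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 10**12  # assumption: decimal terabyte


def estimate_query_cost(total_bytes, column_fraction=1.0,
                        partition_fraction=1.0, compression_ratio=1.0):
    """Estimate the cost of one query.

    column_fraction: share of columns actually read (1.0 for row formats like CSV).
    partition_fraction: share of partitions left after pruning.
    compression_ratio: compressed size / uncompressed size.
    All three are hypothetical knobs, not Athena settings.
    """
    scanned = total_bytes * column_fraction * partition_fraction * compression_ratio
    return scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# 1 TB of raw CSV, full scan
csv_cost = estimate_query_cost(10**12)
# Same data as Parquet: read 2 of 10 columns, 1 of 10 partitions, half the size
parquet_cost = estimate_query_cost(10**12, column_fraction=0.2,
                                   partition_fraction=0.1, compression_ratio=0.5)
print(f"CSV: ${csv_cost:.2f}, Parquet + partitions: ${parquet_cost:.2f}")
```

The point is only the shape of the calculation: every factor that shrinks the bytes scanned shrinks the cost by the same proportion.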
Redshift
- Based on PostgreSQL, but it's not used for OLTP
- OLAP - Online Analytical Processing
- 10x better performance than other data warehouses, scale to PBs of data
- Columnar storage of data & parallel query engine
- Pay as you go based on the instances provisioned
- Has a SQL interface for performing the queries
- BI tools such as Amazon QuickSight or Tableau integrate with it
- vs Athena : faster queries / joins / aggregations on data loaded into the cluster, thanks to sort keys and columnar storage (Redshift has no traditional indexes)
- Cluster
- Leader node : for query planning, results aggregation
- Compute node : for performing the queries, send results to leader
- Provision the node size in advance
- Can use Reserved Instances for cost savings
- Snapshots & DR
- Has "Multi-AZ" mode for some clusters
- Snapshots are point-in-time backups of a cluster, stored internally in S3
- Snapshots are incremental
- Can restore a snapshot into a new cluster
- Automated
- every 8 hours, every 5 GB of data changes per node, or on a schedule
- set retention between 1 and 35 days
- Manual
- Snapshot is retained until you delete it
- Can configure Amazon Redshift to automatically copy snapshots of a cluster to another AWS Region
- Spectrum
- Query data that is already in S3 without loading it
- Must have a Redshift cluster available to start the query
- The query is then submitted to thousands of Redshift Spectrum nodes
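The leader/compute split described in the Cluster notes above can be pictured as a scatter-gather pattern. The sketch below is a toy model in plain Python, not Redshift code: the function names and the round-robin distribution are assumptions chosen for illustration.

```python
# Toy model of the leader/compute split: not Redshift code, just the pattern.
def compute_node(rows):
    """Each compute node scans its slice and returns a partial aggregate."""
    return sum(rows), len(rows)


def leader_node(all_rows, num_compute_nodes):
    """The leader distributes work and aggregates the partial results."""
    # Distribute rows round-robin across compute nodes (hypothetical strategy)
    slices = [all_rows[i::num_compute_nodes] for i in range(num_compute_nodes)]
    partials = [compute_node(s) for s in slices]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None


sales = [120, 80, 200, 40, 60]
average = leader_node(sales, num_compute_nodes=3)
print(average)
```

Each compute node only returns a small partial result (sum and count), so the leader can combine them without seeing the raw rows.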
Amazon OpenSearch Service
- Amazon OpenSearch is the successor to Amazon Elasticsearch
- In DynamoDB, queries are only possible by primary key or indexes
- With OpenSearch, you can search any field, even with partial matches
- It's common to use OpenSearch as a complement to another database
- Two modes : managed cluster or serverless cluster
- Does not natively support SQL
- Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
- Security through Cognito & IAM, KMS encryption, TLS
- Comes with OpenSearch Dashboards
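The contrast between key-based access and searching any field can be shown with a small in-memory example. This is a toy sketch: the sample documents and function names are hypothetical, and the linear scan stands in for what a real search engine does with an index.

```python
# Toy contrast between a key-value lookup and a search over any field.
products = {
    "p1": {"name": "Wireless Mouse", "brand": "Acme", "color": "black"},
    "p2": {"name": "Mouse Pad", "brand": "Globex", "color": "blue"},
    "p3": {"name": "Keyboard", "brand": "Acme", "color": "black"},
}


def get_by_key(table, key):
    """Key-value style access: you must know the primary key."""
    return table.get(key)


def search_any_field(table, text):
    """Search style access: match a partial string in any field."""
    needle = text.lower()
    return sorted(
        key for key, doc in table.items()
        if any(needle in str(value).lower() for value in doc.values())
    )


print(get_by_key(products, "p3"))
print(search_any_field(products, "mous"))
print(search_any_field(products, "acme"))
```

This is why a search service is often paired with a primary database: the database answers lookups by key, and the search layer returns the matching keys for free-text queries.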
Amazon EMR
- Elastic Map Reduce
- Helps create Hadoop clusters to analyze and process vast amounts of data
- The clusters can be made of hundreds of EC2 instances
- EMR comes bundled with Apache Spark, HBase, Presto, Flink ...
- EMR takes care of all the provisioning and configuration
- Auto-scaling and integrated with Spot Instances
- Use Cases :
- data processing, machine learning, web indexing, big data
- Node types & purchasing
- Master Node : Manage the cluster, coordinate, manage health - long running
- Core Node : Run tasks and store data - long running
- Task Node : Just to run tasks - usually Spot
- Purchasing options :
- On-demand : reliable, predictable, won't be terminated
- Reserved (min 1 yr) : cost savings
- Spot Instances : cheaper, can be terminated, less reliable
- Can have long-running cluster, or transient cluster
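The node types and purchasing options above can be combined into a cluster layout like the following sketch. The dictionary keys are assumed to mirror the `InstanceGroups` parameter of the boto3 EMR `run_job_flow` call, so check the API reference before relying on them. The helper name, instance type, and counts are hypothetical placeholders.

```python
# Sketch of an instance-group layout following the node types above.
# Instance type and counts are hypothetical placeholders.
def build_instance_groups(core_count, task_count):
    return [
        # Master and core nodes are long running, so keep them on On-Demand
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_count},
        # Task nodes only run tasks, so losing a Spot instance loses no stored data
        {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
         "InstanceType": "m5.xlarge", "InstanceCount": task_count},
    ]


groups = build_instance_groups(core_count=2, task_count=4)
spot_roles = [g["InstanceRole"] for g in groups if g["Market"] == "SPOT"]
print(spot_roles)
```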
Amazon QuickSight
- Serverless machine learning-powered business intelligence service to create interactive dashboards
- Fast, automatically scalable, embeddable, with per-session pricing
- Use Case :
- Business analytics
- Building visualization
- Perform ad-hoc analysis
- Get business insights using data
- Integrated with RDS, Aurora, Athena, Redshift, S3
- In-memory computation using SPICE engine if data is imported into QuickSight
- Enterprise edition : Possibility to set up Column-Level Security (CLS)
- Dashboard & Analysis
- Define Users and Groups
- These users and groups exist only within QuickSight, not in IAM
- A dashboard
- is a read-only snapshot of an analysis that you can share
- preserves the configuration of the analysis
- You can share the analysis or the dashboard with Users or Groups
- To share a dashboard, you must first publish it
- Users who see the dashboard can also see the underlying data
AWS Glue
- Managed extract, transform, and load (ETL) service
- Useful to prepare and transform data for analytics
- Fully serverless service
- Things to know at a high-level
- Glue Job Bookmarks : prevent re-processing old data
- Glue Elastic Views :
- Combine and replicate data across multiple data stores using SQL
- No custom code, Glue monitors for changes in the source data, serverless
- Leverages a "virtual table"
- Glue DataBrew : clean and normalize data using pre-built transformations
- Glue Studio : new GUI to create, run and monitor ETL jobs in Glue
- Glue Streaming ETL : compatible with Kinesis Data Streams, Kafka, MSK
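The idea behind job bookmarks (skip data that an earlier run already handled) can be sketched in a few lines. This is a toy model, not the Glue API: the `run_job` function, the record shape, and the timestamp-based bookmark are assumptions for illustration.

```python
# Toy illustration of the bookmark idea: remember what was processed,
# and skip it on the next run. Not the Glue API.
def run_job(records, bookmark):
    """Process only records newer than the bookmark; return (processed, new_bookmark)."""
    new_records = [r for r in records if r["ts"] > bookmark]
    processed = [r["id"] for r in new_records]
    new_bookmark = max((r["ts"] for r in new_records), default=bookmark)
    return processed, new_bookmark


source = [{"id": "a", "ts": 1}, {"id": "b", "ts": 2}]
first_run, bookmark = run_job(source, bookmark=0)

source.append({"id": "c", "ts": 3})  # new data arrives
second_run, bookmark = run_job(source, bookmark)
print(first_run, second_run)
```

The second run sees the full source again but only touches the record that arrived after the stored bookmark.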
AWS Lake Formation
- Data lake = central place to have all your data for analytics purposes
- Fully managed service that makes it easy to set up a data lake in days
- Discover, cleanse, transform, and ingest data into your Data Lake
- It automates many complex manual steps and de-duplicates data
- Combine structured and unstructured data in the data lake
- Out-of-the-box source blueprints
- Fine-grained Access Control for your applications
- Built on top of AWS Glue
Kinesis Data Analytics (SQL application)
- Real-time analytics on Kinesis Data Streams & Firehose using SQL
- Add reference data from Amazon S3 to enrich streaming data
- Fully managed, no servers to provision
- Automatic scaling
- Pay for actual consumption rate
- Output :
- Kinesis Data Streams : create streams out of the real-time analytics queries
- Kinesis Data Firehose : send analytics query results to destinations
- Use Case :
- Time-series analytics
- Real-time dashboards
- Real-time metrics
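To make "real-time analytics with reference data" more concrete, the sketch below counts events per fixed time window and enriches each event from a lookup table. The service itself expresses this in SQL over the stream. The Python here only illustrates the computation, and the ticker-to-sector table, field names, and window size are hypothetical.

```python
from collections import Counter

# Hypothetical reference data, standing in for a lookup file stored in S3.
TICKER_TO_SECTOR = {"AMZN": "retail", "MSFT": "software", "ORCL": "software"}


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, sector), enriching each event on the way."""
    counts = Counter()
    for event in events:
        # Align the timestamp to the start of its fixed-size window
        window_start = event["ts"] - event["ts"] % window_seconds
        sector = TICKER_TO_SECTOR.get(event["ticker"], "unknown")
        counts[(window_start, sector)] += 1
    return dict(counts)


stream = [
    {"ts": 3, "ticker": "AMZN"},
    {"ts": 7, "ticker": "MSFT"},
    {"ts": 9, "ticker": "ORCL"},
    {"ts": 12, "ticker": "AMZN"},
]
result = tumbling_window_counts(stream, window_seconds=10)
print(result)
```

The per-window counts are the kind of result that would then be sent on to Kinesis Data Streams or Kinesis Data Firehose.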
Kinesis Data Analytics for Apache Flink
- Use Flink to process and analyze streaming data
- Run any Apache Flink application on a managed cluster on AWS
- provisioning compute resources, parallel computation, automatic scaling
- application backups (implemented as checkpoints and snapshots)
- Use any Apache Flink programming features
- Flink does not read from Firehose
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
- Alternative to Amazon Kinesis
- Fully managed Apache Kafka on AWS
- Allows you to create, update, and delete clusters
- MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
- Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
- Automatic recovery from common Apache Kafka failures
- Data is stored on EBS volumes for as long as you want
- MSK Serverless
- Run Apache Kafka on MSK without managing the capacity
- MSK automatically provisions resources and scales compute & storage
Kinesis Data Streams vs Amazon MSK
- Kinesis Data Streams
- 1 MB message size limit
- Data Streams with Shards
- Shard Splitting & Merging
- TLS In-flight encryption
- KMS at-rest encryption
- Amazon MSK
- 1 MB default, can be configured higher
- Kafka Topics with Partitions
- Can only add partitions to a topic
- PLAINTEXT or TLS In-flight Encryption
- KMS at-rest encryption
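On the Kinesis side, shard splitting works because each shard owns a range of hash keys and a record's partition key is hashed with MD5 into that 128-bit space. The sketch below models that routing, a shard split, and the 1 MB record limit from the comparison above. The function names are hypothetical, and the limit is assumed to be a binary megabyte.

```python
import hashlib

MAX_RECORD_BYTES = 1024 * 1024  # 1 MB limit from the comparison above (assumed binary MB)
HASH_SPACE = 2**128             # MD5 produces a 128-bit hash key


def hash_key(partition_key):
    """Map a partition key to a 128-bit integer via MD5."""
    return int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)


def route(partition_key, shards):
    """Return the index of the shard whose hash range contains the key."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(shards):
        if start <= h <= end:
            return index
    raise ValueError("hash ranges do not cover the key")


def split_shard(shards, index):
    """Split one shard's hash range in half, as in a shard split."""
    start, end = shards[index]
    middle = (start + end) // 2
    return shards[:index] + [(start, middle), (middle + 1, end)] + shards[index + 1:]


def check_size(data):
    """Reject payloads over the per-record limit."""
    if len(data) > MAX_RECORD_BYTES:
        raise ValueError("record exceeds 1 MB limit")
    return data


shards = [(0, HASH_SPACE - 1)]    # one shard covering the whole range
shards = split_shard(shards, 0)   # now two shards
print(len(shards), route("user-42", shards))
```

Merging is the reverse operation: two adjacent ranges are combined into one. A Kafka topic, by contrast, can only gain partitions, as noted in the MSK column.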