Data & Analytics

ayleeee 2024. 4. 8. 02:14

Amazon Athena

Serverless query service to analyze data stored in Amazon S3
Uses standard SQL to query the files
- An interactive query service that makes it easy to analyze data in S3 directly using standard SQL
Supports CSV, JSON, ORC, Avro, and Parquet
Pricing : $5.00 per TB of data scanned
Commonly used with Amazon QuickSight for reporting/dashboards
- Easy data visualization
Performance Improvement
- Use columnar data for cost savings (less data scanned)
  - Apache Parquet or ORC is recommended
  - Huge performance improvement
  - Use Glue to convert the data to Parquet or ORC
- Compress data for smaller retrievals
- Partition datasets in S3 for easy querying on virtual columns
- Use larger files to minimize overhead
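
The tips above all reduce the amount of data scanned, which is what Athena bills for. A minimal sketch of that arithmetic, assuming 1 TB = 1024^4 bytes; the reduction fractions for columnar reads, compression, and partitioning are made-up illustration values, not measurements:

```python
TB = 1024 ** 4            # bytes per TB (assumption for this sketch)
PRICE_PER_TB_USD = 5.00   # price quoted in these notes


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / TB * PRICE_PER_TB_USD


def scanned_bytes(total_bytes: int,
                  column_fraction: float = 1.0,
                  compression_ratio: float = 1.0,
                  partition_fraction: float = 1.0) -> int:
    """Bytes left to scan after each optimization (all fractions hypothetical).

    column_fraction    : share of columns the query reads (columnar formats)
    compression_ratio  : compressed size / original size
    partition_fraction : share of partitions that match the WHERE clause
    """
    return int(total_bytes * column_fraction * compression_ratio * partition_fraction)


raw_dataset = 2 * TB
full_scan_cost = athena_query_cost(scanned_bytes(raw_dataset))
optimized_cost = athena_query_cost(
    scanned_bytes(raw_dataset,
                  column_fraction=0.1,
                  compression_ratio=0.5,
                  partition_fraction=0.25)
)
print(f"full scan: ${full_scan_cost:.2f}, optimized: ${optimized_cost:.3f}")
```

The three fractions multiply, so combining columnar storage, compression, and partitioning shrinks the bill far more than any one of them alone.
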
Federated Query
- Allows you to run SQL queries across data stored in relational, non-relational, object, and custom data sources
- Uses Data Source Connectors that run on AWS Lambda to run Federated Queries
- Store the results back in Amazon S3

Redshift

Based on PostgreSQL, but it's not used for OLTP
OLAP - Online Analytical Processing
Advertised as 10x better performance than other data warehouses; scales to PBs of data
Columnar storage of data & parallel query engine
Pay as you go based on the instances provisioned
Has a SQL interface for performing the queries
BI tools such as Amazon QuickSight or Tableau integrate with it
vs Athena : faster queries / joins / aggregations thanks to sort keys and distribution keys (Redshift has no conventional indexes)
Cluster
- Leader node : for query planning, results aggregation
- Compute node : for performing the queries, send results to leader
- Provision the node size in advance
- Can use Reserved Instances for cost savings
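
The leader/compute split described above is a scatter-gather pattern. A toy sketch of it in plain Python, assuming an in-memory list of numbers; the function names are hypothetical, and in a real cluster the data already lives on the compute nodes rather than being handed out by the leader:

```python
from typing import List, Tuple


def compute_node(rows: List[float]) -> Tuple[float, int]:
    """A compute node returns a partial aggregate for its slice: (sum, count)."""
    return (sum(rows), len(rows))


def leader_node(all_rows: List[float], num_compute_nodes: int) -> float:
    """The leader plans the work, gathers partial results, and aggregates them."""
    # Plan: give each compute node every n-th row (round-robin distribution).
    slices = [all_rows[i::num_compute_nodes] for i in range(num_compute_nodes)]
    # Scatter: each node aggregates only its own slice.
    partials = [compute_node(rows) for rows in slices]
    # Gather: combine the partial sums and counts into the final average.
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count if count else 0.0


sales = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
print(leader_node(sales, num_compute_nodes=3))  # 35.0
```
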
Snapshots & DR
- Has "Multi-AZ" mode for some clusters
- Snapshots are point-in-time backups of a cluster, stored internally in S3
- Snapshots are incremental
- Can restore a snapshot into a new cluster
- Automated
  - every 8 hrs, every 5 GB, or on a schedule
  - set retention between 1 and 35 days
- Manual
  - Snapshot is retained until you delete it
- Can configure Amazon Redshift to automatically copy snapshots of a cluster to another AWS Region
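
The snapshot rules above can be written down as a few checks. This is only a sketch of the rules as stated in these notes (8 hours, 5 GB, 1 to 35 days); the function names are hypothetical and this is not how Redshift implements them:

```python
from datetime import datetime, timedelta
from typing import Optional


def automated_snapshot_due(hours_since_last: float, gb_changed_since_last: float) -> bool:
    """Automated snapshots trigger every 8 hours or every 5 GB of changes."""
    return hours_since_last >= 8 or gb_changed_since_last >= 5


def validate_retention_days(days: int) -> int:
    """Automated snapshot retention must be between 1 and 35 days."""
    if not 1 <= days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    return days


def is_expired(created_at: datetime, now: datetime,
               retention_days: Optional[int] = None) -> bool:
    """Manual snapshots (retention_days=None) are kept until deleted."""
    if retention_days is None:
        return False
    return now - created_at > timedelta(days=retention_days)


created = datetime(2024, 4, 1)
today = datetime(2024, 4, 8)
print(automated_snapshot_due(hours_since_last=3, gb_changed_since_last=6))  # True
print(is_expired(created, today, retention_days=validate_retention_days(5)))  # True
print(is_expired(created, today))  # False (manual snapshot)
```
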
Spectrum
- Query data that is already in S3 without loading it
- Must have a Redshift cluster available to start the query
- The query is then submitted to thousands of Redshift Spectrum nodes

Amazon OpenSearch Service

Amazon OpenSearch is the successor to Amazon Elasticsearch
In DynamoDB, queries are only possible by primary key or indexes
With OpenSearch, you can search any field, even with partial matches
It's common to use OpenSearch as a complement to another database
Two modes : managed cluster or serverless cluster
Does not natively support SQL
Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
Security through Cognito & IAM, KMS encryption, TLS
Comes with OpenSearch Dashboards
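
To make the "complement to another database" idea concrete, here is a toy contrast between a key lookup and a search on any field with partial matching. The in-memory `table`, the item data, and the function names are all hypothetical; a real search engine uses inverted indexes rather than the linear scan shown here:

```python
from typing import Dict, List, Optional

# Stand-in for the main database: items are reachable only by primary key.
table: Dict[str, Dict[str, str]] = {
    "p1": {"name": "wireless keyboard", "brand": "Acme"},
    "p2": {"name": "wired mouse", "brand": "Globex"},
    "p3": {"name": "wireless mouse", "brand": "Acme"},
}


def get_by_key(item_id: str) -> Optional[Dict[str, str]]:
    """Key-value style access: you must already know the primary key."""
    return table.get(item_id)


def search_ids(term: str) -> List[str]:
    """Search-style access: match any field, including partial matches."""
    needle = term.lower()
    return sorted(
        item_id
        for item_id, item in table.items()
        if any(needle in value.lower() for value in item.values())
    )


def search_items(term: str) -> List[Dict[str, str]]:
    """Search returns IDs, then the full items are fetched from the main table."""
    return [table[item_id] for item_id in search_ids(term)]


print(search_ids("mouse"))  # ['p2', 'p3']
print(search_ids("acm"))    # ['p1', 'p3']
```
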

Amazon EMR

Elastic Map Reduce
Helps create Hadoop clusters to analyze and process vast amounts of data
The clusters can be made of hundreds of EC2 instances
EMR comes bundled with Apache Spark, HBase, Presto, Flink ...
EMR takes care of all the provisioning and configuration
Auto-scaling and integration with Spot Instances
Use Cases :
- data processing, machine learning, web indexing, big data
Node types & purchasing
- Master Node : Manage the cluster, coordinate, manage health - long running
- Core Node : Run tasks and store data - long running
- Task Node : Just to run tasks - usually Spot
- Purchasing options :
  - On-demand : reliable, predictable, won't be terminated
  - Reserved (min 1 yr) : cost savings
  - Spot Instances : cheaper, can be terminated, less reliable
- Can have long-running cluster, or transient cluster
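
One way to read the node types and purchasing options together is as a rule of thumb. The sketch below encodes that reading; the function name is hypothetical, and the mapping is an interpretation of these notes rather than an AWS rule:

```python
def purchasing_option(node_type: str, long_running_cluster: bool) -> str:
    """Suggest a purchasing option for an EMR node type (rule of thumb only)."""
    kind = node_type.lower()
    if kind == "task":
        # Task nodes only run tasks and hold no data, so losing one is tolerable.
        return "spot"
    if kind in ("master", "core"):
        # Master and core nodes must stay up; reserve them if the cluster lives long.
        return "reserved" if long_running_cluster else "on-demand"
    raise ValueError(f"unknown node type: {node_type}")


for node in ("master", "core", "task"):
    print(node, "->", purchasing_option(node, long_running_cluster=True))
```
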

Amazon QuickSight

Serverless machine learning-powered business intelligence service to create interactive dashboards
Fast, automatically scalable, embeddable, with per-session pricing
Use Case :
- Business analytics
- Building visualization
- Perform ad-hoc analysis
- Get business insights using data
Integrated with RDS, Aurora, Athena, Redshift, S3
In-memory computation using SPICE engine if data is imported into QuickSight
Enterprise edition : Possibility to set up Column-Level security
Dashboard & Analysis
- Define Users and Groups
  - only exist within QuickSight
- A dashboard
  - is a read-only snapshot of an analysis that you can share
  - preserves the configuration of the analysis
- You can share the analysis or the dashboard with Users or Groups
- To share a dashboard, you must first publish it
- Users who see the dashboard can also see the underlying data

AWS Glue

Managed extract, transform, and load (ETL) service
Useful to prepare and transform data for analytics
Fully serverless service
Things to know at a high-level
- Glue Job Bookmarks : prevent re-processing old data
- Glue Elastic Views :
  - Combine and replicate data across multiple data stores using SQL
  - No custom code, Glue monitors for changes in the source data, serverless
  - Leverages a "virtual table"
- Glue DataBrew : clean and normalize data using pre-built transformations
- Glue Studio : new GUI to create, run and monitor ETL jobs in Glue
- Glue Streaming ETL : compatible with Kinesis Data Streams, Kafka, MSK
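
The idea behind Job Bookmarks (remember what was already processed so a rerun skips it) can be sketched in a few lines. The `Bookmark` class and `run_job` function are hypothetical stand-ins, not the Glue API, and the uppercase "transform" is only a placeholder:

```python
from typing import Dict, List, Set


class Bookmark:
    """Remembers which input objects earlier job runs have already processed."""

    def __init__(self) -> None:
        self.processed: Set[str] = set()


def run_job(objects: Dict[str, str], bookmark: Bookmark) -> List[str]:
    """Transform only the objects that the bookmark has not seen yet."""
    new_keys = sorted(key for key in objects if key not in bookmark.processed)
    outputs = [objects[key].upper() for key in new_keys]  # placeholder transform
    bookmark.processed.update(new_keys)  # persist state for the next run
    return outputs


bookmark = Bookmark()
source = {"a.csv": "alpha", "b.csv": "beta"}
print(run_job(source, bookmark))  # ['ALPHA', 'BETA']

source["c.csv"] = "gamma"         # new data arrives
print(run_job(source, bookmark))  # ['GAMMA'] (old objects are skipped)
```
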

AWS Lake Formation

Data lake = central place to have all your data for analytics purposes
Fully managed service that makes it easy to set up a data lake in days
Discover, cleanse, transform, and ingest data into your Data Lake
It automates many complex manual steps and de-duplicates data
Combine structured and unstructured data in the data lake
Out-of-the-box source blueprints
Fine-grained Access Control for your applications
Built on top of AWS Glue

Kinesis Data Analytics (SQL application)

Real-time analytics on Kinesis Data Streams & Firehose using SQL
Add reference data from Amazon S3 to enrich streaming data
Fully managed, no servers to provision
Automatic scaling
Pay for actual consumption rate
Output :
- Kinesis Data Streams : create streams out of the real-time analytics queries
- Kinesis Data Firehose : send analytics query results to destinations
Use Case :
- Time-series analytics
- Real-time dashboards
- Real-time metrics
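
The kind of query behind these use cases is typically a windowed aggregation, optionally enriched with reference data. A sketch of that logic in plain Python instead of streaming SQL, assuming events are `(timestamp_seconds, ticker)` tuples; the tickers, the reference table, and the function names are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, ticker symbol)


def enrich(events: Iterable[Event], reference: Dict[str, str]) -> Iterator[Event]:
    """Replace each ticker with its company name from the reference data."""
    for timestamp, ticker in events:
        yield timestamp, reference.get(ticker, ticker)


def tumbling_window_counts(events: Iterable[Event],
                           window_seconds: int) -> Dict[Tuple[int, str], int]:
    """Count events per key in fixed, non-overlapping time windows."""
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


stream = [(0, "AMZN"), (15, "AMZN"), (59, "MSFT"), (60, "AMZN"), (61, "AMZN")]
reference_data = {"AMZN": "Amazon", "MSFT": "Microsoft"}  # e.g. loaded from S3

for window_key, count in sorted(
        tumbling_window_counts(enrich(stream, reference_data), 60).items()):
    print(window_key, count)
```
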

Kinesis Data Analytics for Apache Flink

Use Flink to process and analyze streaming data
Run any Apache Flink application on a managed cluster on AWS
- provisioning compute resources, parallel computation, automatic scaling
- application backups (implemented as checkpoints and snapshots)
- Use any Apache Flink programming features
- Flink does not read from Firehose

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Alternative to Amazon Kinesis
Fully managed Apache Kafka on AWS
- Allows you to create, update, and delete clusters
- MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
- Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
- Automatic recovery from common Apache Kafka failures
- Data is stored on EBS volumes for as long as you want
MSK Serverless
- Run Apache Kafka on MSK without managing the capacity
- MSK automatically provisions resources and scales compute & storage

Kinesis Data Streams vs Amazon MSK

Kinesis Data Streams
- 1 MB message size limit
- Data Streams with Shards
- Shard Splitting & Merging
- TLS In-flight encryption
- KMS at-rest encryption
Amazon MSK
- 1 MB default, can be configured higher
- Kafka Topics with Partitions
- Can only add partitions to a topic
- PLAINTEXT or TLS In-flight Encryption
- KMS at-rest encryption
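
The scaling difference in this comparison (shards can be split and merged, partitions can only be added) can be sketched as follows. Kinesis maps partition keys to shards through an MD5 hash over a 128-bit key space; everything else here (function names, big-endian byte order, even splits at the midpoint) is a simplifying assumption:

```python
import hashlib
from typing import List, Tuple

MAX_HASH = 2 ** 128 - 1
Shard = Tuple[int, int]  # inclusive hash-key range owned by a shard


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


def shard_for(partition_key: str, shards: List[Shard]) -> int:
    """Return the index of the shard whose range contains the key's hash."""
    value = hash_key(partition_key)
    for index, (start, end) in enumerate(shards):
        if start <= value <= end:
            return index
    raise ValueError("shard ranges do not cover the key space")


def split_shard(shards: List[Shard], index: int) -> List[Shard]:
    """Split one shard into two halves (scale up)."""
    start, end = shards[index]
    mid = (start + end) // 2
    return shards[:index] + [(start, mid), (mid + 1, end)] + shards[index + 1:]


def merge_shards(shards: List[Shard], index: int) -> List[Shard]:
    """Merge a shard with its right-hand neighbour (scale down)."""
    start, _ = shards[index]
    _, end = shards[index + 1]
    return shards[:index] + [(start, end)] + shards[index + 2:]


def set_partition_count(current: int, requested: int) -> int:
    """Kafka topics: the partition count can grow but never shrink."""
    if requested < current:
        raise ValueError("partitions can only be added to a topic")
    return requested


stream = split_shard([(0, MAX_HASH)], 0)
print(len(stream), shard_for("user-42", stream))
print(merge_shards(stream, 0) == [(0, MAX_HASH)])  # True
```
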
