Data & Analytics

ayleeee 2024. 4. 8. 02:14

Amazon Athena

  • Serverless query service to analyze data stored in Amazon S3
  • Uses standard SQL to query the files
    • An interactive query service that makes it easy to analyze data directly in S3 using standard SQL
  • Supports CSV, JSON, ORC, Avro, and Parquet
  • Pricing : $5.00 per TB of data scanned
  • Commonly used with Amazon QuickSight for reporting/dashboards
    • Easy data visualization
  • Performance Improvement
    • Use columnar data for cost savings (less data scanned)
      • Apache Parquet or ORC is recommended
      • Huge performance improvement
      • Use Glue to convert the data to Parquet or ORC
    • Compress data for smaller retrievals
    • Partition datasets in S3 for easy querying on virtual columns
    • Use larger files to minimize overhead
  • Federated Query
    • Lets you run SQL queries across data stored in relational, non-relational, object, and custom data sources
    • Uses Data Source Connectors that run on AWS Lambda to run Federated Queries
    • Store the results back in Amazon S3
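The pricing and performance bullets above can be tied together with a little arithmetic. The following is a minimal sketch, assuming the $5.00 per TB figure from these notes, a binary terabyte, and a deliberately simplified model in which every column has the same width. The function names and the bucket name are hypothetical and are not part of any AWS SDK.

```python
# Rough Athena cost model based on the "$5.00 per TB scanned" figure in the notes.
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabyte


def scan_cost_usd(bytes_scanned: float) -> float:
    """Cost of one query, given how many bytes it scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


def columnar_bytes_scanned(total_bytes: int, columns_read: int,
                           total_columns: int,
                           compression_ratio: float = 1.0) -> float:
    """Simplified model: a columnar format reads only the requested columns
    (assumed equal width), and compression shrinks them further."""
    return total_bytes * (columns_read / total_columns) / compression_ratio


def partition_prefix(table_root: str, **partitions: str) -> str:
    """Build a Hive-style S3 prefix such as .../year=2024/month=04/.
    Partition keys laid out this way can be queried as virtual columns."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{table_root.rstrip('/')}/{parts}/"


# 1 TB of CSV scanned in full vs. 2 of 10 columns from 4x-compressed columnar data.
csv_cost = scan_cost_usd(BYTES_PER_TB)
columnar_cost = scan_cost_usd(
    columnar_bytes_scanned(BYTES_PER_TB, 2, 10, compression_ratio=4.0))
print(f"CSV: ${csv_cost:.2f}, columnar: ${columnar_cost:.2f}")

# Hypothetical bucket and table names, for illustration only.
print(partition_prefix("s3://example-bucket/logs", year="2024", month="04"))
```

Because the bill follows bytes scanned, columnar formats, compression, and partition pruning all reduce cost as well as latency.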

Redshift 

  • Based on PostgreSQL, but it's not used for OLTP
  • OLAP - Online Analytical Processing
  • 10x better performance than other data warehouses, scale to PBs of data
  • Columnar storage of data & parallel query engine
  • Pay as you go based on the instances provisioned
  • Has a SQL interface for performing the queries
  • BI tools such as Amazon QuickSight or Tableau integrate with it
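  • vs Athena : faster queries / joins / aggregations, because data is loaded into the cluster and organized with sort and distribution keys (Redshift has no conventional indexes)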
  • Cluster
    • Leader node : for query planning, results aggregation
    • Compute node : for performing the queries, send results to leader
    • Provision the node size in advance
    • Can use Reserved Instances for cost savings
  • Snapshots & DR
    • Has "Multi-AZ" mode for some clusters
    • Snapshots are point-in-time backups of a cluster, stored internally in S3
    • Snapshots are incremental
    • Can restore a snapshot into a new cluster
    • Automated 
      • every 8 hours, every 5 GB of data changes, or on a schedule
      • set retention between 1 and 35 days
    • Manual
      • Snapshot is retained until you delete it
    • Can configure Amazon Redshift to automatically copy snapshots of a cluster to another AWS Region
  • Spectrum
    • Query data that is already in S3 without loading it
    • Must have a Redshift cluster available to start the query
    • The query is then submitted to thousands of Redshift Spectrum nodes
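The leader node / compute node split listed under Cluster can be pictured as a scatter-gather pattern. The sketch below is only a conceptual model in plain Python, not Redshift's actual engine: threads stand in for compute nodes, and the function names and the `amount` field are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_node(rows):
    """Compute node: run the partial aggregation over its own slice of data."""
    total = sum(row["amount"] for row in rows)
    return total, len(rows)


def leader_node(table, num_compute_nodes=3):
    """Leader node: split the work, fan it out, then merge partial results."""
    # Round-robin distribution of rows across the compute nodes.
    slices = [table[i::num_compute_nodes] for i in range(num_compute_nodes)]
    with ThreadPoolExecutor(max_workers=num_compute_nodes) as pool:
        partials = list(pool.map(compute_node, slices))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return {"sum": total, "count": count,
            "avg": total / count if count else None}


sales = [{"amount": amount} for amount in (10, 20, 30, 40, 50, 60)]
result = leader_node(sales)
print(result)  # {'sum': 210, 'count': 6, 'avg': 35.0}
```

The point is that each node returns a small partial result (a sum and a count), so the leader only merges a handful of values instead of touching every row.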

Amazon OpenSearch Service

  • Amazon OpenSearch is the successor to Amazon Elasticsearch
  • In DynamoDB, queries only exist by primary key or indexes
  • With OpenSearch, you can search any field, even partial matches
  • It's common to use OpenSearch as a complement to another database
  • Two modes : managed cluster or serverless cluster
  • Does not natively support SQL
  • Ingestion from Kinesis Data Firehose, AWS IoT, and CloudWatch Logs
  • Security through Cognito & IAM, KMS encryption, TLS
  • Comes with OpenSearch Dashboards
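To make the contrast with DynamoDB concrete, here is a toy model of the two access patterns: lookup by primary key versus searching any field with partial matches. It is a sketch in plain Python under loud assumptions: it does not use the OpenSearch API, the sample items are invented, and a real search engine uses far more sophisticated indexing and scoring.

```python
from collections import defaultdict

# Hypothetical sample items, keyed by primary key.
items = {
    "p1": {"name": "wireless keyboard", "brand": "Acme"},
    "p2": {"name": "mechanical keyboard", "brand": "Globex"},
    "p3": {"name": "usb mouse", "brand": "Acme"},
}


def get_by_key(primary_key):
    """Key-value style access: you must already know the primary key."""
    return items.get(primary_key)


def build_index(docs):
    """Search style: index every token of every field."""
    index = defaultdict(set)
    for primary_key, doc in docs.items():
        for value in doc.values():
            for token in str(value).lower().split():
                index[token].add(primary_key)
    return index


def search(index, term):
    """Return the keys of all documents with a token containing `term`."""
    term = term.lower()
    hits = set()
    for token, primary_keys in index.items():
        if term in token:  # partial match on any field
            hits |= primary_keys
    return sorted(hits)


index = build_index(items)
print(get_by_key("p3"))       # {'name': 'usb mouse', 'brand': 'Acme'}
print(search(index, "key"))   # ['p1', 'p2']
print(search(index, "acm"))   # ['p1', 'p3']
```

This is why OpenSearch is typically used as a complement: the primary database stays the source of truth, and the search index answers the "find it by any field" questions.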

Amazon EMR

  • Elastic Map Reduce
  • Helps create Hadoop clusters to analyze and process vast amounts of data
  • The clusters can be made of hundreds of EC2 instances
  • EMR comes bundled with Apache Spark, HBase, Presto, Flink ...
  • EMR takes care of all the provisioning and configuration
  • Auto-scaling and integrated with Spot instances
  • Use Cases :
    • data processing, machine learning, web indexing, big data
  • Node types & purchasing
    • Master Node : Manage the cluster, coordinate, manage health - long running
    • Core Node : Run tasks and store data - long running
    • Task Node : Just to run tasks - usually Spot
    • Purchasing options :
      • On-demand : reliable, predictable, won't be terminated
      • Reserved (min 1 yr) : cost savings
      • Spot Instances : cheaper, can be terminated, less reliable
    • Can have long-running cluster, or transient cluster
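The "Map Reduce" in the service name refers to a programming model that is easiest to remember with the classic word count. The following is a single-process sketch of the map, shuffle, and reduce phases in plain Python. It is not EMR or Hadoop code, and the function names and sample lines are made up; on a real cluster each phase would run in parallel across core and task nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data on EMR", "EMR runs Spark", "Spark processes big data"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(mapped).items())
print(counts["big"], counts["emr"], counts["on"])  # 2 2 1
```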

Amazon QuickSight

  • Serverless machine learning-powered business intelligence service to create interactive dashboards
  • Fast, automatically scalable, embeddable, with per-session pricing
  • Use Case :
    • Business analytics
    • Building visualization
    • Perform ad-hoc analysis
    • Get business insights using data
  • Integrated with RDS, Aurora, Athena, Redshift, S3
  • In-memory computation using SPICE engine if data is imported into QuickSight
  • Enterprise edition : Possibility to set up Column-Level security
  • Dashboard & Analysis
    • Define Users and Groups
      • only exist within QuickSight
    • A dashboard
      • is a read-only snapshot of an analysis that you can share
      • preserves the configuration of the analysis
    • You can share the analysis or the dashboard with Users or Groups
    • To share a dashboard, you must first publish it
    • Users who see the dashboard can also see the underlying data

AWS Glue

  • Managed extract, transform, and load (ETL) service
  • Useful to prepare and transform data for analytics
  • Fully serverless service
  • Things to know at a high-level
    • Glue Job Bookmarks : prevent re-processing old data
    • Glue Elastic Views :
      • Combine and replicate data across multiple data stores using SQL
      • No custom code, Glue monitors for changes in the source data, serverless
      • Leverages a "virtual table"
    • Glue DataBrew : clean and normalize data using pre-built transformations
    • Glue Studio : new GUI to create, run and monitor ETL jobs in Glue
    • Glue Streaming ETL : compatible with Kinesis Data Streams, Kafka, MSK
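The idea behind Job Bookmarks (do not re-process old data) can be sketched as "remember how far the last run got". This is a conceptual illustration only, assuming a single integer timestamp as the saved state. It does not reproduce how Glue stores bookmark state internally, and `run_job`, the object keys, and the timestamps are hypothetical.

```python
def run_job(objects, bookmark):
    """Process only objects newer than the bookmark.

    Returns the keys processed in this run and the bookmark to persist
    for the next run.
    """
    new_objects = [obj for obj in objects if obj["last_modified"] > bookmark]
    new_objects.sort(key=lambda obj: obj["last_modified"])
    processed = [obj["key"] for obj in new_objects]
    new_bookmark = max((obj["last_modified"] for obj in new_objects),
                       default=bookmark)
    return processed, new_bookmark


# Hypothetical input objects with integer "last modified" timestamps.
objects = [
    {"key": "a.csv", "last_modified": 100},
    {"key": "b.csv", "last_modified": 200},
]

first_run, bookmark = run_job(objects, bookmark=0)
print(first_run)   # ['a.csv', 'b.csv']

objects.append({"key": "c.csv", "last_modified": 300})
second_run, bookmark = run_job(objects, bookmark)
print(second_run)  # ['c.csv']
```

A run with no new input processes nothing and leaves the bookmark unchanged, which is what makes scheduled ETL jobs safe to repeat.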

AWS Lake Formation

  • Data lake = central place to have all your data for analytics purposes
  • Fully managed service that makes it easy to set up a data lake in days
  • Discover, cleanse, transform, and ingest data into your Data Lake
  • It automates many complex manual steps and de-duplicates data
  • Combine structured and unstructured data in the data lake
  • Out-of-the-box source blueprints
  • Fine-grained Access Control for your applications
  • Built on top of AWS Glue

Kinesis Data Analytics (SQL application)

  • Real-time analytics on Kinesis Data Streams & Firehose using SQL
  • Add reference data from Amazon S3 to enrich streaming data
  • Fully managed, no servers to provision
  • Automatic scaling
  • Pay for actual consumption rate
  • Output : 
    • Kinesis Data Streams : create streams out of the real-time analytics queries
    • Kinesis Data Firehose : send analytics query results to destinations
  • Use Case :
    • Time-series analytics
    • Real-time dashboards
    • Real-time metrics
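A typical real-time metric is "events per key per minute", which is a tumbling-window aggregation. In the service itself this would be written in SQL; the sketch below shows the same windowing logic in plain Python so it can be run locally. The function name, event tuples, and window size are assumptions for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key).

    Tumbling windows are fixed-size and non-overlapping, so every event
    falls into exactly one window.
    """
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# (timestamp in seconds, event type); hypothetical sample stream.
events = [(0, "click"), (15, "click"), (59, "view"),
          (60, "click"), (125, "view")]
metrics = tumbling_window_counts(events)
print(metrics[(0, "click")], metrics[(60, "click")])  # 2 1
```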

Kinesis Data Analytics for Apache Flink

  • Use Flink to process and analyze streaming data
  • Run any Apache Flink application on a managed cluster on AWS
    • provisioning compute resources, parallel computation, automatic scaling
    • application backups (implemented as checkpoints and snapshots)
    • Use any Apache Flink programming features
    • Flink does not read from Firehose

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

  • Alternative to Amazon Kinesis
  • Fully managed Apache Kafka on AWS
    • Lets you create, update, and delete clusters
    • MSK creates & manages Kafka broker nodes & ZooKeeper nodes for you
    • Deploy the MSK cluster in your VPC, multi-AZ (up to 3 for HA)
    • Automatic recovery from common Apache Kafka failures
    • Data is stored on EBS volumes for as long as you want
  • MSK Serverless
    • Run Apache Kafka on MSK without managing the capacity
    • MSK automatically provisions resources and scales compute & storage

Kinesis Data Streams vs Amazon MSK

  • Kinesis Data Streams
    • 1 MB message size limit
    • Data Streams with Shards
    • Shard Splitting & Merging
    • TLS In-flight encryption
    • KMS at-rest encryption
  • Amazon MSK
    • 1 MB default, can be configured higher
    • Kafka Topics with Partitions
    • Can only add partitions to a topic
    • PLAINTEXT or TLS In-flight Encryption
    • KMS at-rest encryption
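"Shard Splitting & Merging" is easier to picture once you see how a record finds its shard. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer, and each shard owns a range of those hash values. The sketch below models that routing and a split in plain Python. It is not the Kinesis API: the shard IDs are simplified and the helper names are hypothetical.

```python
import hashlib

MAX_HASH = 2 ** 128 - 1  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)


def route(shards, partition_key):
    """Find the shard whose hash-key range contains the record's hash."""
    value = hash_key(partition_key)
    for shard_id, (start, end) in shards.items():
        if start <= value <= end:
            return shard_id
    raise ValueError("no shard covers this hash key")


def split_shard(shards, shard_id):
    """Split one shard's hash-key range into two adjacent halves."""
    start, end = shards[shard_id]
    middle = (start + end) // 2
    result = {k: v for k, v in shards.items() if k != shard_id}
    result[f"{shard_id}-a"] = (start, middle)
    result[f"{shard_id}-b"] = (middle + 1, end)
    return result


shards = {"shard-0": (0, MAX_HASH)}      # simplified, hypothetical shard IDs
after_split = split_shard(shards, "shard-0")
print(route(shards, "user-42"))          # shard-0
print(route(after_split, "user-42"))     # one of the two child shards
```

A Kafka topic, by contrast, only lets you add partitions, so there is no equivalent of merging two ranges back together.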
