Topic 2 Question 1
A Machine Learning Specialist must build out a process to query a dataset on Amazon S3 using Amazon Athena. The dataset contains more than 800,000 records stored as plaintext CSV files. Each record contains 200 columns and is approximately 1.5 MB in size. Most queries will span 5 to 10 columns only. How should the Machine Learning Specialist transform the dataset to minimize query runtime?
Convert the records to Apache Parquet format.
Convert the records to JSON format.
Convert the records to GZIP CSV format.
Convert the records to XML format.
Explanation
The correct answer is A. Apache Parquet is a columnar storage format, so Athena reads only the 5 to 10 columns a query references instead of all 200, which dramatically reduces the amount of data scanned, and therefore both query runtime and cost, since Athena bills per byte scanned. Parquet is also compressed, which further reduces scanned bytes and S3 storage. Athena supports several compression formats, including GZIP, LZO, SNAPPY (the Parquet default), and ZLIB. Reference: https://www.cloudforecast.io/blog/using-parquet-on-athena-to-save-money-on-aws/
Comments (2)
👍 5 · viduvivek · 2021/09/24
A... Is the Answer

👍 5 · ChKl · 2021/10/12
A is correct. Parquet format leads to a columnar structure. Say you have 5 cols, then you end up with 5 column chunks. This will dramatically reduce scanning time AND cost, since Athena is billed on the amount of data scanned.