Topic 1 Question 87
You've migrated a Hadoop job from an on-premises cluster to Dataproc and GCS. Your Spark job is a complex analytical workload consisting of many shuffle operations, and the input data are Parquet files (on average 200-400 MB each). You've seen some performance degradation after the migration to Dataproc, so you'd like to optimize for it. Keep in mind that your organization is very cost-sensitive, so you'd like to keep running this workload on Dataproc with preemptible VMs (with only 2 non-preemptible workers). What should you do?
A. Increase the size of your Parquet files to ensure they are at least 1 GB each.
B. Switch to the TFRecord format (approx. 200 MB per file) instead of Parquet.
C. Switch from HDDs to SSDs, copy the input data from GCS to HDFS, run the Spark job, and copy the results back to GCS.
D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size.
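For concreteness, option D maps to a cluster shape along the lines of the sketch below, using the google-cloud-dataproc Python client. The project, region, cluster name, secondary-worker count, and disk sizes are illustrative assumptions, not values given in the question.

```python
# Sketch of option D: SSD boot disks plus a larger boot disk on the
# preemptible (secondary) workers, keeping only 2 non-preemptible workers.
# PROJECT, REGION, names, counts, and sizes are placeholder assumptions.
from google.cloud import dataproc_v1

PROJECT = "my-project"   # placeholder
REGION = "us-central1"   # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "spark-shuffle-cluster",  # placeholder
    "config": {
        "master_config": {
            "num_instances": 1,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # The 2 non-preemptible workers required by the scenario.
        "worker_config": {
            "num_instances": 2,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible secondary workers with an overridden (larger) SSD
        # boot disk, since shuffle data spills to the boot disk by default.
        "secondary_worker_config": {
            "num_instances": 8,  # illustrative count
            "preemptibility": "PREEMPTIBLE",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```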
Comments (17)
Should be A:
https://stackoverflow.com/questions/42918663/is-it-better-to-have-one-large-parquet-file-or-lots-of-smaller-parquet-files
https://www.dremio.com/tuning-parquet/
C and D would improve performance, but you'd have to pay more.
👍 63
rickywck · 2020/03/17:
Answer should be D
👍 12

madhu1171 · 2020/03/14:
Selected answer: D
Option D is correct
Elimination strategy:
A. Increase the size of your Parquet files to at least 1 GB: doesn't make sense, since the file sizes already fit the scenario (the recommended Parquet file size is between 128 MB and 1 GB).
B. Switch to the TFRecord format (approx. 200 MB per file): doesn't make sense to change the file format.
C. Switch from HDDs to SSDs, copy the input data from GCS to HDFS, run the Spark job, and copy the results back to GCS: doesn't make sense to copy the files from GCS to HDFS for a workload with many shuffle operations.
D. Switch from HDDs to SSDs, and override the preemptible VM configuration to increase the boot disk size: a perfect fit, since a workload with many shuffle operations needs local disk capacity and throughput. Reference: https://cloud.google.com/architecture/hadoop/migrating-apache-spark-jobs-to-cloud-dataproc#optimize_performance
👍 3
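Since the argument for D rests on shuffle data hitting local disk, the cluster-side change is often paired with job-side shuffle settings. A minimal PySpark sketch; the property values are illustrative assumptions, not recommendations from the comment above:

```python
# Sketch: shuffle-related Spark settings one might tune alongside option D.
# All values are placeholder assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("shuffle-heavy-job")
    # Partition count controls shuffle file sizes; the right value
    # depends on data volume and executor count.
    .config("spark.sql.shuffle.partitions", "400")
    # The external shuffle service lets shuffle files outlive individual
    # executors (though not a reclaimed preemptible node).
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .getOrCreate()
)
```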
dish11dish · 2022/11/22:
Shuffle mode