Topic 1 Question 225
Your organization stores customer data in an on-premises Apache Hadoop cluster in Apache Parquet format. Data is processed on a daily basis by Apache Spark jobs that run on the cluster. You are migrating the Spark jobs and Parquet data to Google Cloud. BigQuery will be used in future transformation pipelines, so you need to ensure that your data is available in BigQuery. You want to use managed services, while minimizing ETL data processing changes and overhead costs. What should you do?
A. Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
B. Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
C. Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
D. Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.
User votes
Comments (2)
- Selected answer: A
- Managed services: Dataproc Serverless provides a fully managed Spark environment, reducing overhead and administrative tasks.
- Minimal data processing changes: the Spark pipelines stay largely intact, since they keep working with Parquet files, now stored on Cloud Storage, which minimizes refactoring effort.
- BigQuery integration: Dataproc Serverless can directly access BigQuery, enabling future transformation pipelines without additional data movement.
- Cost-effective: the serverless model scales resources only when needed, optimizing costs for intermittent workloads.
👍 2 · e70ea9e · 2023/12/30
- Selected answer: A
- This option involves moving Parquet files to Cloud Storage, which is a common and cost-effective storage solution for big data and is compatible with Spark jobs.
- Using Dataproc Metastore to manage metadata lets you keep the Hadoop ecosystem's structural information (databases, tables, partitions).
- Running Spark jobs on Dataproc Serverless takes advantage of managed Spark services without managing clusters.
- Once the data is in Cloud Storage, you can also easily load it into BigQuery for further analysis.
👍 2 · raaad · 2024/01/04
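The steps the commenters describe can be sketched with a few CLI commands. This is a minimal illustration only: the bucket names, project, region, service names, and dataset below are placeholders, not values from the question.

```shell
# 1. Copy the Parquet data from HDFS to Cloud Storage
#    (e.g. with Hadoop DistCp from the on-premises cluster).
hadoop distcp hdfs:///data/customers gs://example-data-bucket/customers

# 2. Create a Dataproc Metastore service to hold the Hive metadata.
gcloud metastore services create example-dpms --location=us-central1

# 3. Run the refactored Spark job on Dataproc Serverless,
#    attached to the metastore so table definitions are preserved.
gcloud dataproc batches submit pyspark gs://example-code-bucket/job.py \
    --region=us-central1 \
    --metastore-service=projects/example-project/locations/us-central1/services/example-dpms

# 4. Make the Parquet files on Cloud Storage queryable from BigQuery
#    as an external table, with no extra data movement.
bq mkdef --source_format=PARQUET \
    "gs://example-data-bucket/customers/*.parquet" > table_def.json
bq mk --external_table_definition=table_def.json example_dataset.customers
```

Because the data stays in Parquet on Cloud Storage, the Spark code changes are limited to swapping `hdfs://` paths for `gs://` paths, while BigQuery reads the same files in place.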