Topic 1 Question 4
You want to rebuild your ML pipeline for structured data on Google Cloud. You are using PySpark to conduct data transformations at scale, but your pipelines are taking over 12 hours to run. To speed up development and pipeline run time, you want to use a serverless tool and SQL syntax. You have already moved your raw data into Cloud Storage. How should you build the pipeline on Google Cloud while meeting the speed and processing requirements?
A. Use Data Fusion's GUI to build the transformation pipelines, and then write the data into BigQuery.
B. Convert your PySpark into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
C. Ingest your data into Cloud SQL, convert your PySpark commands into SQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
D. Ingest your data into BigQuery using BigQuery Load, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
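For reference, here is a minimal sketch of what the workflow described in option D could look like with the google-cloud-bigquery Python client: load the raw files from Cloud Storage into BigQuery, then run a SQL transformation that writes to a new table. The bucket path, dataset, table, and column names and the transformation itself are hypothetical placeholders, not part of the question.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Step 1: BigQuery Load -- ingest raw CSV files from Cloud Storage
# into a staging table (schema autodetected for brevity).
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    "gs://my-raw-bucket/sales/*.csv",   # hypothetical source path
    "my_project.analytics.raw_sales",   # hypothetical staging table
    job_config=load_config,
).result()  # block until the load job finishes

# Step 2: transform with BigQuery SQL and write the result to a new table.
transform_sql = """
CREATE OR REPLACE TABLE `my_project.analytics.sales_features` AS
SELECT
  customer_id,
  DATE_TRUNC(order_date, MONTH) AS order_month,
  SUM(amount) AS monthly_spend
FROM `my_project.analytics.raw_sales`
GROUP BY customer_id, order_month
"""
client.query(transform_sql).result()
```

Both steps run as serverless BigQuery jobs, with no cluster or instance to provision.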
Comments (17)
It should be D. Data Fusion does not use SQL syntax.
👍 17 · nunzio144 · 2021/07/22
ANS: A https://cloud.google.com/data-fusion#section-1
- Data Fusion is a serverless approach; leveraging the scalability and reliability of Google services like Dataproc, it offers data integration capabilities with a lower total cost of ownership.
- BigQuery is serverless and supports SQL.
- Dataproc is not serverless; you have to manage clusters.
- Cloud SQL is not serverless; you have to manage instances.
👍 10 · Celia20210714 · 2021/07/18
The option I think is correct: D
Data Fusion does not use SQL syntax, so not A; Dataproc is not serverless, so not B; passing through Cloud SQL is useless, just go straight to BigQuery, so not C; D is correct.
👍 3 · EFIGO · 2022/11/23
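Building on the reasoning in the comments above, here is a minimal sketch (with hypothetical table and column names) of converting a single PySpark aggregation into a BigQuery SQL query that writes to a destination table. It illustrates the option D workflow under those assumptions; it is not a definitive implementation.

```python
from google.cloud import bigquery

# Original PySpark step being replaced (shown only as a reference):
#   df.groupBy("customer_id").agg(F.sum("amount").alias("total_spend"))

client = bigquery.Client()

# Write the query result to a new table instead of returning rows,
# so downstream pipeline steps read the transformed data directly.
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string(
        "my_project.analytics.customer_totals"  # hypothetical output table
    ),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
SELECT customer_id, SUM(amount) AS total_spend
FROM `my_project.analytics.raw_sales`  -- hypothetical input table
GROUP BY customer_id
"""
client.query(sql, job_config=job_config).result()
```

Setting a destination table in QueryJobConfig is an alternative to a CREATE TABLE AS SELECT statement; either way the transformation runs as a serverless BigQuery job.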