Topic 1 Question 9
You are designing a pipeline to process data files that arrive in Cloud Storage by 3:00 am each day. Data processing is performed in stages, where the output of one stage becomes the input of the next. Each stage takes a long time to run. Occasionally a stage fails, and you have to address the problem. You need to ensure that the final output is generated as quickly as possible. What should you do?
A. Design a Spark program that runs under Dataproc. Code the program to wait for user input when an error is detected. Rerun the last action after correcting any stage output data errors.
B. Design the pipeline as a set of PTransforms in Dataflow. Restart the pipeline after correcting any stage output data errors.
C. Design the workflow as a Cloud Workflow instance. Code the workflow to jump to a given stage based on an input parameter. Rerun the workflow after correcting any stage output data errors.
D. Design the processing as a directed acyclic graph (DAG) in Cloud Composer. Clear the state of the failed task after correcting any stage output data errors.
Comments (1)
- Selected answer: D
The best option is D: design the processing as a directed acyclic graph (DAG) in Cloud Composer and clear the state of the failed task after correcting any stage output data errors. Cloud Composer (managed Apache Airflow) is purpose-built for orchestrating multi-stage data pipelines. It provides robust error handling with task-level rerun, so only the failed stage and its downstream stages are re-executed after you fix the data, and it runs as a fully managed, scheduled environment. That makes it the fastest path to the final output when a stage occasionally fails. Options A, B, and C are less efficient: they rely on manual intervention, offer less granular error recovery (e.g., restarting the entire pipeline), or lack dedicated workflow orchestration features.
👍 1 · n2183712847 · 2025/02/27
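To illustrate the task-level recovery the comment describes, here is a minimal, hypothetical sketch of how the staged pipeline could be expressed as an Airflow DAG in Cloud Composer. It is not part of the original question or comment; the DAG id, schedule, task names, and BashOperator commands are placeholder assumptions standing in for the real processing stages.

```python
# Hypothetical sketch of the staged pipeline as a Cloud Composer (Airflow 2) DAG.
# DAG id, schedule, task names, and commands are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_staged_processing",      # hypothetical DAG name
    schedule_interval="0 3 * * *",          # daily run after files arrive by 3:00 am
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # Each processing stage is its own task; the output of one stage is the
    # input of the next.
    stage_1 = BashOperator(task_id="stage_1", bash_command="echo 'process stage 1'")
    stage_2 = BashOperator(task_id="stage_2", bash_command="echo 'process stage 2'")
    stage_3 = BashOperator(task_id="stage_3", bash_command="echo 'process stage 3'")

    # Chaining the tasks defines the DAG. If stage_2 fails, you can correct its
    # input data, clear only stage_2 (and its downstream tasks), and the run
    # resumes from there without re-executing stage_1.
    stage_1 >> stage_2 >> stage_3
```

After correcting the stage's output data, clearing the failed task's state (from the Airflow UI, or via the Airflow 2 CLI command `airflow tasks clear`) causes Airflow to rerun only that task and anything downstream of it, rather than the entire pipeline, which is why option D produces the final output as quickly as possible.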