Topic 1 Question 88
Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve the reliability of the pipeline (including being able to reprocess all failing data). What should you do?
A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.
B. Add a try… catch block to your DoFn that transforms the data, extract erroneous rows from logs.
C. Add a try… catch block to your DoFn that transforms the data, write erroneous rows to Pub/Sub directly from the DoFn.
D. Add a try… catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to Pub/Sub later.
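For context, below is a minimal sketch of the try/except-plus-side-output pattern described in option D, written with the Apache Beam Python SDK. The input path, the JSON-parsing step, and the 'errors' tag name are illustrative assumptions, not part of the question.

import json

import apache_beam as beam
from apache_beam import pvalue


class TransformRow(beam.DoFn):
    """Transforms each row; rows that raise are routed to an 'errors' side output."""

    def process(self, element):
        try:
            yield json.loads(element)  # stand-in for the real transformation logic
        except Exception:
            # Keep the raw failing row so it can be reprocessed later.
            yield pvalue.TaggedOutput('errors', element)


with beam.Pipeline() as p:
    results = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')  # hypothetical source
        | 'Transform' >> beam.ParDo(TransformRow()).with_outputs('errors', main='ok')
    )
    good_rows = results.ok        # PCollection of successfully transformed rows
    failed_rows = results.errors  # PCollection of raw rows that failed

The failed_rows PCollection can then be written to a dead-letter sink and replayed once the underlying data issue is fixed.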
User votes
Comments (15)
The error records are written directly to Pub/Sub from the DoFn (the Python equivalent works the same way). You cannot directly write a PCollection to Pub/Sub; you have to extract each record and write them one at a time. Why do the additional work, and why not write it using PubsubIO in the DoFn itself? You can write the whole PCollection to BigQuery, though, as explained in
👍 6 · nickyshil · 2022/09/28 · Selected answer: D
C is a big NO. Writing to Pub/Sub inside the DoFn will create a bottleneck in the pipeline. For I/O, we should always use the dedicated I/O transforms (e.g. PubsubIO). Using a sideOutput is the correct answer here. There is a Qwiklab about this; it is recommended to do that lab to understand it better.
👍 5 · midgoo · 2023/03/01 · Answer: C
👍 4 · nickyshil · 2022/09/28
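Continuing the sketch above, and regarding the comment that recommends using the I/O connectors rather than publishing from inside the DoFn: the tagged 'errors' PCollection can be routed through the built-in connector (beam.io.WriteToPubSub in the Python SDK, the counterpart of Java's PubsubIO). The topic name below is hypothetical, and WriteToPubSub expects byte payloads in a streaming pipeline.

import apache_beam as beam

# failed_rows is the tagged 'errors' PCollection from the sketch above.
_ = (
    failed_rows
    | 'EncodeErrors' >> beam.Map(lambda row: row.encode('utf-8'))
    | 'ToDeadLetterTopic' >> beam.io.WriteToPubSub(
        topic='projects/my-project/topics/etl-dead-letter')  # hypothetical topic
)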