Topic 1 Question 178
You are testing a Dataflow pipeline to ingest and transform text files. The files are gzip-compressed, errors are written to a dead-letter queue, and you are using SideInputs to join data. You notice that the pipeline is taking longer to complete than expected. What should you do to expedite the Dataflow job?
A. Switch to compressed Avro files.
B. Reduce the batch size.
C. Retry records that throw an error.
D. Use CoGroupByKey instead of the SideInput.
User votes
Comments (12)
John_Pongthorn 2022/09/25 - Selected Answer: D
D is the most likely answer. There are a lot of reference docs comparing the two approaches:
https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#choose_correctly_between_side_inputs_or_cogroupbykey_for_joins
https://stackoverflow.com/questions/58080383/sideinput-i-o-kills-performance
👍 14

zellck 2022/11/29 - Selected Answer: D
D is the answer.
https://cloud.google.com/architecture/building-production-ready-data-pipelines-using-dataflow-developing-and-testing#choose_correctly_between_side_inputs_or_cogroupbykey_for_joins
The CoGroupByKey transform is a core Beam transform that merges (flattens) multiple PCollection objects and groups elements that have a common key. Unlike a side input, which makes the entire side input data available to each worker, CoGroupByKey performs a shuffle (grouping) operation to distribute data across workers. CoGroupByKey is therefore ideal when the PCollection objects you want to join are very large and don't fit into worker memory.
Use CoGroupByKey if you need to fetch a large proportion of a PCollection object that significantly exceeds worker memory.
👍 6

AWSandeep 2022/09/02 - Selected Answer: B
B. Reduce the batch size.
👍 3
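
Since the top-voted comments hinge on the difference between a side-input join and a CoGroupByKey join, here is a minimal Apache Beam (Python) sketch contrasting the two styles. The collection names (`orders`, `users`) and sample records are illustrative assumptions, not taken from the question.

```python
# Minimal sketch contrasting a side-input join with a CoGroupByKey join.
# `orders` and `users` are hypothetical collections keyed by a shared id.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    orders = pipeline | "Orders" >> beam.Create([(1, "order_a"), (2, "order_b")])
    users = pipeline | "Users" >> beam.Create([(1, "alice"), (2, "bob")])

    # Side-input join: AsDict materializes the entire `users` collection on
    # every worker, which is what degrades as the side input grows.
    side_joined = orders | "SideInputJoin" >> beam.Map(
        lambda kv, users_map: (kv[0], kv[1], users_map.get(kv[0])),
        users_map=beam.pvalue.AsDict(users),
    )

    # CoGroupByKey join: a shuffle groups both collections by key across
    # workers, so neither input has to fit into a single worker's memory.
    cogrouped = (
        {"orders": orders, "users": users}
        | "JoinByKey" >> beam.CoGroupByKey()
        | "FlattenPairs" >> beam.FlatMap(
            lambda kv: [
                (kv[0], order, user)
                for order in kv[1]["orders"]
                for user in kv[1]["users"]
            ]
        )
    )
```

On Dataflow, the shuffle behind CoGroupByKey is distributed across workers (or offloaded to the shuffle service), which is why it scales past worker memory where a large side input does not.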
Shuffle mode
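
The fragment above presumably refers to Dataflow's service-based shuffle, which executes the grouping behind CoGroupByKey on batch jobs. As a hedged illustration, the snippet below shows one way to request it through pipeline options; the project, region, and bucket values are placeholders, and in most regions the service-based shuffle is already the default.

```python
# Hypothetical sketch: requesting service-based Dataflow Shuffle for a batch
# job via pipeline options. Project, region, and bucket are placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder project id
    region="us-central1",                  # placeholder region
    temp_location="gs://my-bucket/tmp",    # placeholder staging bucket
    experiments=["shuffle_mode=service"],  # batch Dataflow Shuffle flag
)
```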