Topic 1 Question 82
You are profiling your TensorFlow model's training performance and notice an issue caused by inefficiencies in the input data pipeline for a dataset stored as a single 5-terabyte CSV file on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?
A. Preprocess the input CSV file into a TFRecord file.
B. Randomly select a 10 gigabyte subset of the data to train your model.
C. Split into multiple CSV files and use a parallel interleave transformation.
D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.
User votes
Comments (12)
Could anyone be kind enough to explain why C is preferred over A? My initial guess was A, but everyone here seems to unanimously prefer C. Is it because the question is not about optimizing I/O performance, but rather the input pipeline, i.e., processing data that has already arrived within the TF input pipeline (non-I/O)? I'm just trying to understand. Thanks in advance for any replies!
👍 4  SMASL  2023/02/14 - Selected Answer: A
Option B (randomly selecting a 10 gigabyte subset of the data) could lead to a loss of useful data and may not be representative of the entire dataset. Option C (splitting into multiple CSV files and using a parallel interleave transformation) may also improve the performance, but may be more complex to implement and maintain, and may not be as efficient as converting to TFRecord. Option D (setting the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method) is not directly related to the input data format and may not provide as significant a performance improvement as converting to TFRecord.
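To make option A concrete, here is a minimal sketch of converting CSV rows to TFRecord and reading them back with tf.data. The tiny two-row CSV, the feature/label column layout, and the file paths are all hypothetical stand-ins for the 5 TB file; only the tf.io.TFRecordWriter / tf.train.Example / tf.data.TFRecordDataset APIs are the real mechanism.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical tiny CSV standing in for the 5 TB file:
# two float feature columns followed by an integer label.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([[0.1, 0.2, 1], [0.3, 0.4, 0]])

# One-time preprocessing: serialize each CSV row as a tf.train.Example
# and write the records to a binary TFRecord file.
tfrecord_path = os.path.join(tmpdir, "data.tfrecord")
with tf.io.TFRecordWriter(tfrecord_path) as writer:
    with open(csv_path) as f:
        for row in csv.reader(f):
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(v) for v in row[:-1]])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(row[-1])])),
            }))
            writer.write(example.SerializeToString())

# Training-time input pipeline: TFRecordDataset avoids per-row CSV parsing,
# since each record is decoded by a fixed-schema parse instead.
def parse(record):
    return tf.io.parse_single_example(record, {
        "features": tf.io.FixedLenFeature([2], tf.float32),
        "label": tf.io.FixedLenFeature([1], tf.int64),
    })

ds = tf.data.TFRecordDataset(tfrecord_path).map(parse)
rows = list(ds.as_numpy_iterator())
```

In practice the conversion would be done once as a batch job (e.g. with Dataflow), and the resulting TFRecords would themselves be sharded so that option C's parallel reads still apply.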
👍 3  shankalman717  2023/02/22 - Selected Answer: C
By splitting the file into shards, we can use a parallel interleave transformation to load the datasets in parallel: https://www.tensorflow.org/guide/data_performance
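A minimal sketch of the interleave pattern this comment describes, assuming the big CSV has already been split into shards (the four tiny shard files generated here are hypothetical): Dataset.interleave with num_parallel_calls opens several shards concurrently and merges their records, overlapping I/O across files.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical shard files standing in for the split 5 TB CSV:
# each row is (shard_id, row_index), two int columns, no header.
tmpdir = tempfile.mkdtemp()
paths = []
for shard in range(4):
    p = os.path.join(tmpdir, f"shard-{shard}.csv")
    with open(p, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(3):
            w.writerow([shard, i])
    paths.append(p)

# Build a dataset of filenames, then interleave a per-file CsvDataset.
# cycle_length controls how many shards are read at once;
# num_parallel_calls=AUTOTUNE lets tf.data parallelize the reads.
files = tf.data.Dataset.from_tensor_slices(sorted(paths))
ds = files.interleave(
    lambda path: tf.data.experimental.CsvDataset(
        path, record_defaults=[tf.int32, tf.int32]),
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
)
rows = list(ds.as_numpy_iterator())
```

Because records from different shards are mixed as they are read, this also gives a coarse shuffle for free; a Dataset.shuffle with a modest buffer is usually still added afterwards.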
👍 2  LearnSodas  2022/12/11
Shuffle mode