Topic 1 Question 82
You are profiling your TensorFlow model's training performance and notice an issue caused by inefficiencies in the input data pipeline for a dataset stored as a single 5-terabyte CSV file on Cloud Storage. You need to optimize the input pipeline performance. Which action should you try first to increase the efficiency of your pipeline?
A. Preprocess the input CSV file into a TFRecord file.
B. Randomly select a 10 gigabyte subset of the data to train your model.
C. Split into multiple CSV files and use a parallel interleave transformation.
D. Set the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method.
User votes
Comments (12)
Could anyone be kind enough to explain why C is preferred over A? My initial guess was A, but everyone here seems to unanimously prefer C. Is it because the question is not about optimizing I/O performance, but rather the input pipeline, i.e., processing data that has already arrived within the TF input pipeline (non-I/O)? I'm just trying to understand. Thanks in advance for any replies!
👍 4  SMASL  2023/02/14 - Selected Answer: A
Option B (randomly selecting a 10 gigabyte subset of the data) could lead to a loss of useful data and may not be representative of the entire dataset. Option C (splitting into multiple CSV files and using a parallel interleave transformation) may also improve the performance, but may be more complex to implement and maintain, and may not be as efficient as converting to TFRecord. Option D (setting the reshuffle_each_iteration parameter to true in the tf.data.Dataset.shuffle method) is not directly related to the input data format and may not provide as significant a performance improvement as converting to TFRecord.
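To make option A concrete, here is a minimal sketch of converting CSV rows to TFRecord and reading them back with tf.data. The tiny two-row CSV, the feature/label column layout, and the file paths are all hypothetical stand-ins for the 5 TB file; only the tf.io.TFRecordWriter / tf.train.Example / tf.data.TFRecordDataset APIs are the real mechanism.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical tiny CSV standing in for the 5 TB file:
# two float feature columns followed by an integer label.
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "data.csv")
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows([[0.1, 0.2, 1], [0.3, 0.4, 0]])

# One-time preprocessing: serialize each CSV row as a tf.train.Example
# and write the records to a binary TFRecord file.
tfrecord_path = os.path.join(tmpdir, "data.tfrecord")
with tf.io.TFRecordWriter(tfrecord_path) as writer:
    with open(csv_path) as f:
        for row in csv.reader(f):
            example = tf.train.Example(features=tf.train.Features(feature={
                "features": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[float(v) for v in row[:-1]])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(row[-1])])),
            }))
            writer.write(example.SerializeToString())

# Training-time input pipeline: TFRecordDataset avoids per-row CSV parsing,
# since each record is decoded by a fixed-schema parse instead.
def parse(record):
    return tf.io.parse_single_example(record, {
        "features": tf.io.FixedLenFeature([2], tf.float32),
        "label": tf.io.FixedLenFeature([1], tf.int64),
    })

ds = tf.data.TFRecordDataset(tfrecord_path).map(parse)
rows = list(ds.as_numpy_iterator())
```

In practice the conversion would be done once as a batch job (e.g. with Dataflow), and the resulting TFRecords would themselves be sharded so that option C's parallel reads still apply.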
👍 3  shankalman717  2023/02/22 - Selected Answer: C
By splitting the file into shards, we can use a parallel interleave transformation to load the datasets in parallel: https://www.tensorflow.org/guide/data_performance
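A minimal sketch of the interleave pattern this comment describes, assuming the big CSV has already been split into shards (the four tiny shard files generated here are hypothetical): Dataset.interleave with num_parallel_calls opens several shards concurrently and merges their records, overlapping I/O across files.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical shard files standing in for the split 5 TB CSV:
# each row is (shard_id, row_index), two int columns, no header.
tmpdir = tempfile.mkdtemp()
paths = []
for shard in range(4):
    p = os.path.join(tmpdir, f"shard-{shard}.csv")
    with open(p, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(3):
            w.writerow([shard, i])
    paths.append(p)

# Build a dataset of filenames, then interleave a per-file CsvDataset.
# cycle_length controls how many shards are read at once;
# num_parallel_calls=AUTOTUNE lets tf.data parallelize the reads.
files = tf.data.Dataset.from_tensor_slices(sorted(paths))
ds = files.interleave(
    lambda path: tf.data.experimental.CsvDataset(
        path, record_defaults=[tf.int32, tf.int32]),
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
)
rows = list(ds.as_numpy_iterator())
```

Because records from different shards are mixed as they are read, this also gives a coarse shuffle for free; a Dataset.shuffle with a modest buffer is usually still added afterwards.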
👍 2  LearnSodas  2022/12/11
Shuffle mode