Topic 1 Question 155
You are training an object detection machine learning model on a dataset that consists of three million X-ray images, each roughly 2 GB in size. You are using Vertex AI Training to run a custom training application on a Compute Engine instance with 32 cores, 128 GB of RAM, and 1 NVIDIA P100 GPU. You notice that model training is taking a very long time. You want to decrease training time without sacrificing model performance. What should you do?
A. Increase the instance memory to 512 GB, and increase the batch size.
B. Replace the NVIDIA P100 GPU with a K80 GPU in the training job.
C. Enable early stopping in your Vertex AI Training job.
D. Use the tf.distribute.Strategy API and run a distributed training job.
Comments (5)
- Selected answer: B
The same comment as in Q96. If we look at our training infrastructure, we can see the bottleneck is clearly the GPU, which has 12 GB or 16 GB of memory depending on the model (https://www.leadtek.com/eng/products/ai_hpc(37)/tesla_p100(761)/detail). This means we can afford a batch size of only 6-8 images (2 GB each), even if we assume the GPU is 100% utilized and the model weights take no memory. And remember the training set has 3M images, which means each epoch will take 375-500K steps even in this unlikely best case.
With 32 cores and 128 GB of memory, we can afford higher batch sizes (e.g., 32), so moving to a K80 GPU, which has 24 GB of memory, will accelerate the training.
A is wrong because we can't afford a larger batch size with the current GPU. D is wrong because we don't have multiple GPUs and the current GPU is saturated. C is a viable option, but it seems less optimal than B.
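Purely as a back-of-the-envelope illustration of this comment's arithmetic (the 12/16 GB P100 and 24 GB K80 figures and the images-only-in-memory best case are the comment's assumptions, not facts from the question):

```python
# Best-case arithmetic from the comment above: batch size is capped by
# GPU memory / image size, and steps per epoch = dataset size / batch size.
IMAGE_GB = 2             # per-image size given in the question
DATASET_SIZE = 3_000_000 # number of training images

for gpu, mem_gb in [("P100 12 GB", 12), ("P100 16 GB", 16), ("K80 24 GB", 24)]:
    batch = mem_gb // IMAGE_GB     # 6, 8, or 12 images per step
    steps = DATASET_SIZE // batch  # 500K, 375K, or 250K steps per epoch
    print(f"{gpu}: batch size {batch}, ~{steps:,} steps/epoch")
```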
👍 2 · [Removed] · 2023/07/25

- Selected answer: A
A, since we have just one GPU; we could not use tf.distribute.Strategy as in D.
👍 1 · powerby35 · 2023/07/13

- Selected answer: D
To decrease training time without sacrificing model performance, the best approach is to use the tf.distribute.Strategy API and run a distributed training job, leveraging the capabilities of the available GPU(s) for parallelized training.
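As a minimal sketch of what this could look like with tf.distribute.MirroredStrategy (a toy model and random tensors stand in for the real X-ray detector and input pipeline; on a machine with a single device the strategy simply runs with one replica):

```python
import tensorflow as tf

# MirroredStrategy replicates the model across all local GPUs and splits
# each global batch between them, so throughput scales with GPU count.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch with the replica count so per-GPU memory stays flat.
per_gpu_batch = 8
global_batch = per_gpu_batch * strategy.num_replicas_in_sync

# Toy stand-in for the real X-ray dataset and detection model.
images = tf.random.uniform((64, 128, 128, 3))
labels = tf.random.uniform((64, 1), maxval=2, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(global_batch)

with strategy.scope():  # variables created here are mirrored on every replica
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

model.fit(dataset, epochs=1)  # Keras shards the tf.data pipeline per replica
```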
👍 1 · PST21 · 2023/07/20