Topic 1 Question 116
You work for a biotech startup that is experimenting with deep learning ML models based on properties of biological organisms. Your team frequently works on early-stage experiments with new architectures of ML models, and writes custom TensorFlow ops in C++. You train your models on large datasets and large batch sizes. Your typical batch size has 1024 examples, and each example is about 1 MB in size. The average size of a network with all weights and embeddings is 20 GB. What hardware should you choose for your models?
A. A cluster with 2 n1-highcpu-64 machines, each with 8 NVIDIA Tesla V100 GPUs (128 GB GPU memory in total), and an n1-highcpu-64 machine with 64 vCPUs and 58 GB RAM
B. A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM
C. A cluster with an n1-highcpu-64 machine with a v2-8 TPU and 64 GB RAM
D. A cluster with 4 n1-highcpu-96 machines, each with 96 vCPUs and 86 GB RAM
Comments (8)
- Selected Answer: B
To determine the appropriate hardware for training the models, we need to calculate the required memory and processing power based on the size of the model and the size of the input data.
Given that the batch size is 1024 and each example is 1 MB, the total size of each batch is 1024 * 1 MB = 1024 MB = 1 GB. Therefore, we need to load 1 GB of data into memory for each batch.
The network itself is 20 GB, which does not fit on a single 16 GB V100 (option A) but fits comfortably in the 40 GB of a single A100 (option B).
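A quick back-of-the-envelope check of this in Python (a minimal sketch; the figures are copied from the question and derived from the answer choices):

```python
# Figures from the question and the answer choices.
example_mb = 1           # ~1 MB per training example
batch_size = 1024        # examples per batch
model_gb = 20            # weights + embeddings

batch_gb = example_mb * batch_size / 1024
print(f"Input per batch: {batch_gb:.0f} GB")        # 1 GB

# Per-GPU memory implied by the options:
v100_gb = 128 / 8    # option A: 128 GB across 8 V100s -> 16 GB each
a100_gb = 640 / 16   # option B: 640 GB across 16 A100s -> 40 GB each
print(f"20 GB model fits on one V100? {model_gb <= v100_gb}")  # False
print(f"20 GB model fits on one A100? {model_gb <= a100_gb}")  # True
```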
👍 3 · TNT87 · 2023/03/07
- Selected Answer: B
The best hardware for your models would be a cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM.
This hardware will give you the following benefits:
- High GPU memory: each A100 has 40 GB of memory (640 GB across 16 GPUs), enough to hold the 20 GB of weights and embeddings on a single device.
- Large batch sizes: with 16 GPUs per machine, you can train with large batch sizes, which improves training throughput.
- Fast CPUs: the 96 vCPUs per machine provide the processing power needed to run your custom TensorFlow ops written in C++.
- Adequate RAM: 1.4 TB of RAM per machine leaves ample host memory for staging the 1 GB input batches.

The other options are not as suitable. Option A offers only 16 GB per V100 (128 GB across 8 GPUs), too little to hold the 20 GB model on one device. Option C uses a TPU, which cannot run custom C++ TensorFlow ops. Option D has plenty of vCPUs and RAM but no GPUs at all, so training would be far slower.
Therefore, the best hardware for your models is a cluster with 2 a2-megagpu-16g machines.
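As a rough illustration (not part of the original answer), this is how such a multi-GPU, data-parallel setup is typically driven in TensorFlow with tf.distribute.MirroredStrategy; the model and data below are placeholder stand-ins for the team's custom architecture and dataset:

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every local GPU and
# splits each global batch across the replicas.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

GLOBAL_BATCH = 1024  # batch size from the question

with strategy.scope():
    # Placeholder model; variables created in this scope are mirrored.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Tiny synthetic stand-in for the real 1 MB-per-example dataset.
x = tf.random.normal([GLOBAL_BATCH, 64])
y = tf.random.normal([GLOBAL_BATCH, 1])
model.fit(x, y, batch_size=GLOBAL_BATCH, epochs=1)
```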
👍 3 · Antmal · 2023/04/15
- Selected Answer: D
D. CPUs are recommended for custom TensorFlow ops written in C++ (see the sketch below):
- https://cloud.google.com/tpu/docs/tensorflow-ops (Cloud TPU supports only a fixed set of TensorFlow ops; custom C++ ops are not available on TPUs)
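For context, custom C++ ops are compiled into a shared library and loaded from Python, with kernels registered per device in the C++ source; there is no TPU registration path for them. A minimal sketch of the loading side, where the library name and op follow the ZeroOut example in the TensorFlow custom-op guide and are hypothetical here:

```python
import tensorflow as tf

# Hypothetical: zero_out.so is a custom op compiled from C++ with
# REGISTER_KERNEL_BUILDER(Name("ZeroOut").Device(DEVICE_CPU), ...).
# Kernels can be registered for CPU or GPU devices, but not for TPUs,
# which is why custom C++ ops rule out option C.
custom_ops = tf.load_op_library("./zero_out.so")
print(custom_ops.zero_out([[1, 2], [3, 4]]))
```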
👍 2 · hiromi · 2022/12/21