Topic 1 Question 225

AWS Certified Machine Learning - Specialty

A data scientist is training a large PyTorch model by using Amazon SageMaker. It takes 10 hours on average to train the model on GPU instances. The data scientist suspects that training is not converging and that resource utilization is not optimal.

What should the data scientist do to identify and address training issues with the LEAST development effort?
- A. Use CPU utilization metrics that are captured in Amazon CloudWatch. Configure a CloudWatch alarm to stop the training job early if low CPU utilization occurs.
- B. Use high-resolution custom metrics that are captured in Amazon CloudWatch. Configure an AWS Lambda function to analyze the metrics and to stop the training job early if issues are detected.
- C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
- D. Use the SageMaker Debugger confusion and feature_importance_overweight built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
Comments (4)
- It has to be C.
  
  👍 5
  Amit11011996 2023/02/06
- Selected answer: C
  https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html
  
  👍 4
  Jerry84 2023/02/14
- Selected answer: C
  C. Use the SageMaker Debugger vanishing_gradient and LowGPUUtilization built-in rules to detect issues and to launch the StopTrainingJob action if issues are detected.
  
  SageMaker Debugger is a built-in tool for debugging and profiling machine learning models trained in SageMaker. The data scientist suspects two problems: training is not converging and resource utilization is poor. The vanishing_gradient and LowGPUUtilization built-in rules target exactly these issues, detecting vanishing gradients (which can stall convergence) and low GPU utilization. Launching the StopTrainingJob action when a rule fires stops the job early, which saves time and resources on a 10-hour run. This approach requires the least development effort because the rules are built into SageMaker and do not require the data scientist to create custom metrics, write a Lambda function, or configure CloudWatch alarms.
  
  👍 3
  AjoseO 2023/02/20
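
For anyone wondering what the two rules in option C actually look for, here is a rough plain-Python sketch of the idea. It is not SageMaker Debugger's implementation and does not call the SageMaker SDK; the function names (`gradients_vanishing`, `gpu_underutilized`, `should_stop`) and the threshold values are hypothetical, chosen only for illustration. It assumes a vanishing-gradient check compares the mean absolute gradient against a small threshold, and a low-GPU check compares a high percentile of utilization against a floor.

```python
import math
from statistics import mean


def gradients_vanishing(gradients, threshold=1e-7):
    """Return True when the mean absolute gradient falls below threshold.

    The threshold is an illustrative assumption, not a documented default.
    """
    return mean(abs(g) for g in gradients) < threshold


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def gpu_underutilized(utilization, threshold_p95=70.0):
    """Return True when even the 95th percentile of GPU utilization (%) is low.

    The 70% floor is an illustrative assumption.
    """
    return percentile(utilization, 95) < threshold_p95


def should_stop(gradients, utilization):
    """Return the names of the checks that fired; non-empty means stop early."""
    triggered = []
    if gradients_vanishing(gradients):
        triggered.append("vanishing_gradient")
    if gpu_underutilized(utilization):
        triggered.append("LowGPUUtilization")
    return triggered
```

With Debugger, checks of this kind run alongside the training job, so none of this has to be written or maintained by hand, which is the "least development effort" argument for C.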