Topic 1 Question 158
A data scientist is training a text classification model by using the Amazon SageMaker built-in BlazingText algorithm. There are 5 classes in the dataset, with 300 samples for category A, 292 samples for category B, 240 samples for category C, 258 samples for category D, and 310 samples for category E. The data scientist shuffles the data and splits off 10% for testing. After training the model, the data scientist generates confusion matrices for the training and test sets.
What could the data scientist conclude from these results?

A. Classes C and D are too similar.
B. The dataset is too small for holdout cross-validation.
C. The data distribution is skewed.
D. The model is overfitting for classes B and E.
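For context, here is a minimal sketch of the workflow the question describes, assuming scikit-learn for the split and the confusion matrix; the samples list and the get_predictions helper are hypothetical placeholders, and the __label__<class> prefix is the input format BlazingText's supervised mode expects.

```python
# Hypothetical sketch of the evaluation workflow in the question.
# Only the split and confusion-matrix steps are concrete; obtaining
# predictions from the trained BlazingText model is left as a placeholder.
import random
from sklearn.metrics import confusion_matrix

LABELS = ["A", "B", "C", "D", "E"]

def split_holdout(samples, test_fraction=0.10, seed=42):
    """Shuffle and split off 10% for testing, as described in the question."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def to_blazingtext_lines(samples):
    """BlazingText supervised mode expects one '__label__<class> <text>' per line."""
    return [f"__label__{label} {text}" for label, text in samples]

def report_confusion(y_true, y_pred):
    """Print the 5x5 confusion matrix; heavy off-diagonal counts between
    C and D would support answer A (the two classes look too similar)."""
    print(LABELS)
    print(confusion_matrix(y_true, y_pred, labels=LABELS))

# Hypothetical usage, with get_predictions standing in for calls to the
# deployed BlazingText endpoint:
# train, test = split_holdout(samples)          # samples: list of (label, text)
# y_true = [label for label, _ in test]
# y_pred = get_predictions([text for _, text in test])
# report_confusion(y_true, y_pred)
```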
Comments (6)
Isn't it A? The model doesn't classify C and D well.
👍 7

LydiaGom 2022/05/11 - Selected answer: A
The correct answer should be A; the model is clearly unable to tell C and D apart.
The reason B is incorrect is subtle: there is holdout validation and there is cross-validation, but no such thing as "holdout cross-validation". While I think it would be more reasonable to use CV rather than a holdout split with such a small dataset, that option mixes the terms and is therefore wrong.
Also, the test-set confusion matrix is still fairly comparable to the training-set one, so I wouldn't say there is objective evidence that the holdout split was a bad choice here.
👍 5
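To make the terminology point above concrete, here is a minimal sketch contrasting a single holdout split with stratified k-fold cross-validation, assuming scikit-learn and placeholder X / y arrays standing in for the roughly 1,400 samples:

```python
# Sketch contrasting a single holdout split with stratified k-fold CV.
# X (texts or features) and y (the five class labels) are assumed placeholders.
from sklearn.model_selection import StratifiedKFold, train_test_split

def holdout_split(X, y, test_size=0.10, seed=42):
    """One 90/10 split; stratify=y keeps the per-class mix in both parts."""
    return train_test_split(X, y, test_size=test_size, stratify=y, random_state=seed)

def cv_folds(X, y, n_splits=5, seed=42):
    """Stratified 5-fold CV: every sample lands in exactly one test fold,
    which uses a small dataset more thoroughly than one holdout split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    yield from skf.split(X, y)
```

StratifiedKFold with shuffle=True covers both the shuffling and the class-balance concerns in one step; it is a different strategy from the single holdout split, not a variant of it.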
dolorez 2022/05/24
I think the answer is A. The model doesn't perform well on classes C and D in both the training and the test sets. I don't think B is relevant to the question (cross-validation is not mentioned in the question).
👍 3
tgaos 2022/05/28