Topic 1 Question 59
A data scientist has explored and sanitized a dataset in preparation for the modeling phase of a supervised learning task. The statistical dispersion can vary widely between features, sometimes by several orders of magnitude. Before moving on to the modeling phase, the data scientist wants to ensure that the prediction performance on the production data is as accurate as possible. Which sequence of steps should the data scientist take to meet these requirements?
A. Apply random sampling to the dataset. Then split the dataset into training, validation, and test sets.
B. Split the dataset into training, validation, and test sets. Then rescale the training set and apply the same scaling to the validation and test sets.
C. Rescale the dataset. Then split the dataset into training, validation, and test sets.
D. Split the dataset into training, validation, and test sets. Then rescale the training set, the validation set, and the test set independently.
Comments (17)
- Selected answer: C
C would be my answer here. Rescaling each set independently could lead to strange skews. The training, test, and evaluation sets should all be on the same scale.
👍 17 cron0001 2022/04/23
- Selected answer: B
C also leads to data leakage. You are using the test data to scale everything, so information from the test set shapes the scaling that is applied when you build the model on the training set and check it against the validation set.
👍 11 masoa3b 2022/10/26
- Selected answer: B
It is 100% B here. Scaling should be based on the training data ALONE, and the validation and test data should be transformed by that same scaler fitted on the training data. C would introduce information from the test data into the scaling of the training data. That is information leakage, and it can lead to the model picking up information from the test data.
👍 5 Siyuan_Zhu 2023/02/21
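For anyone who wants to see what B looks like in practice, here is a minimal sketch using only the Python standard library. It assumes standardization (subtract the mean, divide by the standard deviation) as the rescaling method, which the question does not specify. The data, the 70/15/15 split, and the helper names `fit_standardizer` and `transform` are made up for illustration.

```python
import random
import statistics

def fit_standardizer(rows):
    """Learn per-feature mean and standard deviation from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for a constant feature
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Apply previously learned statistics to any set of rows."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

random.seed(0)
# Hypothetical data: two features whose spread differs by orders of magnitude
data = [[random.gauss(0, 1), random.gauss(5000, 2000)] for _ in range(1000)]
random.shuffle(data)

# Step 1: split first (70/15/15 is an arbitrary choice for this sketch)
train, val, test = data[:700], data[700:850], data[850:]

# Step 2: fit the scaling statistics on the training set ALONE
means, stds = fit_standardizer(train)

# Step 3: apply the same training-set statistics to all three sets
train_scaled = transform(train, means, stds)
val_scaled = transform(val, means, stds)
test_scaled = transform(test, means, stds)
```

Option C would correspond to calling `fit_standardizer(data)` before the split, so the test rows would contribute to `means` and `stds`. Option D would correspond to calling `fit_standardizer` separately on `val` and `test`, which puts the three sets on different scales.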