Topic 1 Question 34
A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age. Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population. How should the Data Scientist correct this issue?
A. Drop all records from the dataset where age has been set to 0.
B. Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
C. Drop the age feature from the dataset and train the model using the rest of the features.
D. Use k-means clustering to handle missing features.
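To make the mean/median option concrete, here is a minimal sketch that treats an age of 0 as a missing value and fills it with the median of the valid ages. The function name `impute_zero_ages` and the sample values are made up for illustration; nothing here comes from the actual study data.

```python
from statistics import median


def impute_zero_ages(ages):
    """Replace ages recorded as 0 (treated as missing) with the median valid age."""
    valid = [a for a in ages if a > 0]
    fill = median(valid)  # median of the non-zero ages only
    return [fill if a == 0 else a for a in ages]


# Hypothetical sample: two of the six ages were entered as 0.
ages = [70, 0, 82, 66, 0, 75]
imputed = impute_zero_ages(ages)
print(imputed)
```

Note that the median is computed from the valid ages only. Including the zeros would drag the fill value down.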
Comments (17)
Dropping the age feature is not at all a good idea, as age plays a critical role in this disease according to the question.
Dropping over 10% of the data (450 of 4,000 records) is not a good idea either, considering that the number of observations is already low.
The mean or median is a potential solution.
But the question says the disease is known to worsen with age, so there should be a correlation between age and the other symptom-related features. That means that with unsupervised learning we can make a pretty good prediction of age.
So the answer is D: use k-means clustering.
👍 28 rajs 2021/10/02
Replacing the age with the mean or median might introduce bias into the dataset. Using k-means clustering to estimate the missing age from the other features might give better results. Removing 10% of the available data looks odd. Why not D?
👍 18 vetal 2021/09/19
Selected answer: D
The following is an excerpt from a Springer-Verlag book called Contemporary Computing.
"...We propose an efficient missing value imputation method based on clustering with weighted distance. We divide the data set into clusters based on user specified value K. Then find a complete valued neighbor which is nearest to the missing valued instance. Then we compute the missing value by taking the average of the centroid value and the centroidal distance of the neighbor. This value is used as impute value. In our proposed approach we use K-means technique with weighted distance and show that our approach results in better performance..."
👍 7 passionatecricketer 2021/11/21
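As a rough illustration of the clustering idea discussed in this thread, the sketch below clusters the complete records on their other features, assigns each record with a missing age to its nearest centroid, and fills in the mean age of that cluster. This is a simplified cluster-mean variant, not the weighted-distance method from the excerpt above. The function names, the two-feature layout, and the sample values are all hypothetical, and in practice the features would likely need scaling first.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        labels = [nearest(p) for p in points]  # assignment step
        for c in range(k):  # update step
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p) for p in points]  # final labels for the final centroids
    return centroids, labels


def impute_age_by_cluster(rows, k=2):
    """rows: list of (age, features); age == 0 is treated as missing."""
    complete = [(age, feats) for age, feats in rows if age > 0]
    centroids, labels = kmeans([feats for _, feats in complete], k)

    # Mean age of the complete records in each non-empty cluster.
    cluster_age = {}
    for c in range(k):
        ages = [age for (age, _), lab in zip(complete, labels) if lab == c]
        if ages:
            cluster_age[c] = sum(ages) / len(ages)

    result = []
    for age, feats in rows:
        if age == 0:
            best = min(cluster_age, key=lambda c: math.dist(feats, centroids[c]))
            age = cluster_age[best]
        result.append((age, feats))
    return result


# Hypothetical records: (age, (feature_1, feature_2)); two ages were entered as 0.
rows = [
    (66, (1.0, 1.0)),
    (85, (5.0, 5.1)),
    (0, (1.1, 1.0)),
    (68, (1.2, 0.9)),
    (87, (5.2, 4.9)),
    (0, (5.1, 5.0)),
]
filled = impute_age_by_cluster(rows, k=2)
print([age for age, _ in filled])
```

Whether this beats a plain median fill depends on how strongly the other features actually track age, which is worth checking on the complete records before relying on it.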