For this notebook, I use the IBM HR Analytics Employee Attrition dataset (available on Kaggle) to do the following:
- Compare the predictive performance of Logistic Regression, Random Forest, and XGBoost models using confusion matrices and ROC curves (first sketch below)
- Explore which features matter most for employee attrition in the XGBoost model using Gini importances and permutation importance (second sketch below)
- Further explore feature importance with visualization tools: partial dependence plots (PDPs), PDP interaction plots, and SHAP waterfall plots (third sketch below)
- Use insights from the feature importances to perform dimensionality reduction
- Rerun the models on the reduced feature matrix (final sketch below)
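
A minimal sketch of the model comparison, assuming the Kaggle CSV has been downloaded locally as `WA_Fn-UseC_-HR-Employee-Attrition.csv` (adjust the path as needed) and using default hyperparameters rather than the notebook's exact settings; ROC performance is summarized as ROC AUC here to keep the sketch short:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Load the data, encode the target, and one-hot encode the categorical columns.
df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True, dtype=int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each model, then compare confusion matrices and ROC AUC on the held-out set.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(confusion_matrix(y_test, model.predict(X_test)))
    print(f"ROC AUC: {roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]):.3f}\n")
```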
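
A sketch of the two importance views for XGBoost on the same assumed CSV. What the notebook calls Gini importances corresponds here to the booster's built-in `feature_importances_`; permutation importance is computed on the held-out set:

```python
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True, dtype=int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

xgb = XGBClassifier(eval_metric="logloss", random_state=42).fit(X_train, y_train)

# Built-in importances from the fitted booster.
builtin = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)
print(builtin.head(10))

# Permutation importance: shuffle one column at a time on the test set and
# record the drop in ROC AUC, which is less biased toward high-cardinality features.
perm = permutation_importance(
    xgb, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)
print(pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False).head(10))
```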
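
A sketch of the visualization step: `PartialDependenceDisplay.from_estimator` draws single-feature and two-feature (interaction) PDPs, and `shap.plots.waterfall` explains one individual prediction. The feature names passed here are illustrative, not necessarily the ones highlighted in the notebook:

```python
import matplotlib.pyplot as plt
import pandas as pd
import shap
from sklearn.inspection import PartialDependenceDisplay
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True, dtype=int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

xgb = XGBClassifier(eval_metric="logloss", random_state=42).fit(X_train, y_train)

# Two single-feature PDPs plus one two-feature interaction PDP (illustrative features).
PartialDependenceDisplay.from_estimator(
    xgb,
    X_test,
    features=["MonthlyIncome", "NumCompaniesWorked", ("MonthlyIncome", "NumCompaniesWorked")],
)
plt.show()

# SHAP waterfall plot for a single held-out employee.
explainer = shap.Explainer(xgb)
shap_values = explainer(X_test)
shap.plots.waterfall(shap_values[0])
```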
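
Finally, a sketch of the dimensionality-reduction rerun: keep only the columns derived from a few high-importance raw features (the list below is illustrative, not the notebook's exact selection) and refit on the reduced matrix. Only the logistic regression pipeline is shown; the other two models are refit the same way:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]), drop_first=True, dtype=int)

# Illustrative subset: keep the dummies derived from a few high-importance raw columns.
keep_prefixes = ("OverTime", "NumCompaniesWorked", "BusinessTravel", "MonthlyIncome", "YearsAtCompany")
X_small = X[[c for c in X.columns if c.startswith(keep_prefixes)]]

X_train, X_test, y_train, y_test = train_test_split(
    X_small, y, test_size=0.25, stratify=y, random_state=42
)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
print(f"Reduced-feature Logistic Regression ROC AUC: "
      f"{roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]):.3f}")
```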
Key findings:
- The three models performed similarly on the full feature matrix, with XGBoost slightly edging out the others
- Features that stood out included 'OverTime', 'NumCompaniesWorked', and 'BusinessTravel'
- After reducing dimensionality, the simpler Logistic Regression outperformed the other two models, which overfit the training data