Volume 252 | Applied and Computational Engineering

The rapid growth of big data and complex machine learning models (gradient-boosted trees and deep neural networks) has produced highly accurate but opaque "black-box" predictors across medicine, finance, and industry, making interpretability a central concern in data analysis. SHapley Additive exPlanations (SHAP), grounded in cooperative game theory, has become one of the most influential interpretability methods because it provides theoretically consistent feature attributions at both the local and global levels. This paper presents a systematic literature review of SHAP and its role in data analysis. It synthesizes SHAP's theoretical foundations, its main implementations (TreeSHAP, KernelSHAP, and DeepSHAP), its visualization toolkit, and its practical applications, and it reports a compact empirical study comparing the three explainers on a clinical dataset. The study finds that, although SHAP markedly improves transparency and decision support, open challenges remain in computational cost, the reliability of explanations under feature correlation, and consistency across methods. The significance of this work is twofold: theoretically, it organizes SHAP's variants and properties within a single coherent framework; practically, it offers data analysts a structured, evidence-based reference for selecting and applying SHAP appropriately, thereby supporting more transparent, reliable, and accountable model-driven decisions.
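
As a rough illustration of the game-theoretic idea behind SHAP mentioned in the abstract above, the following sketch computes exact Shapley values by enumerating every feature coalition. It is not taken from the paper: the toy model, the instance `x`, the all-zeros `baseline`, and the helper names `shapley_values` and `value_fn` are all hypothetical, and practical explainers such as TreeSHAP or KernelSHAP use far more efficient algorithms than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for a coalition S of this size: |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when it joins coalition S.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one interaction term (an assumption, not from the paper).
def model(z):
    return 2 * z[0] + 3 * z[1] + z[0] * z[2]


x = [1.0, 2.0, 4.0]          # instance to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]   # reference point standing in for "feature absent"


def value_fn(coalition):
    # Features in the coalition take their observed value; the rest take the baseline.
    z = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value_fn, len(x))
print("attributions:", phi)
print("sum of attributions:", sum(phi))
print("f(x) - f(baseline):", model(x) - model(baseline))
```

In this toy setting the attributions sum to the difference between the prediction and the baseline prediction, which is the additivity property that makes SHAP attributions straightforward to aggregate from local to global explanations.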

The assumptions behind parametric models widely used in estimation are often violated in practical research. As flexible alternatives, machine learning methods, especially tree ensembles and deep neural networks, impose fewer prior assumptions on functional forms for parameter estimation. This paper systematically examines eight estimation methods, namely ordinary least squares, ridge regression, lasso, random forest, XGBoost, LightGBM, multilayer perceptron (MLP), and deep neural networks, across four simulation regimes (linear, semiparametric, nonlinear, and high-dimensional sparse) and two real-world datasets: the Home Credit Default Risk dataset (307,511 samples) and the PIMA Indians Diabetes dataset (768 samples). The study evaluates model bias, mean squared error (MSE), and computational overhead to determine the scenarios in which each method category is applicable. Experimental results demonstrate that tree-based methods perform consistently across the scenarios considered. Although deep neural networks incur higher computational overhead, they achieve the lowest MSE under strong nonlinearity when sample sizes are moderate or large. These findings provide actionable guidance for selecting estimation methods based on data characteristics, bridging theoretical advances in machine learning and practical estimation. The results confirm that no single estimator outperforms all others across all data scenarios. The optimal choice depends on the joint effects of data nonlinearity, dimensionality, and sample size, which highlights the need for diagnosis-oriented method selection in empirical studies.
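
To make the bias and MSE criteria in the abstract above concrete, here is a minimal Monte Carlo sketch. It does not reproduce the paper's design: the single-regressor linear data-generating process, the true coefficient `beta`, the penalty `ridge_lambda`, the replication count, and the function name `simulate_bias_mse` are all assumptions chosen for illustration, and only two of the eight estimators (no-intercept OLS and ridge) are shown.

```python
import random
from statistics import mean


def simulate_bias_mse(beta=2.0, n=50, reps=500, ridge_lambda=10.0, seed=0):
    """Estimate bias and MSE of two slope estimators over repeated simulated samples."""
    rng = random.Random(seed)
    ols, ridge = [], []
    for _ in range(reps):
        # Assumed linear regime: y = beta * x + standard normal noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]
        sxy = sum(xi * yi for xi, yi in zip(x, y))
        sxx = sum(xi * xi for xi in x)
        ols.append(sxy / sxx)                      # no-intercept OLS slope
        ridge.append(sxy / (sxx + ridge_lambda))   # ridge shrinks the slope toward zero

    def summarize(estimates):
        bias = mean(estimates) - beta
        mse = mean((e - beta) ** 2 for e in estimates)
        return bias, mse

    return {"ols": summarize(ols), "ridge": summarize(ridge)}


results = simulate_bias_mse()
for name, (bias, mse) in results.items():
    print(f"{name}: bias={bias:+.4f}, mse={mse:.4f}")
```

Because MSE decomposes into squared bias plus variance, a loop of this kind shows whether an estimator's error comes mainly from systematic bias or from sampling variability, which is the sort of diagnosis the abstract argues should guide method selection.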

Articles in this Volume