Articles in this Volume

Research Article Open Access
A Review of the Application and Challenges of Shapley Additive Explanations in Data Analysis
The rapid growth of big data and of complex machine learning models, notably gradient-boosted trees and deep neural networks, has produced highly accurate but opaque "black-box" predictors across medicine, finance, and industry, making interpretability a central concern in data analysis. SHapley Additive exPlanations (SHAP), grounded in cooperative game theory, has become one of the most influential interpretability methods because it provides theoretically consistent feature attributions at both the local and global levels. This paper presents a systematic literature review of SHAP and its role in data analysis. It synthesizes SHAP's theoretical foundations, its main implementations (TreeSHAP, KernelSHAP, and DeepSHAP), its visualization toolkit, and its practical applications, and it reports a compact empirical study comparing the three explainers on a clinical dataset. The study finds that, although SHAP markedly improves transparency and decision support, open challenges remain in computational cost, in the reliability of explanations under feature correlation, and in consistency across methods. The significance of this work is twofold: theoretically, it organizes SHAP's variants and properties within a single coherent framework; practically, it offers data analysts a structured, evidence-based reference for selecting and applying SHAP appropriately, thereby supporting more transparent, reliable, and accountable model-driven decisions.
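The additive attribution idea that the abstract refers to can be illustrated with a minimal sketch. It is not the paper's code and does not implement TreeSHAP, KernelSHAP, or DeepSHAP: it computes exact Shapley values by enumerating every feature coalition, which is only feasible for a handful of features. The names `shapley_values` and `toy_model`, the all-zero baseline, and the choice to replace absent features with baseline values are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attributions by enumerating all feature coalitions.

    Features outside a coalition are set to their baseline value, which is
    one common (assumed) way to define the value of a coalition.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical model with an interaction term; not from the reviewed paper.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

# Additivity: the attributions sum to f(x) - f(baseline).
gap = toy_model(x) - toy_model(baseline)
print(phi, sum(phi), gap)
```

In this toy case the interaction term is split equally between the two interacting features, and the attributions sum to the difference between the prediction and the baseline prediction. The enumeration cost grows exponentially with the number of features, which is one way to see why computational cost appears among the open challenges listed above.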
Research Article Open Access
Machine Learning for Estimation: Comparing Tree Ensembles and Deep Learning on Tabular Data
Parametric models widely used in estimation rest on assumptions that are often violated in practical research. As flexible alternatives, machine learning methods, especially tree ensembles and deep neural networks, impose fewer prior assumptions on functional form in parameter estimation. This paper systematically examines eight estimation methods (ordinary least squares, ridge regression, lasso, random forest, XGBoost, LightGBM, multilayer perceptron (MLP), and deep neural networks) across four simulation regimes (linear, semiparametric, nonlinear, and high-dimensional sparse) and two real-world datasets: the Home Credit Default Risk dataset (307,511 samples) and the PIMA Indians Diabetes dataset (768 samples). The study evaluates bias, mean squared error (MSE), and computational overhead to determine the scenarios in which each category of method is applicable. Experimental results show that tree-based methods perform consistently across the scenarios considered. Although deep neural networks incur higher computational overhead, they achieve the lowest MSE under strong nonlinearity with moderate or large sample sizes. The results confirm that no single estimator outperforms all others across all data scenarios. The optimal choice depends on the joint effects of data nonlinearity, dimensionality, and sample size, which highlights the need for diagnosis-oriented method selection in empirical studies. These findings provide actionable guidance for selecting estimation methods based on data characteristics, bridging theoretical advances in machine learning and practical estimation.
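The bias and MSE criteria named in the abstract can be sketched as a small Monte Carlo experiment. This is an illustrative assumption about how such an evaluation might look, not the paper's protocol: it uses a one-parameter linear regime, compares an unpenalized least-squares slope with a ridge-shrunken slope, and relies only on the standard library. All function names, the penalty value, the sample size, and the seed are hypothetical.

```python
import random
from statistics import fmean


def simulate(n, beta, rng):
    # Linear regime: y = beta * x + standard normal noise.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [beta * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    # Least-squares slope for a model without intercept.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def ridge_slope(xs, ys, lam=10.0):
    # Same estimator with an L2 penalty, which shrinks the slope toward zero.
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def bias_and_mse(estimator, beta, n, reps, seed):
    rng = random.Random(seed)
    estimates = [estimator(*simulate(n, beta, rng)) for _ in range(reps)]
    bias = fmean(estimates) - beta
    mse = fmean([(e - beta) ** 2 for e in estimates])
    return bias, mse


true_beta = 2.0
bias_ols, mse_ols = bias_and_mse(ols_slope, true_beta, n=50, reps=2000, seed=0)
bias_ridge, mse_ridge = bias_and_mse(ridge_slope, true_beta, n=50, reps=2000, seed=0)
print(f"OLS   bias={bias_ols:+.4f}  MSE={mse_ols:.4f}")
print(f"ridge bias={bias_ridge:+.4f}  MSE={mse_ridge:.4f}")
```

With a correctly specified linear model and a heavy penalty, the shrunken estimator shows a clear negative bias, while the least-squares slope is close to unbiased. Repeating the same loop over other data-generating regimes and estimators is the general shape of the comparison the abstract describes.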