On-Policy Vs. Off-Policy Reinforcement Learning in ConnectX: Seat-Stratified Performance and the Role of Action Masking

Jiayi Lai

doi:10.54254/2755-2721/2026.TJ29489

Article Information

Received: 17 October 2025

Published: 11 November 2025

DOI: https://doi.org/10.54254/2755-2721/2026.TJ29489

Article Type: Research Article

Cite this article

Lai, J. (2025). On-Policy Vs. Off-Policy Reinforcement Learning in ConnectX: Seat-Stratified Performance and the Role of Action Masking. Applied and Computational Engineering, 203, 160-170.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About this Volume

Volume Title: ACE Vol.203

Part of Series: Applied and Computational Engineering

ISSN: 2755-2721 (Print) / 2755-273X (Online)
