Article

Modified Index Policies for Multi-Armed Bandits with Network-like Markovian Dependencies

Department of Computer and Information Science, Temple University, Philadelphia, PA 19122, USA
* Author to whom correspondence should be addressed.
Submission received: 29 October 2024 / Revised: 15 January 2025 / Accepted: 26 January 2025 / Published: 29 January 2025

Abstract

Sequential decision-making in dynamic and interconnected environments is a cornerstone of numerous applications, ranging from communication networks and finance to distributed blockchain systems and IoT frameworks. The multi-armed bandit (MAB) problem is a fundamental model in this domain, but it traditionally assumes independent and identically distributed (i.i.d.) rewards, which limits its effectiveness in capturing the inherent dependencies and state dynamics present in some real-world scenarios. In this paper, we lay out a theoretical framework for a modified MAB model in which each arm’s reward is generated by a hidden Markov process. In our model, each arm undergoes Markov state transitions independently of play, resulting in varying reward distributions and heightened uncertainty in reward observations. Each arm can have up to three hidden states. A key challenge arises from the fact that the underlying states governing each arm’s rewards remain hidden at the time of selection. To address this, we adapt traditional index-based policies and develop a modified index approach tailored to accommodate Markovian transitions and to improve arm-selection efficiency in our model. Our proposed Markovian Upper Confidence Bound (MC-UCB) policy achieves logarithmic regret. Comparative analysis with the classical UCB algorithm shows that MC-UCB consistently achieves approximately a 15% reduction in cumulative regret. This work provides significant theoretical insights and lays a robust foundation for future research aimed at optimizing decision-making processes in complex, networked systems with hidden state dependencies.
Keywords: dynamic distributions; learning theory; Markov chain; multi-armed bandit
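The abstract does not give the exact form of the MC-UCB index, so the following minimal Python sketch only illustrates the setting it describes: each arm's hidden state follows a Markov chain with up to three states that transitions on every round regardless of play, rewards are drawn from the current hidden state's distribution, and an index policy chooses arms from observed rewards alone. The classical UCB1 index is used here as a stand-in for MC-UCB; all names (MarkovArm, ucb_index), distributions, and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class MarkovArm:
    """Arm whose hidden state follows a Markov chain that transitions
    on every round, whether or not the arm is played (restless)."""
    def __init__(self, P, means):
        self.P = np.asarray(P)          # state-transition matrix (up to 3x3)
        self.means = np.asarray(means)  # mean reward of each hidden state
        self.state = 0

    def step(self):
        # Hidden state evolves independently of play.
        self.state = rng.choice(len(self.means), p=self.P[self.state])

    def pull(self):
        # Reward drawn from the current (unobserved) state's distribution.
        return rng.normal(self.means[self.state], 0.1)

def ucb_index(mean, n_pulls, t, c=2.0):
    # Classical UCB1-style index; MC-UCB modifies this exploration term
    # to account for the Markovian state dynamics.
    return mean + np.sqrt(c * np.log(t) / n_pulls)

# Two illustrative arms with three and two hidden states.
arms = [MarkovArm([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
                  [0.2, 0.5, 0.8]),
        MarkovArm([[0.7, 0.3], [0.3, 0.7]], [0.4, 0.6])]

T = 5000
counts = np.zeros(len(arms))
sums = np.zeros(len(arms))

for t in range(1, T + 1):
    for arm in arms:
        arm.step()                      # every chain evolves each round
    if t <= len(arms):
        k = t - 1                       # play each arm once to initialize
    else:
        k = int(np.argmax([ucb_index(sums[i] / counts[i], counts[i], t)
                           for i in range(len(arms))]))
    r = arms[k].pull()
    counts[k] += 1
    sums[k] += r

print("pull counts:", counts, "empirical means:", sums / counts)
```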

Share and Cite

Sawwan, A.; Wu, J. Modified Index Policies for Multi-Armed Bandits with Network-like Markovian Dependencies. Network 2025, 5, 3. https://doi.org/10.3390/network5010003

