Let A be a known set of actions, and let Ra be the distribution of rewards given action a. At each timestep t, the agent selects an action a in A and receives a reward Rt ~ Ra. The goal is to maximize the cumulative reward, i.e. the sum of the rewards Rt over all timesteps.
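The setup above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify the form of Ra, so the sketch assumes each Ra is a Gaussian with unit variance, and the names `Bandit`, `pull`, `run`, and `uniform_policy` are hypothetical, chosen here for the example. The policy shown picks actions uniformly at random; it is a placeholder, not a method for maximizing reward.

```python
import random


class Bandit:
    """Hypothetical bandit: action a has reward distribution R_a.

    Assumption: R_a is Gaussian with mean means[a] and standard deviation 1.
    """

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    @property
    def actions(self):
        # The known action set A, represented as indices 0..len(means)-1.
        return range(len(self.means))

    def pull(self, a):
        # Sample a reward R_t ~ R_a for the chosen action a.
        return self.rng.gauss(self.means[a], 1.0)


def run(bandit, policy, steps):
    """Interact for `steps` timesteps and return (cumulative reward, history)."""
    total = 0.0
    history = []  # list of (action, reward) pairs, one per timestep
    for t in range(steps):
        a = policy(t, history)   # agent selects an action
        r = bandit.pull(a)       # environment returns a reward
        history.append((a, r))
        total += r               # accumulate the quantity to be maximized
    return total, history


def uniform_policy(rng, n_actions):
    """Placeholder policy: ignore the history and pick an action at random."""
    def policy(t, history):
        return rng.randrange(n_actions)
    return policy


bandit = Bandit([0.1, 0.5, 0.9])
total, history = run(bandit, uniform_policy(random.Random(1), 3), 1000)
print(f"cumulative reward after {len(history)} steps: {total:.2f}")
```

The policy receives the full history of (action, reward) pairs, so a better strategy than uniform sampling can be swapped in without changing `run`.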