If (YOU == MY MOM): read / Else: skip to Markov Land
It’s Thursday, and I just spent the past three days watching lectures by a brilliant mind named David Silver (the man behind AlphaGo), scratching the surface of what is called Reinforcement Learning (RL). You’re going to be hearing a lot of these (insert word) Learning words as we go on: Machine Learning, Deep Learning, Reinforcement Learning, etc. You’ll get to see that machines “learn” in ways that are sometimes similar to and often quite different from how we humans learn things. I put “learn” in quotations because I think there is a much more interesting philosophical meaning behind what constitutes learning versus simple memorization or pattern recognition. More on that some other time…
Today’s post, which as the title suggests, is going to be about how to make an agent play the game rock paper scissors (RPS). In fact, I’m not going to motivate this discussion with how powerful this RL stuff is because I think plenty of far-more-qualified people talk about it all the time, and two, let’s be honest, you’re reading this because you want to be sure I’m actually working on something and not just bathing in the California sun. Occasionally, mom, I do.
The link to Reinforcement Learning that you probably didn’t click up there is to a Wikipedia page that basically says RL is the study of trying to make computers (software agents) know what to do (take an action) in an environment (state) with the goal of doing well (maximizing rewards). What does that mean, exactly? Well, I probably told you about one of my professors at Harvard–Sasha Rush–who frequently uses the phrase, “let’s be more explicit here”. The word used to mean something very different for me. But alas, I think he has a point, so let me do just that–be explicit.
There was this Russian genius mathematician called Andrei Markov (1856-1922) and I knew neither his first name nor his life story, but one of the things that he came up with was the idea of a Markovian Property, which says that the conditional probability distribution of future states are only dependent upon the present state and not on any past states, or as I like to think of it, everything I need to know, I know. This is a powerful assumption! Imagine if I was sprawled across on the couch and you asked me if I ate a late night snack. I would say no, and you might say, ok. In a world that is non-Markovian, you would have no way of knowing if I was lying or telling the truth. You would need more information about what I was doing 5, 10, 20 minutes ago to know if I was telling the truth. But in a Markovian world, where you presumably know everything you need to know to make your next move (scold me for lying or come and join me in watching Mad Men), you would be able to observe that ramen noodle sitting on my shirt. Without needing to look back into time, you were given all the information you needed from my present state. That is more or less what people mean when they say Markov Property.
We’re going to examine a few more words with Markov in it. Markov Process refers to a stochastic process that has the Markov Property. Stochastic is a fancy word for saying random (e.g. variables can be random, processes can be stochastic). So basically given a state in the environment, say I’m lying in bed, claiming I’m meditating, there is a non-deterministic/random process of me falling asleep, just staying awake, and actually meditating. These can be represented with probabilities: 95%, 4%, and 1%, respectively. Naturally, they can be written as <S, P>, where S stands for all possible states and P stands for the transition probability, which can be thought of as a table where the rows have start states, columns denoting end states, and each entry being a probability.
Markov Reward Process (MRP)
I know it’s getting dry, but this is the perfect place to introduce an incentive or reward, right? Markov Reward Process adds to the Markov Process two things: the reward function, R, and a discount factor, γ (read gamma, it’s Greek). Say I’m lying in bed, and I fall asleep. That makes me happy, so I get a reward of 1. If I just stay there awake drawing a blank, I’m neither here nor there, so I get a reward of 0. If I actually meditate, at first it will probably make me think, “I could be sleeping…” so I’d get a reward of -2 right now. But say 5 minutes into it, I feel pretty good, getting a reward of +1, and once I hit that zen mode 10 minutes later, I get a reward of +5! R is the function that describes these rewards for each transition in state. Then what is γ? It’s a number between 0 and 1 (inclusive) that tells me how myopic vs. far-sighted I want to be. If it’s 0, it means I only care about immediate rewards, and if it’s 1, it means I weigh all future rewards the same. Let’s draw this out.
Each of the round circles denote a state S, the reds show the immediate reward R I get by being blown a certain way by the magical wind of the transition probabilities P, which are the black numbers. The blue is me calculating the total reward each state promises. Falling asleep and drawing a blank are straightforward: 1 and 0 immediate rewards. The “Ughhh” state is a combination of the -2 immediate reward and discounted +1 and +5 rewards. Notice the farther we go into the future, it often entails higher uncertainty, so we discounted it more (if there was super-duper zen after full-zen, it would be discounted by gamma cubed). If γ is 0, I don’t care about the future, so I just view “Ughhh” as being -2 in value. If I consider all futures however far away they are all the same, you find me getting a +4 in value! Realistically, I may discount a bit, say by 70% (0.7), which yields +1.15. The MRP <S, P, R, γ> defines this whole world dynamic.
Markov Decision Process (MDP)
So far, I let the wind billow, and off I went to these different states with some probability. But that’s not satisfying now, is it? We are hoomans! Instruments of change, our agency defining us! We must take action! That brings us to the final part of this Markov journey: Markov Decision Processes. Thankfully, we already did most of the work! MDPs simply add a layer of action on top of MRPs. That’s right, we just add an A for action, and we’re done. The MDP <S, A, P, R, γ> completely defines a world where now I can take actions.
The blue arrows and circles now show the action I am able to take, while the red denotes reward and the rest of the black arrows and circles denote the transition probabilities and states as before. Notice how even when I take a particular action, such as closing my eyes, the environment (my sweet, sweet bed) may tip me over to falling asleep 80% of the time. But sometimes, like when I focus in meditating, I am 100% surely going to reach zen. At least that’s how this Markovian world works
Phew, ok writing these is a lot more engaging than I thought! But also quite fun 🙂 So now you know what it means for a world to have a Markov Property, and you built on that the Markov Process, Markov Reward Process (MRP), and Markov Decision Process (MDP). A cool thing to note is that given a particular policy (we’ll write it as π, Greek letter, read pie), any MDP can be written as an MRP! Which makes sense, because if you have a policy (strategy) to act a certain way whenever you’re at a particular state, then it’s the same as skipping that extra blue bubble and going right ahead to the next state.
I intended to cover a lot more ground, but I think this is quite enough for one sitting. Next time, I’ll talk about another famous guy named Bellman, about action-value and state-value (they mean pretty much what they sound like), and see how these concepts work toward an idea called Q-Learning, which is what I used to code up an agent that is able to learn how to play RPS from scratch and find an optimal strategy.
That’s it for today!