Mom Learns Tech (1): Rock Paper Scissors pt. 1 Markov

If (YOU == MY MOM): read / Else: skip to Markov Land

It’s Thursday, and I just spent the past three days watching lectures by a brilliant mind named David Silver (the man behind AlphaGo), scratching the surface of what is called Reinforcement Learning (RL). You’re going to be hearing a lot of these (insert word) Learning words as we go on: Machine Learning, Deep Learning, Reinforcement Learning, etc. You’ll get to see that machines “learn” in ways that are sometimes similar to and often quite different from how we humans learn things. I put “learn” in quotations because I think there is a much more interesting philosophical meaning behind what constitutes learning versus simple memorization or pattern recognition. More on that some other time…

Today’s post, as the title suggests, is going to be about how to make an agent play the game rock paper scissors (RPS). I’m not going to motivate this discussion with how powerful this RL stuff is because, one, plenty of far-more-qualified people talk about it all the time, and two, let’s be honest, you’re reading this because you want to be sure I’m actually working on something and not just bathing in the California sun. Occasionally, mom, I do.

Markov Land

The link to Reinforcement Learning that you probably didn’t click up there is to a Wikipedia page that basically says RL is the study of trying to make computers (software agents) know what to do (take an action) in an environment (state) with the goal of doing well (maximizing rewards). What does that mean, exactly? Well, I probably told you about one of my professors at Harvard, Sasha Rush, who frequently uses the phrase, “let’s be more explicit here”. That word used to mean something very different to me. But alas, I think he has a point, so let me do just that: be explicit.

Markov Property

There was this Russian genius mathematician called Andrei Markov (1856-1922), and I knew neither his first name nor his life story, but one of the things he came up with is the idea of the Markov Property, which says that the conditional probability distribution of future states depends only on the present state and not on any past states, or as I like to think of it: everything I need to know, I know. This is a powerful assumption! Imagine I was sprawled across the couch and you asked me if I ate a late-night snack. I would say no, and you might say, ok. In a non-Markovian world, you would have no way of knowing if I was lying or telling the truth. You would need more information about what I was doing 5, 10, 20 minutes ago to know if I was telling the truth. But in a Markovian world, where you presumably know everything you need to know to make your next move (scold me for lying, or come join me in watching Mad Men), you would be able to observe that ramen noodle sitting on my shirt. Without needing to look back in time, you were given all the information you needed from my present state. That is more or less what people mean when they say Markov Property.

Markov Process

We’re going to examine a few more terms with Markov in them. A Markov Process is a stochastic process that has the Markov Property. Stochastic is a fancy word for saying random (e.g. variables can be random, processes can be stochastic). So basically given a state in the environment, say I’m lying in bed, claiming I’m meditating, there is a non-deterministic/random process of me falling asleep, just staying awake, or actually meditating. These can be represented with probabilities: 95%, 4%, and 1%, respectively. Naturally, a Markov Process can be written as <S, P>, where S stands for all possible states and P stands for the transition probability, which can be thought of as a table where rows denote start states, columns denote end states, and each entry is a probability.
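If you like seeing things in code (humor me, mom), here’s a tiny Python sketch of that <S, P> world. The state names and probabilities are just the made-up ones from my bed example, nothing official:

```python
import random

# P: the transition table for my "lying in bed" world.
# Rows are start states; each row maps end states to probabilities (summing to 1).
P = {
    "in_bed": {"asleep": 0.95, "awake": 0.04, "meditating": 0.01},
}

def step(state):
    """Let the wind blow: sample the next state from P's row for `state`."""
    row = P[state]
    return random.choices(list(row), weights=list(row.values()))[0]

print(step("in_bed"))  # most of the time: asleep
```

That one `step` function is all a Markov Process needs: where you land next depends only on where you are now, exactly the Markov Property from above.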

Markov Reward Process (MRP)

I know it’s getting dry, but this is the perfect place to introduce an incentive or reward, right? A Markov Reward Process adds two things to the Markov Process: a reward function, R, and a discount factor, γ (read “gamma”, it’s Greek). Say I’m lying in bed, and I fall asleep. That makes me happy, so I get a reward of 1. If I just stay there awake drawing a blank, I’m neither here nor there, so I get a reward of 0. If I actually meditate, at first it will probably make me think, “I could be sleeping…”, so I’d get a reward of -2 right now. But say 5 minutes into it, I feel pretty good, getting a reward of +1, and once I hit that zen mode 10 minutes later, I get a reward of +5! R is the function that describes these rewards for each transition in state. Then what is γ? It’s a number between 0 and 1 (inclusive) that tells me how myopic vs. far-sighted I want to be. If it’s 0, it means I only care about immediate rewards, and if it’s 1, it means I weigh all future rewards the same. Let’s draw this out.


Each of the round circles denotes a state S, the reds show the immediate reward R I get by being blown a certain way by the magical wind of the transition probabilities P, which are the black numbers. The blue is me calculating the total reward each state promises. Falling asleep and drawing a blank are straightforward: immediate rewards of 1 and 0. The “Ughhh” state is a combination of the -2 immediate reward and the discounted +1 and +5 rewards. Notice that the farther we go into the future, the higher the uncertainty, so we discount it more (if there were a super-duper zen after full zen, it would be discounted by gamma cubed). If γ is 0, I don’t care about the future, so I just view “Ughhh” as being -2 in value. If I weigh all futures the same, however far away they are, you find me getting a +4 in value! Realistically, I may discount a bit, say with γ = 0.7, which yields +1.15. The MRP <S, P, R, γ> defines this whole world dynamic.
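Those blue numbers are just a discounted sum, which is short enough to check in Python. A quick sketch with the made-up rewards from the picture (-2 now, +1 in 5 minutes, +5 at zen):

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward t steps into the future by gamma**t."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

rewards = [-2, 1, 5]                     # "Ughhh" now, feeling good, full zen
print(discounted_return(rewards, 0.0))   # -2.0: myopic me, only now matters
print(discounted_return(rewards, 1.0))   # 4.0: all futures weigh the same
print(discounted_return(rewards, 0.7))   # ~1.15: a realistic middle ground
```

Same three numbers as the picture: -2, +4, and +1.15 depending on how far-sighted γ makes me.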

Markov Decision Process (MDP)

So far, I let the wind billow, and off I went to these different states with some probability. But that’s not satisfying now, is it? We are hoomans! Instruments of change, our agency defining us! We must take action! That brings us to the final part of this Markov journey: Markov Decision Processes. Thankfully, we already did most of the work! MDPs simply add a layer of action on top of MRPs. That’s right, we just add an A for action, and we’re done. The MDP <S, A, P, R, γ> completely defines a world where I can now take actions.


The blue arrows and circles now show the actions I am able to take, while the red denotes reward and the rest of the black arrows and circles denote the transition probabilities and states as before. Notice how even when I take a particular action, such as closing my eyes, the environment (my sweet, sweet bed) may tip me over into falling asleep only 80% of the time. But sometimes, like when I focus on meditating, I am 100% surely going to reach zen. At least that’s how this Markovian world works :-/
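Here’s how that picture might look as data, again in Python. The action names and the 80%/100% numbers are the toy ones from my example; the key change from before is that P now takes both a state and my action:

```python
import random

# MDP <S, A, P, R, gamma> as nested dicts (toy numbers from the picture).
# P[state][action] maps next states to probabilities.
P = {
    "in_bed": {
        "close_eyes": {"asleep": 0.8, "awake": 0.2},
        "focus":      {"zen": 1.0},  # in this Markovian world, focusing always works
    },
}
R = {"asleep": 1, "awake": 0, "zen": 5}  # reward for landing in each state
gamma = 0.7

def step(state, action):
    """I pick the action; the environment (my bed) picks where I land."""
    dist = P[state][action]
    next_state = random.choices(list(dist), weights=list(dist.values()))[0]
    return next_state, R[next_state]

print(step("in_bed", "focus"))  # always ("zen", 5)
```

Compare this `step` with the Markov Process one: the only difference is the extra `action` argument, which is exactly the A we added.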


Phew, ok, writing these is a lot more engaging than I thought! But also quite fun 🙂 So now you know what it means for a world to have the Markov Property, and you built on that the Markov Process, the Markov Reward Process (MRP), and the Markov Decision Process (MDP). A cool thing to note is that given a particular policy (we’ll write it as π, a Greek letter, read “pie”), any MDP can be written as an MRP! Which makes sense, because if you have a policy (strategy) to act a certain way whenever you’re at a particular state, then it’s the same as skipping that extra blue bubble and going right ahead to the next state.
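That collapse is just a policy-weighted average: the MRP’s transition probability is P_π(s′|s) = Σ_a π(a|s) · P(s′|s,a). A quick Python sketch using the same toy bed example, with a 50/50 coin-flip policy I just made up:

```python
def mrp_transitions(P, policy):
    """Collapse an MDP's P into an MRP's P by averaging over the policy."""
    P_pi = {}
    for s, actions in P.items():
        row = {}
        for a, dist in actions.items():
            for s2, p in dist.items():
                row[s2] = row.get(s2, 0.0) + policy[s][a] * p
        P_pi[s] = row
    return P_pi

P = {"in_bed": {"close_eyes": {"asleep": 0.8, "awake": 0.2},
                "focus":      {"zen": 1.0}}}
policy = {"in_bed": {"close_eyes": 0.5, "focus": 0.5}}  # flip a coin each time

print(mrp_transitions(P, policy))
# roughly: in_bed -> asleep 0.4, awake 0.1, zen 0.5
```

With the blue action bubbles averaged away, the wind blows me straight from state to state again, which is exactly an MRP.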

I intended to cover a lot more ground, but I think this is quite enough for one sitting. Next time, I’ll talk about another famous guy named Bellman, about action-values and state-values (they mean pretty much what they sound like), and we’ll see how these concepts build toward an idea called Q-Learning, which is what I used to code up an agent that learns how to play RPS from scratch and finds an optimal strategy.

That’s it for today!

Love you

Mom Learns Tech (0): The Intro

I spent 2 years of my life digging foxholes and writing the occasional ‘fun-sized’ algorithm on pieces of scrap paper. These pieces of paper were soggy yet somehow simultaneously crusty, so I either rolled them up or folded them into a ziplock so they wouldn’t get wet. After all that, I knew I wanted a proper academic challenge. But coming from a military setting back to school was tough. Learning about statistics from Joe Blitzstein and machine learning from Sasha Rush was like going through growing pains all over again (unfortunately, I never really saw any growth). I began shoving concepts and ideas into my brain, and by the end of the year, I was able to do a thing or two. Not perfectly, but reasonably well.

Being Korean and having lived overseas for half my life, I established a semi-regular routine of calling my parents who live in Seoul.  The conversations are mostly about how well I’m holding up academically, how horribly I’m screwing up financially, and sometimes how I’m no longer the social butterfly I used to be. And vitamins. Gotta take those vitamins.

But in the end, unlike my years in high school, the scope of our conversations became narrower each year. At the heart of it was my inability to communicate what I was spending most of my time doing: studying computer science.

I’m not an AI fanatic. But I do think that harnessed properly, there are some very interesting problems that AI-backed methods allow us to solve. I grew up in a family where everyone viewed the world by weighing the value of things on judiciary notes and the constitution, never marveling at Moore’s Law or the latest GPU liquid cooler. Still, I have a passion for languages and how they encapsulate history, culture, and a latent meaning that today’s computational power may help us rediscover. I am fascinated by the way people interact via text, on the phone, by body language, and believe that there is a way to decode the way that we humans continue to create and recreate social behaviors.

Most importantly, I love my mom. I think she is the most wonderful woman in this world (with some fatally cute flaws), and I felt ridiculous needing to fall silent whenever I wanted to explain to her why I decided to use TD instead of MC for a particular RL task, or why I’m starting to think Deep Learning is a hoax and we should all just use XGBoost.

So I decided, once a week or so, I’ll write a post and share it with my mom who has close to no ML background, some statistics background, and a lot of life. Will I be clear and correct in explaining everything? Absolutely not. If I were, I’d be writing a textbook, not a blog post for me and my mom (and the occasional reader). But any constructive criticism is welcome, and I hope to keep this up for enough weeks that by the time I’m back home for Christmas, she and I can open a bottle of wine and talk about why Attention Is Not All You Need.

P.S. Love you dad! But let’s be honest, you’re not really into this ‘stuff’ 🙂
