Mom Learns Tech (1): Rock Paper Scissors pt. 1 Markov

If (YOU == MY MOM): read / Else: skip to Markov Land

It’s Thursday, and I just spent the past three days watching lectures by a brilliant mind named David Silver (the man behind AlphaGo), scratching the surface of what is called Reinforcement Learning (RL). You’re going to be hearing a lot of these (insert word) Learning terms as we go on: Machine Learning, Deep Learning, Reinforcement Learning, etc. You’ll get to see that machines “learn” in ways that are sometimes similar to and often quite different from how we humans learn things. I put “learn” in quotation marks because I think there is a much more interesting philosophical question behind what constitutes learning versus simple memorization or pattern recognition. More on that some other time…

Today’s post, as the title suggests, is going to be about how to make an agent play the game rock paper scissors (RPS). I’m not going to motivate this discussion with how powerful this RL stuff is because, one, I think plenty of far-more-qualified people talk about it all the time, and two, let’s be honest, you’re reading this because you want to be sure I’m actually working on something and not just bathing in the California sun. Occasionally, mom, I do.

Markov Land

The link to Reinforcement Learning that you probably didn’t click up there is to a Wikipedia page that basically says RL is the study of trying to make computers (software agents) know what to do (take an action) in an environment (state) with the goal of doing well (maximizing rewards). What does that mean, exactly? Well, I probably told you about one of my professors at Harvard–Sasha Rush–who frequently uses the phrase, “let’s be more explicit here”. The word used to mean something very different for me. But alas, I think he has a point, so let me do just that–be explicit.

Markov Property

There was this Russian genius mathematician called Andrei Markov (1856-1922), and I knew neither his first name nor his life story, but one of the things that he came up with was the idea of the Markov Property, which says that the conditional probability distribution of future states depends only on the present state and not on any past states, or as I like to think of it, everything I need to know, I know. This is a powerful assumption! Imagine I was sprawled across the couch and you asked me if I ate a late night snack. I would say no, and you might say, ok. In a world that is non-Markovian, you would have no way of knowing from the present alone if I was lying or telling the truth. You would need more information about what I was doing 5, 10, 20 minutes ago. But in a Markovian world, where you presumably know everything you need to know to make your next move (scold me for lying or come and join me in watching Mad Men), you would be able to observe that ramen noodle sitting on my shirt. Without needing to look back in time, you are given all the information you need from my present state. That is more or less what people mean when they say Markov Property.

Markov Process

We’re going to examine a few more words with Markov in them. Markov Process refers to a stochastic process that has the Markov Property. Stochastic is a fancy word for random (e.g. variables can be random, processes can be stochastic). So basically, given a state in the environment, say I’m lying in bed, claiming I’m meditating, there is a non-deterministic/random process that ends with me falling asleep, just staying awake, or actually meditating. These can be represented with probabilities: 95%, 4%, and 1%, respectively. A Markov Process can be written as <S, P>, where S stands for the set of all possible states and P stands for the transition probabilities, which can be thought of as a table where the rows are start states, the columns are end states, and each entry is the probability of going from one to the other.
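Here is a minimal sketch of that <S, P> table in Python, in case seeing it as code helps. Only the 95%, 4%, and 1% row comes from the example above; the state names, the `step` function, and the rule that the other states simply stay put are assumptions made up for illustration.

```python
import random

# States S. The names are illustrative labels for the bed example.
STATES = ["lying in bed", "asleep", "awake", "meditating"]

# Transition table P: rows are start states, columns are end states.
# Only the "lying in bed" row comes from the post; the other rows are
# an assumption (each of those states just stays where it is).
P = {
    "lying in bed": {"asleep": 0.95, "awake": 0.04, "meditating": 0.01},
    "asleep": {"asleep": 1.0},
    "awake": {"awake": 1.0},
    "meditating": {"meditating": 1.0},
}


def step(state, rng=random):
    """Sample the next state using only the current state (Markov Property)."""
    next_states = list(P[state])
    weights = [P[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


# Every row of P must sum to 1, since something always happens next.
for start, row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, start

if __name__ == "__main__":
    print(step("lying in bed"))  # most of the time this prints "asleep"
```

Notice that `step` never asks where I was before: the current state is all it gets, which is the Markov Property in action.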

Markov Reward Process (MRP)

I know it’s getting dry, but this is the perfect place to introduce an incentive or reward, right? Markov Reward Process adds to the Markov Process two things: the reward function, R, and a discount factor, γ (read gamma, it’s Greek). Say I’m lying in bed, and I fall asleep. That makes me happy, so I get a reward of 1. If I just stay there awake drawing a blank, I’m neither here nor there, so I get a reward of 0. If I actually meditate, at first it will probably make me think, “I could be sleeping…” so I’d get a reward of -2 right now. But say 5 minutes into it, I feel pretty good, getting a reward of +1, and once I hit that zen mode 10 minutes later, I get a reward of +5! R is the function that describes these rewards for each transition in state. Then what is γ? It’s a number between 0 and 1 (inclusive) that tells me how myopic vs. far-sighted I want to be. If it’s 0, it means I only care about immediate rewards, and if it’s 1, it means I weigh all future rewards the same. Let’s draw this out.

Each of the circles denotes a state S, the red numbers show the immediate reward R I get by being blown a certain way by the magical wind of the transition probabilities P, which are the black numbers. The blue is me calculating the total reward each state promises. Falling asleep and drawing a blank are straightforward: 1 and 0 immediate rewards. The “Ughhh” state is a combination of the -2 immediate reward and the discounted +1 and +5 rewards. Notice that the farther we go into the future, the more uncertain things usually get, so we discount those rewards more: the +1 is multiplied by γ and the +5 by γ squared (if there were a super-duper zen after full zen, it would be discounted by γ cubed). If γ is 0, I don’t care about the future, so I just view “Ughhh” as being -2 in value. If I treat all futures the same however far away they are (γ is 1), you find me getting a +4 in value! Realistically, I may discount a bit, say with γ set to 0.7, which yields -2 + 0.7 × 1 + 0.49 × 5 = +1.15. The MRP <S, P, R, γ> defines this whole world dynamic.
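The “Ughhh” arithmetic can be checked with a few lines of Python. This is only a sketch: it assumes the rewards arrive one after another as -2, then +1, then +5, and the function name `discounted_return` is my own label, not something from the lectures.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward k steps in the future by gamma**k."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


# Rewards along the meditation path: now, 5 minutes in, full zen.
ughhh_rewards = [-2, 1, 5]

for gamma in (0.0, 0.7, 1.0):
    value = round(discounted_return(ughhh_rewards, gamma), 2)
    print(f"gamma={gamma}: value of Ughhh = {value}")
# gamma=0.0: value of Ughhh = -2.0
# gamma=0.7: value of Ughhh = 1.15
# gamma=1.0: value of Ughhh = 4.0
```

Sliding γ between 0 and 1 is a quick way to see how being myopic or far-sighted changes what a state is worth.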

Markov Decision Process (MDP)

So far, I let the wind billow, and off I went to these different states with some probability. But that’s not satisfying now, is it? We are hoomans! Instruments of change, our agency defining us! We must take action! That brings us to the final part of this Markov journey: Markov Decision Processes. Thankfully, we already did most of the work! MDPs simply add a layer of action on top of MRPs. That’s right, we just add an A for action, and we’re done. The MDP <S, A, P, R, γ> completely defines a world where now I can take actions.

The blue arrows and circles now show the actions I am able to take, while the red denotes rewards and the rest of the black arrows and circles denote the transition probabilities and states as before. Notice how even when I take a particular action, such as closing my eyes, the environment (my sweet, sweet bed) only tips me over to falling asleep 80% of the time. But sometimes, like when I focus on meditating, I am 100% sure to reach zen. At least that’s how this Markovian world works.

Recap

Phew, ok, writing these is a lot more demanding than I thought! But also quite fun 🙂 So now you know what it means for a world to have the Markov Property, and you built on that the Markov Process, Markov Reward Process (MRP), and Markov Decision Process (MDP). A cool thing to note is that given a particular policy (we’ll write it as π, a Greek letter, read “pie”), any MDP can be written as an MRP! Which makes sense, because if you have a policy (strategy) to act a certain way whenever you’re at a particular state, then it’s the same as skipping that extra blue bubble and going right ahead to the next state.
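To make “skipping the blue bubble” concrete, here is a rough sketch of how a policy folds the action layer of an MDP into plain state-to-state probabilities. The 80% and 100% come from the example above; the leftover 20% going to “awake”, the 50/50 policy, and the name `induced_transitions` are all assumptions for illustration.

```python
# MDP transitions for one state, "in bed". P[s][a][s2] is the chance of
# landing in s2 after taking action a in state s. The 0.8 and 1.0 come
# from the post; sending the leftover 0.2 to "awake" is an assumption.
P = {
    "in bed": {
        "close eyes": {"asleep": 0.8, "awake": 0.2},
        "focus on meditating": {"zen": 1.0},
    }
}

# A hypothetical policy pi[s][a]: how likely I am to pick action a in state s.
pi = {"in bed": {"close eyes": 0.5, "focus on meditating": 0.5}}


def induced_transitions(P, pi):
    """Average the actions away: P_pi[s][s2] = sum over a of pi[s][a] * P[s][a][s2]."""
    P_pi = {}
    for s, actions in P.items():
        row = {}
        for a, outcomes in actions.items():
            for s2, prob in outcomes.items():
                row[s2] = row.get(s2, 0.0) + pi[s][a] * prob
        P_pi[s] = row
    return P_pi


P_pi = induced_transitions(P, pi)
print(P_pi["in bed"])  # {'asleep': 0.4, 'awake': 0.1, 'zen': 0.5}
```

The result has no actions left in it, just states and probabilities, which is exactly the shape of the table in a Markov (Reward) Process. Rewards can be averaged over the policy in the same way.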

I intended to cover a lot more ground, but I think this is quite enough for one sitting. Next time, I’ll talk about another famous guy named Bellman, about action-value and state-value (they mean pretty much what they sound like), and see how these concepts work toward an idea called Q-Learning, which is what I used to code up an agent that is able to learn how to play RPS from scratch and find an optimal strategy.

That’s it for today!

Love you

July 13, 2018

Mom Learns Tech (0): The Intro

I spent 2 years of my life digging foxholes and writing the occasional ‘fun-sized’ algorithms on pieces of scrap paper. These pieces of paper were soggy yet somehow simultaneously crusty, so I either rolled them up or folded them in a ziplock so they wouldn’t get wet. After all that, I knew I wanted a proper academic challenge. But coming from a military setting back to school was tough. Learning about statistics from Joe Blitzstein and machine learning from Sasha Rush was like going through growing pains all over again (unfortunately, I never really saw any growth). I began shoving concepts and ideas into my brain, and by the end of the year, I was able to do a thing or two. Not perfectly, but reasonably well.

Being Korean and having lived overseas for half my life, I established a semi-regular routine of calling my parents who live in Seoul. The conversations are mostly about how well I’m holding up academically, how horribly I’m screwing up financially, and sometimes how I’m no longer the social butterfly I used to be. And vitamins. Gotta take those vitamins.

But in the end, unlike my years in high school, the scope of our conversations became narrower each year. At the heart of it was my inability to communicate what I was spending most of my time doing: studying computer science.

I’m not an AI fanatic. But I do think that, harnessed properly, AI-backed methods allow us to solve some very interesting problems. I grew up in a family where everyone viewed the world by weighing the value of things on judiciary notes and the constitution, never marveling at Moore’s Law or the latest GPU liquid cooler. Still, I have a passion for languages and how they encapsulate history, culture, and a latent meaning that today’s computational power may help us rediscover. I am fascinated by the way people interact via text, on the phone, and through body language, and I believe that there is a way to decode how we humans continue to create and recreate social behaviors.

Most importantly, I love my mom. I think she is the most wonderful woman in this world (with some fatally cute flaws), and I felt ridiculous needing to fall silent whenever I wanted to explain to her why I decided to use TD instead of MC for a particular RL task, or why I’m starting to think Deep Learning is a hoax and we should all just use XGBoost.

So I decided, once a week or so, I’ll write a post and share it with my mom who has close to no ML background, some statistics background, and a lot of life. Will I be clear and correct in explaining everything? Absolutely not. If I were, I’d be writing a textbook, not a blog post for me and my mom (and the occasional reader). But any constructive criticism is welcome, and I hope to keep this up for enough weeks that by the time I’m back home for Christmas, she and I can open a bottle of wine and talk about why Attention Is Not All You Need.

P.S. Love you dad! But let’s be honest, you’re not really into this ‘stuff’ 🙂

July 10, 2018

Deployment in Lebanon: a postmortem (KOR)

Wrapping up Deployment in Lebanon

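For us, a new beginning means fear and anticipation crossing paths. Because of that subtle blend of emotions, we hold tight to our sense of alertness, keep a thorough all-around watch, and move forward carefully but with vigor. When something bewildering or unexpected happens, we respond quickly. What combat uniform could be lighter and sturdier than a sense of mission? I believe that pouring your soul into everything you do, tipping a part of yourself into the boiling furnace of passion, is the minimum posture and courtesy for accepting a new challenge, and rightly a promise to yourself.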

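All 329 members of the 18th rotation of the Dongmyeong Unit, regardless of post or rank, carried the honor of being “selected personnel” as a proud weight, and to achieve our common goal each of us performed a range of roles from our own positions, from duty at the fixed observation posts to cleaning the mess hall. For the first few months after the unit deployed, joy seeped out of everything, as in a honeymoon. Everyone was grateful for everything and kind to everyone, and minor conflicts came across instead as chances to show off how wisely we could handle them. A good number of the officers and NCOs proudly wore an “I am currently quitting smoking” badge on their UN caps as if it were the Order of Mugunghwa. The soldiers, trying to imitate the operations battalion commander, who made a short, neat haircut look endlessly more stylish than an idol’s flowing locks, shaved their heads one after another, only to find some unidentifiable piece of seafood staring back from the mirror and to feel the leg-splitting gap between 20 years and 20 months of military life. But naturally, with the repetition of the same daily routine, the rust of complacency began to seep in here and there, little by little. The knots of conflict grew more tangled as time passed, and conversation and communication were sometimes lacking and at times even cut off. The Dongmyeong Unit, which had started as one, began to divide into “us” and “them”, “me” and “you”.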

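The hardest part of a deployment is neither the work nor the operations but the people. I heard it so often it could have worn a callus in my ear, but because it is never wrong, everyone from the “sticks” to the “bamboo flowers” (rank-insignia slang for the most junior to the most senior) says it with one voice. Facility discomforts, like the eye-watering ammonia smell in the early morning from urinals with no U-bend, can simply be put up with, and the heightened workload and alert level caused by the “we go together” armed hunters, who turn up regardless of season, holiday, or hour, can simply be overcome. But covering over, again and again, the conflicts that arise from living with others inside the same fence, walking a tightrope of rising tension until the wound festers and spreads, is emotional labor for both sides and an ordeal that hits bystanders like a bolt from the blue. And yet I have come to think that such friction is not only natural but even indispensable to healthy relationships. The point is not the friction but its resolution. And that is precisely why I am proud of and grateful to our Dongmyeong Unit.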

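“Let’s keep our original resolve” is truly hard to put into practice. Getting to the altar is easy, but staying married is harder, which is why we congratulate fresh newlyweds steeped in milky-white fantasies but show even greater respect to a couple reaching their faded golden anniversary. So whenever our commander consistently urged us to keep our original resolve, I felt the weight of those words more with each passing day.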

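On the day a ship sets sail, swollen with anticipation, anyone can be cheerful and in good spirits. But hundreds of thousands of li from land, amid wind, rain, and waves that batter without mercy, I think it takes something special to keep your composure and hope and steer through the crisis. Simply being positive is not enough, nor is being good at your job. After 100 days, then 200, reining in the urge to point fingers at others and to fill your own belly a little more is not something one person can achieve just by throwing themselves at it; a culture has to form that values the respect and consideration that draw people toward one another. That is why our Dongmyeong Unit is truly remarkable. The leaders who set the tone overturned my preconception that “the military focuses only on the mission and lacks flexibility”, and all the Dongmyeong members who followed in step with that guidance were, I think, the best example to the many nations of the UN of what being advanced (先進) means, shown over eight months rather than through any one-day inspection or operation.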

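Hidden heroes are always quietly carrying out their duties around us. It was only in far-off Lebanon that I truly realized that valor and real soldierly spirit do not come solely from the size of one’s firearm (火器) or skill in handling equipment. I feel deep gratitude for all of it: the sight of combative haggling in raised voices, at the Market Walks (마켓웍스) or at the Lebanon House, ending warmly with “habibi” for young and old, men and women alike; the sight of the command sergeant major and other officers and NCOs stepping up themselves to clean every corner of the base, caring for and maintaining it like their own home; the soldiers who shed and rose above the mindset of “I’m just a private” and poured their passion into their duties and handover procedures; and the Dongmyeong members who, facing their various hardships sometimes in silence and sometimes by talking them through, cast (鑄造) frictions large and small not into an uncontrollable blaze but into a warm-hearted ember that could become a more meaningful occasion for empathy.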

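It seems that reading a book’s preface is easy, but reading through to the afterword was hard. It is easy to lose control of conflict and anger and blurt out words that can never be taken back, but truly hard to transmute (鍊金) that misunderstanding into a more united unit. Yet we encouraged and embraced one another in the belief that when a bowl that once held food cracks, it can instead be used as a flowerpot that holds life. After some 240 days on mission, thanks to the many unseen heroes among us, big-hearted, loving souls able to embrace people, mishaps, and even the unit mascot Dongmyeong-i, we not only completed our mission safely and flawlessly but also, among so many nations, shared a closeness and camaraderie grounded in Eastern restraint and propriety (禮). A special unit and experience found nowhere else, in Korea or abroad, and the fact that we brought it to a beautiful finish: that, I think, is the greatest honor, the one that makes us true national representatives of the Republic of Korea.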

April 9, 2017
