Lessons from OpenAI Five

I was fortunate to attend the OpenAI Five event in San Francisco today and play on stage, streamed live on Twitch. I was floored by the performance of the OpenAI bots. What follows is my take on the bots and Dota.

The bots seemed to approach the game differently from humans. Both the bots and we humans know that the goal of the game is to destroy the enemy Ancient, but we approach that goal differently. For example:

def Human(game):
    try: last_hit_creeps(game)
    except Exception: group_as_five(game)

As opposed to

def OpenAI(game):
    try: group_as_five(game)
    except Exception: last_hit_creeps(game)

Is the above a fair simplification of the process? I checked OpenAI’s website and found a page detailing the weights used in the bots’ reward function. It appears the numbers start out random and are tuned toward their optimal values over large empirical datasets. One important thing to notice is the high priority given to buildings; the weight on last hitting, by contrast, is small. However, let’s compare the total weight given to buildings versus last hitting once the gold and XP earned are taken into account.

score for last hitting a creep wave =
    number of creeps * last hit weight + XP of creeps * XP weight + gold of creeps * gold weight

melee creeps = 3 * 0.16 + 57 * 0.002 + 150 * 0.006 = 1.494
ranged creep = 1 * 0.16 + 69 * 0.002 + 55 * 0.006 = 0.628
creep score = 1.494 + 0.628 = 2.122

This is the basic score for one pre-5-minute creep wave with no denies.

score for last hitting a tower = tower bounty * gold weight + T1 tower weight
score for last hitting a tower = 150 * 0.006 + 0.75 = 1.65

This is the basic score for one T1 tower.
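To double-check the napkin math, here is a small Python sketch of the two scores. The weights are the ones used in the calculation above; the creep counts, XP, and gold figures are the same rough per-wave assumptions, and the function names are my own. Run as written, it gives about 1.494 + 0.628 ≈ 2.12 for the wave versus 1.65 for a T1 tower.

```python
# Weights as used in the napkin math in this post.
LAST_HIT_WEIGHT = 0.16
XP_WEIGHT = 0.002
GOLD_WEIGHT = 0.006
T1_TOWER_WEIGHT = 0.75


def creep_score(num_creeps: int, xp: float, gold: float) -> float:
    """Score for last hitting a group of creeps (no denies)."""
    return num_creeps * LAST_HIT_WEIGHT + xp * XP_WEIGHT + gold * GOLD_WEIGHT


def tower_score(bounty: float, tower_weight: float = T1_TOWER_WEIGHT) -> float:
    """Score for last hitting a tower: gold bounty plus the building weight."""
    return bounty * GOLD_WEIGHT + tower_weight


# Rough assumptions for one pre-5-minute wave, taken from the post.
melee = creep_score(3, 57, 150)
ranged = creep_score(1, 69, 55)
wave = melee + ranged
t1 = tower_score(150)

print(f"melee={melee:.3f} ranged={ranged:.3f} wave={wave:.3f} t1={t1:.3f}")
# → melee=1.494 ranged=0.628 wave=2.122 t1=1.650
```

Keeping the weights as named constants makes it easy to plug in other values (a T2 tower, a wave with denies) without redoing the arithmetic by hand.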

In the bot match I played, we lost our first T1 around the 3-minute mark. The bots were efficient, giving up only about one wave of farm per tower, and they moved quickly from each objective back to the lanes or on to the next objective. By napkin math, a perfectly CS-ed wave is worth more than a T1 tower. One element I did not weigh, however, is what OpenAI calls “team spirit”: the bots value their teammates’ scores very highly. How often have I played a pub where my safelane carry refuses to TP because “it wouldn’t be worth it for me to miss this wave”? Bots, in contrast, are happy to rotate early (their mid laner left the lane at level 4 to secure the top T1). Pushing as five is not hard-coded; they learned that cooperation outweighs individual success. We also saw this in their cores letting supports take last hits in lane.
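Team spirit can be sketched as a simple blend. As I understand OpenAI’s description, it is a number between 0 and 1 that sets how much each hero cares about its own reward versus the average of the team’s rewards. The function below is my own minimal sketch of that idea, not OpenAI’s code, and the reward values in the example are made up.

```python
from statistics import fmean


def blend_team_spirit(rewards: list[float], team_spirit: float) -> list[float]:
    """Blend each hero's own reward with the team's average reward.

    team_spirit = 0.0: each hero only cares about its own reward.
    team_spirit = 1.0: each hero receives the plain team average.
    """
    if not 0.0 <= team_spirit <= 1.0:
        raise ValueError("team_spirit must be between 0 and 1")
    team_avg = fmean(rewards)
    return [(1 - team_spirit) * r + team_spirit * team_avg for r in rewards]


# Made-up example: the carry earns 2.0 from a wave, the other four heroes earn 0.
blended = blend_team_spirit([2.0, 0.0, 0.0, 0.0, 0.0], team_spirit=0.5)
print([round(x, 3) for x in blended])  # → [1.2, 0.2, 0.2, 0.2, 0.2]
```

The blend keeps the team’s total reward the same but spreads the credit around, so a support’s reward goes up when the carry farms and the carry’s reward goes up when a teammate takes an objective.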

So what about game 3, casters vs. bots? Why did the bots lose so badly? One problem Purge noted is that the bots were not prioritizing farming the other lanes. They would cut creeps and hit towers, but they weren’t TPing away as soon as they saw an enemy. Slark was close to his Shadow Blade for a long time, yet never cleared a few camps to buy it. My brother saw this as a fatal failure to generalize: the bots could not adapt to a poor game situation and therefore are not as good as he thought. I am here to defend our future overlords (may the bots remember my piety).

Not letting the bots draft their own heroes renders their empirical dataset useless. The reward functions are specifically designed to maximize their probability of winning a game given certain assumptions, such as the draft. Purge asks why they don’t finish their items and split the map better. It’s because they can’t: that is not part of what they learned about how to win games. With their optimal lineup, farming a lane to finish an item would be a waste. One of my friends argues that changing the draft conditions is as severe as adding a new hero. It changes the immensely complex landscape that is Dota 2 into one the bots have never seen. The early OpenAI blog posts tell us the first agents wandered around the map aimlessly before even learning to last hit. Similarly, the bots in game 3 stayed in the trees or backed off because they didn’t know what to do.

In summary, within the conditions the bots have trained on, they will always win. Change the fundamental assumptions of that model, and sure, humans will win.

 

I’m going to walk step by step through my 13-minute game against the bots.

Draft

The bots were allowed to draft their five heroes after we had picked all five of ours, so we had no opportunity to counter their draft. I also did not know we were playing on an older patch; Necro had only recently received some much-needed buffs. Additionally, we picked Slark because one of our players was only comfortable on Slark or Sven. I remember hearing Blitz say the bots group early, but I didn’t push back on the Slark pick, even though Slark is a very weak early teamfighter. A human captain always has to balance comfort against optimality.

Laning

The bots know Slark is a weak laner, so they trilaned against him even though dual lanes were the meta. This might seem small, but humans are very prone to following the status quo. Dual lanes are the meta? We go dual lanes. To protect our weak carry, CM TPs top to help that lane, leaving me in a 1v1 as Necro vs DP. Because we didn’t spot the top trilane until later, CM only TPs after 1.5 to 2 creep waves in my lane. As a result, DP hits level 2 before me and I get zoned out of the lane early, missing a ranged creep. The bot is already ferrying salves to himself, while my first courier carries a Null Talisman recipe and a Magic Wand recipe. The bot has higher base damage, a level advantage, and now more regen. It also doesn’t seem to mind missing a few last hits in order to trade more hits with me. Interestingly, walking into the fog immediately stops the bot from trading with me, as long as I’m not within kill range.

Pushing

The bots push hard and early as five. As a human, I saw DP leaving my lane as a chance to hit a few creeps. I did not defend my tower, and they just kept pushing. They always had the threat of Ravage, the strength of DP’s ult, and the sustain from WD. With Slark, Viper, and two disablers, our team didn’t have enough damage to burst someone low enough for a Necro ult. We lacked reach. The bots were happy to throw spells and just chip away at us, making it hard to defend. If they saw we weren’t grouped as five, they committed and killed one of us. A popular human Dota technique is sending one person to defend while the cores push out the other lanes. The bots don’t allow that: they know you’re out of range and they’re in range. The tower pattern they went for was T1, T2 safe, T1 mid, T3 safe, ranged rax, melee rax, a classic sequence for seasoned players.

Late Game

Yeah, right. What late game?

Lessons

The bots are not afraid of going high ground early, something humans are often taught to fear. They do not take Roshan before going high ground, as humans are drilled to do. How often has my team had a strong push going mid, only to back off and secure Rosh first (often delaying the next push by a few minutes)? Is it possible humans are just risk averse, unwilling to suffer the consequences of a failed high ground push? Or perhaps humans simply don’t prioritize pushing as much as farming?

The bots also make me view the traditional roles differently. They don’t have a 1, 2, 3, 4, 5; they have someone with reach, someone with damage, someone with control. A lane goes to whichever hero most needs farm to move the game toward victory. As a drafter, I often say things like “Ok, now I need to pick a carry” or, even worse, “Ok, now I need to pick a Frempo hero” (or whoever’s). The bots think about heroes in terms of who adds the greatest marginal benefit. They don’t think about roles or comfort picks.
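Here is a toy sketch of what “greatest marginal benefit” drafting could look like. Everything in it is hypothetical: the placeholder hero names, the 0 to 3 trait ratings, and the per-trait caps are invented for illustration, and this greedy picker is not how OpenAI’s drafter works. Each pick goes to whichever remaining hero raises the team’s capped reach, damage, and control total the most.

```python
# Hypothetical placeholder heroes with made-up 0-3 ratings (not game data).
HEROES = {
    "blink_initiator": {"reach": 3, "damage": 1, "control": 2},
    "nuker": {"reach": 1, "damage": 3, "control": 0},
    "stunner": {"reach": 1, "damage": 0, "control": 3},
    "hard_carry": {"reach": 0, "damage": 3, "control": 0},
    "healer": {"reach": 0, "damage": 0, "control": 1},
}

# Assumption: beyond these caps, more of a trait adds nothing (diminishing returns).
CAP = {"reach": 3, "damage": 4, "control": 4}


def team_value(team: list[str]) -> int:
    """Sum each trait across the team, capped per trait, then add the traits up."""
    return sum(
        min(sum(HEROES[h][trait] for h in team), cap)
        for trait, cap in CAP.items()
    )


def next_pick(team: list[str]) -> str:
    """Greedily choose the remaining hero with the largest marginal gain."""
    pool = [h for h in HEROES if h not in team]
    return max(pool, key=lambda h: team_value(team + [h]) - team_value(team))


team = ["nuker"]
for _ in range(2):
    team.append(next_pick(team))
print(team)  # → ['nuker', 'blink_initiator', 'stunner']
```

With a nuker already in, the hard carry adds almost nothing (and nothing at all once the initiator is picked) because damage is already near its cap, so the picker reaches for reach and control instead of “the carry slot”.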

When I get home, there are a few things I want to try:
1.) Lane mechanics. This means trading more with the opposing core and playing with different levels of regen.
2.) Open-minded drafting. Less concern about the meta and more focus on “how do I win this game?” What does my next pick need to offer my team?