Introduction
I trained a machine learning model to play Lost Cities.
If you don’t know Lost Cities, it’s a card game played between 2 players. It’s relatively simple; here are the rules: Rulebook. The game is particularly interesting to me because of its reward structure:
- There are 5 expeditions/piles, say red, green, blue, white, and black, and 12 cards per expedition: 3 M’s and the numbers 2-10.
- As soon as you start one of these expeditions, you get -20 points, plus the value of each card you play into it. So if I play the Green 5 to open the green expedition, I get -15 points.
- You cannot play cards with a numerical value less than what’s already in the pile, so the M’s, 2, 3, and 4 are out in the above example.
- M stands for multiplier, and there are 3 per expedition; these multiply your expedition score at the end:
- If I play just 1 M, I get (0 - 20) × 2 = -40 points.
- If I play 2 M’s and then a 9, I have (9 - 20) × 3 = -33 points.
- Lastly, if you have 8 or more cards in the expedition, you get a flat +20 points. (A small scoring example follows this list.)
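To make the arithmetic concrete, here is a minimal Python sketch of scoring a single expedition under the rules above. The function name and card representation are mine, and I’m assuming the flat +20 bonus is added after the multiplier is applied:

```python
def expedition_score(cards):
    """Score one expedition given the cards played into it (in order).

    `cards` is a list of strings: "M" for a multiplier, or "2" through "10".
    Scoring per the rules above: -20 to start, plus card values, times
    (1 + number of multipliers), plus a flat +20 for 8 or more cards.
    """
    if not cards:
        return 0  # an expedition you never start costs nothing
    multipliers = sum(1 for c in cards if c == "M")
    values = sum(int(c) for c in cards if c != "M")
    score = (values - 20) * (1 + multipliers)
    if len(cards) >= 8:
        score += 20  # assumed to be added after the multiplier ("flat +20")
    return score

# Examples from the rules above:
expedition_score(["5"])            # (5 - 20) * 1 = -15
expedition_score(["M"])            # (0 - 20) * 2 = -40
expedition_score(["M", "M", "9"])  # (9 - 20) * 3 = -33
```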
Thus, to maximize rewards, a player must plan well in advance, including deciding whether or not to play multipliers, and which number to start on. There is additional tension due to the fact that you may only have 8 cards in your hand (each turn consists of playing a card and receiving a card), so hoarding cards for an expedition may limit your strategy for other expeditions. Furthermore, there is a strategic element in which you don’t want to tip your opponent off to which expedition you want to focus on. For example, if you play 2 green multipliers, your opponent is incentivized to hold on to green cards, or even play green cards at a loss to themselves in order to prevent you from playing them (while freeing up their hand for more strategic options).
Each turn you must first either play a card into an expedition or discard a card, and second draw a card from the deck or a discard pile. There are expedition-specific discard piles that are shared between the players, and you may only draw from the top of the pile; it’s a LIFO queue. You always start and end at 8 cards. The game ends as soon as the last card in the deck is drawn.
Rewards
Providing a proper reward to the model is very important because I want my model to maximize future rewards. I considered several possible reward schemes:
- Just your score
- Your score, with a bonus for winning and a penalty for losing
- The differential between your score and the opponents score
- The differential with the bonus/penalty.
- The differential, but only once at the end of the game.
- A flat +1 for winning, and a -1 for losing, just at the end.
Let’s walk through some initial thoughts on each of those reward schemes:
- Nice and simple, and directly linked to the game. An issue is that having a high score does not equal winning, since your opponent could have an even higher score (perhaps it is sometimes optimal to hinder your opponent rather than increase your own score?)
- With the ending bonus, we could choose a constant large enough that the winner always ends up with the higher reward. This is also nice because the bonus directly rewards… well, winning. The score really is just a proxy for winning.
- The differential more directly encodes the fact that this is a zero-sum game, at least in terms of winning/losing. Theoretically this might more frequently lead the agent to explore options that hinder the opponent rather than only choosing actions that improve its own score.
- Similar to above.
- Only showing the reward at the end of the game avoids a crucial problem: the initial score hit when starting an expedition. One could imagine that our agent may never learn to play multipliers, because playing one immediately incurs a penalty of -40, which might be painful enough to permanently close off that avenue for exploration.
- Perhaps the purest reward function of them all: you receive a unit reward for winning and a unit penalty for losing.
I ended up choosing option 5. I wanted to avoid acute penalties, but I also wanted the agents to have some idea of the margin they win by. Perhaps that will provide more information to help them see the precise effects of their actions.
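As a sketch, the chosen scheme boils down to a sparse terminal reward. The function name and signature here are hypothetical, not taken from the actual codebase:

```python
def reward(my_score: int, opponent_score: int, game_over: bool) -> float:
    """Reward scheme 5: zero at every intermediate step, and the raw score
    differential (no win/loss bonus) once at the end of the game."""
    if not game_over:
        return 0.0
    return float(my_score - opponent_score)
```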
Game Representation
How does my agent understand the game? I ended up passing in a vector of length 901, broken down as follows:
| Range | Representation |
| --- | --- |
| 0 | # of cards left, as a fraction |
| 1-720 | Discard piles |
| 721-780 | Player hand |
| 781-792 | Player 1 Pile 1 |
| 793-804 | Player 1 Pile 2 |
| 805-816 | Player 1 Pile 3 |
| 817-828 | Player 1 Pile 4 |
| 829-840 | Player 1 Pile 5 |
| 841-852 | Player 2 Pile 1 |
| 853-864 | Player 2 Pile 2 |
| 865-876 | Player 2 Pile 3 |
| 877-888 | Player 2 Pile 4 |
| 889-900 | Player 2 Pile 5 |
- The player hand is encoded as a binary vector of length 60 with 8 ones, representing which cards are in the hand. You can think of the first 12 of the 60 as the cards of the green expedition, the next 12 the red, etc.
- The piles established by each player are public information; those 10 piles are again represented by length-12 binary vectors indicating which cards have been played into them.
- The brutal size of the discard pile encoding is necessitated by one fact: order matters. You can only interact with the top card. For a given expedition, there are 12 possible cards that can be discarded, and the simplest way to encode a card is a length-12 one-hot vector. Thus 5 expeditions, 12 possible slots per pile, and a length-12 vector per card gives a total length of 5 × 12 × 12 = 720.
Many of you will question the decision to dedicate 80% of the state representation to the discard piles, and I would like to say that I question it as well. I decided on the larger state in order to lose less information, but I think the model might struggle to learn properly, especially when encountering situations in which the discard piles are unusually full (which empirically seems very rare).
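To make the layout concrete, here is a rough sketch of how such a state vector could be assembled. The card indexing (ranks 0-2 for multipliers, 3-11 for the numbers 2-10, suit-major ordering) and the input data structures are my assumptions for illustration, not necessarily how the actual implementation works:

```python
import numpy as np

def encode_state(cards_left_frac, discard_piles, hand, my_piles, opp_piles):
    """Sketch of the length-901 state described above.

    Assumed (hypothetical) inputs:
      cards_left_frac: fraction of the deck remaining, in [0, 1]
      discard_piles:   5 lists of ranks (0-11), bottom to top, one per expedition
      hand:            list of (suit, rank) pairs, 8 of them
      my_piles, opp_piles: 5 lists of ranks each, the cards played per expedition
    Ranks 0-2 are the multipliers, ranks 3-11 are the number cards 2-10.
    """
    state = np.zeros(901, dtype=np.float32)
    state[0] = cards_left_frac

    # Discard piles: 5 expeditions x 12 slots x length-12 one-hot = 720 entries.
    for suit, pile in enumerate(discard_piles):
        for slot, rank in enumerate(pile):
            state[1 + suit * 144 + slot * 12 + rank] = 1.0

    # Hand: 8 ones in a length-60 block, grouped by expedition.
    for suit, rank in hand:
        state[721 + suit * 12 + rank] = 1.0

    # Expedition piles: a length-12 block per (player, expedition).
    for offset, piles in ((781, my_piles), (841, opp_piles)):
        for suit, pile in enumerate(piles):
            for rank in pile:
                state[offset + suit * 12 + rank] = 1.0
    return state
```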
Actions are a vector of length 720, which can be decomposed into the combination of 2 phases:
- Action Phase: You can play a card into one of your piles or the shared discard piles. Since the cards themselves have an expedition attribute, we can simplify this to just playing or discarding one of the 60 possible cards, for 120 possible actions in this phase. (For example, playing the Red 8 means you either have to put it in the red expedition or the red discard pile.)
- Draw Phase: You can draw from the deck or from one of the 5 discard piles, for a total of 6 possible actions in this phase. Combining and flattening this gives 120 × 6 = 720 possible actions (one way to index this flattened space is sketched below).
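Purely for illustration, here is one way such a flattened action space could be indexed; the actual encoding used in the project may differ:

```python
def decode_action(action_id: int):
    """Unflatten one of the 720 action ids, assuming the (hypothetical)
    layout action_id = play_action * 6 + draw_action.

    play_action (0-119): a card index 0-59 plus a play-vs-discard flag
    draw_action (0-5):   0 = draw from the deck, 1-5 = draw from a discard pile
    """
    play_action, draw_action = divmod(action_id, 6)
    card, is_discard = divmod(play_action, 2)
    return card, bool(is_discard), draw_action
```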
Action masking is used to prevent impossible actions: playing a card that isn’t in the agent’s hand, drawing from an empty discard pile, or discarding a card and then immediately drawing it back.
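A rough sketch of what such a mask might look like, reusing the hypothetical encoding from `decode_action` above; the legality checks follow the rules described earlier, not the author’s actual code:

```python
import numpy as np

def card_value(rank: int) -> int:
    """Ranks 0-2 are multipliers (ordered as value 0); ranks 3-11 are cards 2-10."""
    return 0 if rank < 3 else rank - 1

def action_mask(hand, my_piles, discard_piles):
    """Boolean mask over the 720 flattened actions.

    hand:          set of card indices (0-59) currently held
    my_piles:      5 lists of ranks already played into my expeditions
    discard_piles: 5 lists of ranks in the shared discard piles
    """
    mask = np.zeros(720, dtype=bool)
    for action_id in range(720):
        card, is_discard, draw = decode_action(action_id)
        suit, rank = divmod(card, 12)
        if card not in hand:
            continue  # can only play or discard a card you hold
        if not is_discard and any(card_value(rank) < card_value(r) for r in my_piles[suit]):
            continue  # expedition cards must not decrease in value
        if draw > 0 and not discard_piles[draw - 1]:
            continue  # can't draw from an empty discard pile
        if is_discard and draw - 1 == suit:
            continue  # can't immediately draw back the card just discarded
        mask[action_id] = True
    return mask
```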
Model
I chose Proximal Policy Optimization (PPO) as my training algorithm. There are many great resources online