
Techniques for Opponent Modeling in Poker

Mike Chase
University of Waterloo, Waterloo, Canada
October 28, 2005


Abstract

Poker presents many challenging problems to artificial intelligence, with the most significant being opponent modeling. This paper examines a variety of current techniques for deducing an opponent's next action and an opponent's hidden cards. This paper also presents a POMDP for modeling opponents whose strategies vary over time.

1 Introduction

The artificial intelligence community has had a long-standing interest in games. However, until recently, the games studied have been deterministic games of perfect information. In contrast, a game like poker is a partially observable, stochastic game: stochastic because cards are dealt randomly to a player, and partially observable because a player does not know with certainty the cards held by his or her opponents. Poker presents other challenges as well. In particular, it is not enough to play according to a Nash equilibrium strategy. A strong AI poker player must exploit the strategies and weaknesses of its opponents in order to maximize its winnings [2]. Thus, it becomes essential for any successful AI poker player to model its opponents.

Opponent modeling is a complex problem. An opponent modeling system faces the challenges of uncertainty and imperfect information, and must determine which factors are important to a player and how a player's strategy changes based on which opponents he or she plays against [5]. Furthermore, AI poker players can only infer an opponent's strategy from previous observations. Expert poker players change strategies often during a game to throw off their opponents, but this is hard to capture, as machine learning techniques tend to require a large number of observations before they converge. Humans also exhibit intuition: a strong human player can determine the strategies of weak or average opponents after a few hands, a feat hard to match in an AI poker player [5].

This paper examines the role of an opponent modeler in an AI poker player and then surveys current techniques, concluding with an examination of potential improvements that would increase the accuracy and quality of opponent modeling systems.

2 Texas Hold'em

The most popular variant of poker studied by the artificial intelligence community is Texas Hold'em, which is widely considered to be the variant of poker requiring the most complicated strategies. This paper briefly outlines the pertinent rules of Hold'em; readers unfamiliar with the game are referred to the poker literature or to papers such as [1].

The goal of a hand of Hold'em is to win the money bet by all players by having the highest-valued five-card poker hand, formed from the two cards in an individual player's hand and the five community cards dealt face-up and available to all players. To begin a hand, each player is dealt two cards face down. The first two players to the left of the dealer pay the small blind and big blind into the common pot. The blinds are forced bets that ensure players are not playing for an empty pot. Each player can choose to match the current betting level, to raise the betting level, or to fold, removing themselves from the rest of the hand without further cost. The betting round stops when every player has either folded or matched the current highest bet. Three community cards, known as the flop, are then dealt for all players to see. Another round of betting follows. Two more community cards are dealt, known as the turn and the river, with a round of betting following each card. After the final round of betting, all remaining players show their hands, and the player with the highest-valued hand wins the pot. If at any point only one player remains before the final showdown (because all other players have folded), that player wins the pot without having to reveal his or her cards.

3 A Framework for Opponent Modeling

An opponent modeling system has two goals. Firstly, it should maintain an accurate probability distribution over an opponent's possible cards, given the opponent's past tendencies and actions this hand. Secondly, it should deduce the action that an opponent will take in a given situation. This paper introduces the terms hand deduction and action deduction to describe these goals.

To understand an opponent modeling system, it is essential to describe the place of these goals in the context of a complete AI poker player. Papp provides two metrics used to help a player evaluate his or her hand in relation to the possible hands held by opponents [8]. Hand strength is the probability that a player's hand is the strongest hand based on all known information, and hand potential is the probability that a player's hand will be the winning hand based on future information (the values of any cards that are not yet dealt). For example, a player holding a suited 5 and 2 has a lower hand strength than an opponent holding an ace and an 8, but if the flop comes 3, 4, and 9 with two cards of the first player's suit, the first player has a higher hand potential, because the first player is more likely to complete either a straight or a flush as more community cards are revealed. These two metrics can be used to influence the decision that an AI poker player will make, and both depend on hand deduction to obtain an accurate idea of what cards an opponent might hold.

The Poker Research Group at the University of Alberta represents a player's actions as probability triples PT = [f, c, r], such that f + c + r = 1.0 [3]. A triple gives the probabilities that a player will fold, call, or raise. Action deduction involves finding a probability triple for an opponent given the current game state, and can be used to indicate how an opponent will respond to a particular play by the player doing the modeling. One such method is to simulate a large number of hands beginning at the current state and to choose the move with the highest expected value. Simulation and other methods will produce higher-quality results when backed by a strong action deduction system.
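
To make the connection between hand deduction and these metrics concrete, the following Python sketch estimates hand strength by enumerating the opponent's possible holdings and weighting each by the probability a hand-deduction model assigns to it. The evaluate() routine, the weights table, and the card representation are hypothetical placeholders, not any published system's code.

    import itertools

    def weighted_hand_strength(our_hole, board, unseen, weights, evaluate):
        # Estimate hand strength: the probability that our hand is currently best,
        # weighting each possible opponent holding by the opponent model's belief.
        # evaluate() is an assumed hand evaluator returning a comparable score for
        # hole cards plus the current board; weights is a hypothetical table keyed
        # by two-card holdings, as produced by a hand-deduction system.
        our_score = evaluate(list(our_hole) + board)
        ahead = behind = tied = 0.0
        for opp_hole in itertools.combinations(unseen, 2):   # every possible holding
            w = weights.get(frozenset(opp_hole), 0.0)        # model's belief in it
            if w == 0.0:
                continue
            opp_score = evaluate(list(opp_hole) + board)
            if our_score > opp_score:
                ahead += w
            elif our_score < opp_score:
                behind += w
            else:
                tied += w
        total = ahead + behind + tied
        return (ahead + 0.5 * tied) / total if total else 0.0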

4 Opponent Modeling Techniques

4.1 Hand Deduction

Loki was the first AI poker player written by members of the Poker Research Group at the University of Alberta. Loki maintained a table of weights for each opponent, capturing a probability distribution over that opponent's possible hidden cards [3]. After the flop, turn, and river cards are dealt, the weight tables are updated for all opponents, since some hands are now less likely to occur. The weight table is also reweighted for an individual opponent after each move he or she makes.

Initial weights for each hand are not just the probability of that hand occurring. For each opponent, Loki determines the frequency of that opponent folding, calling, or raising before the flop, and determines the value of µ, the average winnings that the player obtains for each choice, measured in small bets won per hand [2]. To estimate a prior distribution over the opponent's hand, Loki uses a set of expert rules mapping particular hands to expected income rates. Given a variance σ around the exact threshold needed for an opponent to call, hands with expected income rates less than µ − σ are given a weight of 0.1, hands with expected income rates greater than µ + σ are given a weight of 1.0, and hands in between are given linearly interpolated weights [2]. The weight table is then normalized.

In post-flop play, a similar process is used for updating the weights. The main difference is that the weight table is reweighted based on frequencies and average winnings learned for certain categories. Loki maintains separate counts based on the current round (pre-flop, flop, turn, or river), the action taken (fold, call, or raise), and the action cost (bets of 0, 1, or more than 1 small bet), and uses the corresponding frequencies to reweight the weight table [2]. Loki updates the weight table for each possible hand by first considering the probability that an opponent would have made the move that they did, given the hand that they hold [3]. As an example, suppose the opponent holds a pair of queens, and the corresponding entry in the weight table is 0.5, meaning that there is a 50% chance the opponent has a pair of queens given their observed play so far this hand. If an action deduction system represents the opponent's likely response as the probability triple [0.1, 0.2, 0.7], and the opponent called, the new entry in the weight table is 0.5 × 0.2 = 0.1. The value has decreased because it is more likely that the opponent would have raised instead of calling, but they called anyway.

The table of weights is a robust and simple method. It scales well to any number of opponents and is computationally efficient. The most significant flaw is that the frequency counts are limited to a small number of situations, some of which rarely or never occur in practice, such as a player folding when they could call the current bet without putting in any more money. Another pitfall is that it assumes that the hands of opponents are independent [8].
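
A minimal sketch of the reweighting step just described, assuming an action-deduction routine that returns a probability triple for any candidate holding; the function and data-structure names are illustrative rather than Loki's actual implementation.

    def reweight(weights, observed_action, predict_triple, context):
        # Update one opponent's weight table after observing a single action.
        # weights:         dict mapping a candidate two-card holding to its weight
        # observed_action: 'fold', 'call', or 'raise'
        # predict_triple:  hypothetical action-deduction routine returning
        #                  {'fold': pf, 'call': pc, 'raise': pr} for a holding
        for hand in weights:
            triple = predict_triple(hand, context)        # P(action | hand, game state)
            weights[hand] *= triple[observed_action]      # e.g. 0.5 * 0.2 = 0.1, as in the text
        total = sum(weights.values())
        if total > 0:
            for hand in weights:
                weights[hand] /= total                    # renormalize the table
        return weights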

4.2 Action Deduction

Aaron Davidson used a neural network to model opponent actions, using a wider set of features than Loki used. He used a feed-forward network with a sigmoid activation function, one hidden layer of four nodes, and three output nodes corresponding to the three values in a probability triple [6]. The best results were obtained by training on a set of data capturing six players of varying levels of skill. He found that by training without pre-flop information, the accuracy increased from 55-70% to 75-90%, likely because a network trained on all data is forced to learn generalizations covering both pre-flop and post-flop play, which are very different in practice. Davidson's neural networks also suggested that the previous action and the previous amount called were very significant features. These two features were incorporated into the reweighting algorithm for the table of weights in Poki, Loki's successor, giving a noted improvement [4]. Davidson also suggested training a neural network for each opponent as the game progresses, but it is likely that too many hands would be needed for the network to converge. The results would continually improve, assuming a stationary opponent, but may not be satisfactory overall. For this reason, it may be better to use a pre-trained neural network.

An improvement in accuracy can be obtained by using confusion matrices, such as the matrix in Table 1. A confusion matrix is a 3x3 matrix where the columns indicate the frequency with which an action deduction system predicted a particular action and the rows indicate the frequency with which the opponent actually performed a certain action [6]. The diagonal contains the percentage of correct predictions. Confusion matrices can be used to compensate for the bias of a particular predictor. For example, the predictor represented in Table 1 predicts a raise 11.8% of the time, but 21.2% of predicted raises were actually calls. If we multiply the probability triple by the confusion matrix, we get a more accurate triple that takes the weaknesses of an individual predictor into account.

Table 1: Sample Confusion Matrix

                    Predicted Fold   Predicted Call   Predicted Raise
    Actual Fold          0.19             0.005            0.004
    Actual Call          0.0              0.575            0.025
    Actual Raise         0.0              0.112            0.089
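
Concretely, compensating for predictor bias with Table 1 amounts to a matrix-vector product. The sketch below column-normalizes the matrix into P(actual | predicted) before mixing; that normalization convention is an assumption on my part, since [6] does not spell it out here.

    import numpy as np

    # Confusion matrix from Table 1: rows = actual action, columns = predicted
    # action, entries are joint frequencies (fold, call, raise order).
    confusion = np.array([
        [0.19, 0.005, 0.004],   # actually folded
        [0.0,  0.575, 0.025],   # actually called
        [0.0,  0.112, 0.089],   # actually raised
    ])

    def corrected_triple(predicted):
        # Adjust a raw predicted triple [pf, pc, pr] for the predictor's known bias:
        # each column is normalized to P(actual | predicted), then mixed according
        # to the predicted probabilities and renormalized to a proper triple.
        col_totals = confusion.sum(axis=0)                      # how often each action is predicted
        p_actual_given_pred = confusion / col_totals            # column-normalize
        adjusted = p_actual_given_pred @ np.asarray(predicted)  # expected actual-action distribution
        return adjusted / adjusted.sum()

    print(corrected_triple([0.1, 0.2, 0.7]))  # example input triple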

There are a variety of possible action deduction systems, and there can be many variations of each. For example, a strong AI poker player might have several neural networks, each trained with different amounts of data, such as a full history, only recent events, or samples of both old and recent events [6]. Instead of deciding which single action deduction system to use, Davidson suggests multi-predictor voting: using all available predictors and combining the result of each one, weighted by the accuracy of the predictor [5]. The accuracy is stored as a vector with one component for each action and is obtained from the confusion matrices. Given m predictors, each with probability triple PT_i and accuracy vector a_i, the corresponding calculation is:

    PT = Σ_{i=1}^{m} PT_i ∘ a_i                                            (1)

In Equation 1, the two vectors in each term are combined by multiplying them element-wise (denoted ∘). For example, for predicted probability triple PT_i = [0.3, 0.5, 0.2] and accuracy vector a_i = [0.7, 0.2, 0.1], the resulting unnormalized triple is [0.21, 0.1, 0.02]. In order to avoid penalizing predictors for past inaccuracies, the accuracy can be computed over a window of the last k actions instead of all n previous actions [5].

Davidson also proposed using decision trees as an alternative to neural networks [5]. Decision trees are simpler than neural networks, and information encoded in a tree is more accessible than information encoded in a network. However, Davidson found neural networks to be more accurate in preliminary experiments, which he attributed to the fact that information in poker is noisy and neural networks are more noise-tolerant [5].

These strategies for action deduction are subject to the same factors influencing accuracy and performance as other machine learning approaches. The training data and the parameters used in training are significant, and a major problem is determining how diverse the training data should be. Confusion matrices and multi-predictor voting improve the quality of modeling, but the most significant problem remains what training data to use and how to obtain it. Another challenge is selecting the right set of parameters to supply to an action deduction system. Selecting too many parameters increases the time spent training a neural network, but too few parameters yields a less accurate model. One possibility might be to use beam search and entropy measures to quickly identify the most significant parameters, and then train networks using only those parameters.
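
A small sketch of the voting rule in Equation 1, assuming the accuracy vectors have already been extracted from each predictor's confusion matrix; the final normalization follows the "unnormalized triple" remark above, but the exact convention is an assumption.

    import numpy as np

    def vote(triples, accuracies):
        # Multi-predictor voting (Equation 1): weight each predictor's triple
        # element-wise by its per-action accuracy, sum, and normalize.
        # triples:    list of [pf, pc, pr] predictions, one per predictor
        # accuracies: list of per-action accuracy vectors from confusion matrices
        combined = np.zeros(3)
        for pt, acc in zip(triples, accuracies):
            combined += np.asarray(pt) * np.asarray(acc)   # e.g. [0.3,0.5,0.2] * [0.7,0.2,0.1]
        return combined / combined.sum()                   # normalize back to a triple

    # Worked example from the text: the single term contributes [0.21, 0.1, 0.02]
    # before normalization.
    print(vote([[0.3, 0.5, 0.2]], [[0.7, 0.2, 0.1]]))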

4.3 Bayesian Opponent Modeling

There are two papers that explore the use of Bayesian techniques applied to poker. Both of them capture parts of hand and action deduction in the context of a complete system, and the systems described must be discussed in their entirety instead of focusing on either component.

Korb et al. created four Bayesian networks designed for two-player five-card stud, one for use at each betting round in that game, which can be easily adapted to Texas Hold'em. The networks have different conditional probability tables, but their structures are identical [7]. The purpose of the network is to deduce the probability that the AI player will win the hand, given the current state of the game and the observed actions. Opponent modeling takes place in three nodes in the network: one capturing a distribution over the opponent's actions, another for the current set of community cards, and a third representing the opponent's current hand, including community cards. The most significant drawback of Korb's approach is that the network can only provide generic opponent modeling, since the conditional probability tables are fixed. Two opponents may have different strategies, but the Bayesian network will interpret the same actions by both opponents in the same way. The network also has no history beyond the beginning of the current hand, so patterns in an opponent's play can never be noticed or exploited.

Southey et al. compute a posterior over the strategy space of an opponent in two-player Hold'em. They use Bayes' rule to calculate both P(H_s | θ) and P(H_f | θ), the probabilities of a hand going to the showdown or of one of the players folding, given the opponent's strategy θ [9]. H_s and H_f are tuples of the form (C, D, R_{1:k}, A_{1:k}, B_{1:k}), where C and D represent the player's and opponent's hidden cards, R_i is the set of public cards dealt before each player makes decision i, and A_i and B_i are the i-th decisions made by the player and opponent respectively. A strength of this approach is that P(H_s | θ) and P(H_f | θ) depend on everything that has occurred so far in the hand for both players. To extract an opponent's overall strategy, they begin with a set O = O_s ∪ O_f of observations of the opponent. The posterior over the opponent's strategy space is then:

    P(θ | O) ∝ P(θ) ∏_{H_s ∈ O_s} P(H_s | θ) ∏_{H_f ∈ O_f} P(H_f | θ)      (2)

Southey et al. used two different priors in their experiments, and found an informed prior based on five expert-identified factors to outperform an independent Dirichlet prior [9]. The Bayesian methods, particularly that of Southey et al., are robust and accurate, but they are not likely to scale well to games of poker with many more than two players: the conditional probability tables of both models would need to be extremely large even to make use of a simple set of parameters. Another problem with Korb's method is that the Bayesian network does not model the direct dependence of a player's actions on the actions of other players, an assumption of independence that rarely holds in poker.
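
As an illustration of Equation 2, the sketch below computes the posterior over a finite grid of candidate strategies, assuming a likelihood(h, s) routine for completed hand histories. Southey et al. actually work with richer strategy spaces, informed priors, and sampling approximations, so this is only a toy version.

    import numpy as np

    def strategy_posterior(prior, likelihood, showdown_hands, folded_hands):
        # Toy version of Equation 2 over a discretized strategy space.
        # prior:          array of prior probabilities, one per candidate strategy
        # likelihood:     assumed routine likelihood(hand_history, strategy_index)
        #                 returning P(H | theta) for a showdown or folded hand
        # showdown_hands: observed histories O_s ending in a showdown
        # folded_hands:   observed histories O_f ending in a fold
        log_post = np.log(np.asarray(prior, dtype=float))
        for s in range(len(prior)):
            for h in list(showdown_hands) + list(folded_hands):
                log_post[s] += np.log(likelihood(h, s))   # accumulate the products of Eq. 2
        log_post -= log_post.max()                        # log space avoids underflow over many hands
        posterior = np.exp(log_post)
        return posterior / posterior.sum()                # normalize over candidate strategies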

5 Modeling Strategy Changes

Current opponent modeling techniques make two assumptions. Some assume that the interactions between opponents are independent, and all assume that opponents have a stationary strategy. All that is necessary for these techniques is to capture an opponent's initial strategy well within a small number of hands; since the opponent will never deviate from his or her strategy, the opponent will then be successfully exploited. The main motivation for this is that solving opponent modeling for simpler cases is necessary before considering non-stationary opponents [9]. However, successful opponent modeling systems must assume a non-stationary opponent.

The idea of multi-predictor voting suggests a possible technique that could be used to model non-stationary opponents. Suppose an opponent modeling system uses a table of weights for hand deduction and neural networks for action deduction. Training neural networks on current opponents as the game progresses will take too long to converge and will be unsuccessful against expert opponents. Instead, the new opponent modeler should use a variety of neural networks trained for different strategies. In addition, when switching from one strategy to another, hand deduction can adapt to the new strategy by resetting the frequency counts for the table of weights. Individual strategies could be expert-defined, and one or more neural networks could be trained per strategy, using multi-predictor voting to get the best results for a given strategy.

This approach depends on a mechanism for determining which strategy an opponent is currently playing, which can be addressed using a Partially Observable Markov Decision Process, or POMDP. A POMDP is defined as (S, A, T, R, O, Z) for a set of states S, a set of actions A, a transition probability T = P(s' | s, a), a set of rewards R, a set of observations O, and an observation probability Z = P(o | s', a). For the problem of opponent strategy modeling, each state corresponds to a different strategy that an opponent might play. The POMDP can only estimate the current state (i.e., estimate the opponent's current strategy) based on observations. Actions correspond to the player updating its belief about the strategy an opponent is playing. The transition probabilities should be fixed at 1.0, since when the POMDP chooses to transition to state s', it is certain to actually end up in s'. The reward function is the expected utility of playing against a particular strategy and depends solely on the state, not on the action with which the POMDP reaches that state. The reward for state s could be learned by playing against a stationary opponent using strategy s. Observations are given each time an opponent folds, calls, or raises, and when an opponent reveals his or her cards. Unlike many POMDPs, it is necessary for each state to have an action that transitions back to the same state: an opponent does not change strategy with every action he or she makes, so in order to fully exploit the opponent, the POMDP will remain in the same state for many observations.

This approach should model an opponent more accurately than previous methods because it accounts for an opponent's potential to change strategies. It also takes advantage of significant differences between expected and actual actions. For example, if an action deduction system expects an opponent to fold with high probability and the opponent instead raises, this observation indicates that the opponent could be playing a different strategy, and the POMDP will transition to a new state [6].

There are a few considerations with this approach. Solving POMDPs is computationally complex, so approximations may be appropriate. One possibility is a Hidden Markov Model with strategies as the hidden nodes and the opponent's observed actions as the observed nodes, making a solution more tractable at the expense of solution quality. The model is also highly dependent on expert information. The choice and variety of strategies is significant, as is the set of observations that would prompt the POMDP to transition out of a state s or to remain in s. In practice, one or more temporary states might be needed between states corresponding to specific strategies, since the same observation might occur for different strategies.
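
The core of the proposal is tracking a belief over which strategy the opponent is currently playing. The sketch below uses an ordinary Bayesian belief update with a "sticky" transition model as a stand-in for the deterministic transitions described above; the stickiness parameter and the per-strategy probability triples (produced by the strategy-specific action-deduction networks) are assumptions for illustration.

    import numpy as np

    def update_strategy_belief(belief, observed_action, strategy_triples, stickiness=0.95):
        # Bayesian belief update over candidate opponent strategies after one action.
        # belief:           current probability assigned to each candidate strategy
        # observed_action:  0 = fold, 1 = call, 2 = raise
        # strategy_triples: probability triple predicted for the current state by the
        #                   action-deduction model trained for each strategy
        # stickiness:       assumed probability the opponent keeps the same strategy
        #                   between observations (a modelling choice, not from the text)
        belief = np.asarray(belief, dtype=float)
        n = len(belief)
        # Sticky transition model: mostly remain in the same strategy, occasionally switch.
        transition = np.full((n, n), (1.0 - stickiness) / max(n - 1, 1))
        np.fill_diagonal(transition, stickiness)
        predicted = transition.T @ belief
        # Observation model: how likely each strategy was to produce the observed action.
        likelihood = np.array([t[observed_action] for t in strategy_triples])
        posterior = likelihood * predicted
        return posterior / posterior.sum()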

6 Conclusions and Future Work

This paper examined several strategies for opponent modeling. A table of weights is a strong method for hand deduction, and neural networks, combined with confusion matrices and multi-predictor voting, provide a strong framework for action deduction against a stationary opponent. The Bayesian methods described work well for two-player poker but are probably limited to a small number of opponents. Current techniques do not handle non-stationary opponents well, and much work in this area will be needed to model such opponents with sufficient accuracy. This paper proposed a POMDP for modeling non-stationary opponents that move among a set of distinct strategies.

Future work in opponent modeling should be centred around two areas. One necessary improvement in quality will be refining existing models to account for the fact that opponents do not hold hidden cards that are independent of those of other opponents. More importantly, future research will need to find ways to model non-stationary opponents efficiently and accurately, using POMDPs or other methods.

References
[1] Darse Billings, Aaron Davidson, Jonathan Schaeffer, and Duane Szafron. The challenge of poker. Artificial Intelligence, 134(1-2):201-240, 2002.

[2] Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. Opponent modeling in poker. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), pages 493-498, Madison, WI, 1998. AAAI Press.

[3] Darse Billings, Lourdes Pena, Jonathan Schaeffer, and Duane Szafron. Using probabilistic knowledge and simulation to play poker. In AAAI/IAAI, pages 697-703, 1999.

[4] A. Davidson, D. Billings, J. Schaeffer, and D. Szafron. Improved opponent modeling in poker. In Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI-2000), pages 1467-1473, 2000.

[5] Aaron Davidson. Opponent modeling in poker: Learning and acting in a hostile and uncertain environment.

[6] Aaron Davidson. Using artificial neural networks to model opponents in Texas Hold'em, 1999. http://spaz.ca/aaron/poker/nnpoker.pdf.

[7] Kevin B. Korb, Ann E. Nicholson, and Nathalie Jitnah. Bayesian poker. In UAI-99, pages 343-350, 1999.

[8] D. Papp. Dealing with imperfect information in poker, 1998.

[9] Finnegan Southey, Michael Bowling, Bryce Larson, Carmelo Piccione, Neil Burch, and Darse Billings. Bayes' bluff: Opponent modelling in poker. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 550-558, 2005.
