“Market Space” as a Multi-Armed Bandit Problem
In the last post, I presented my 30,000-foot view of Market Space. I think it already provides some intuition. From the final diagram, you can literally see the firm’s search problem. But our eventual goal is a more formal model. So we need to drop our altitude a bit.
First, let’s introduce some notation:
- M denotes Market Space.
- F denotes the subspace of M whose dimensions characterize product features. Call it Feature Subspace. For convenience, we’ll use an italicized capital letter as shorthand for a point in a subspace, e.g., F denotes a set of values f1, …, fi in the i dimensions of Feature Subspace.
Thus, a point F defines a bundle of product features. Note that such a point in Feature Subspace also defines another subspace within the whole of M; taking the product features set at F as given, this subspace represents all possible states of the other, non-feature variables. These other variables affect the demand for and cost of features, but from the firm’s perspective, they are more difficult to predict and control.
The firm’s problem is therefore to decide what products it wants to make in the future to maximize “goodness” (be they the same as current products, updates to current products, or completely new products). It has limited resources, so it can only work on a very small fraction of all possible products.
Now, instead of continuing to use the term “goodness”, let’s call it “value”. Conceptually, there’s a function V(F) that calculates the value of a point in M based on the product features chosen by the firm and the states of the non-feature variables. Given this formulation, a firm’s problem is to try to find the points in Feature Subspace that maximize V.
But making an exact prediction of V would require knowing the exact values of all those other variables, which the firm obviously doesn’t. Such a prediction would also require knowing the precise formula reality uses to calculate V, which it doesn’t either. In fact, anyone who has worked on delivering a product knows that the firm probably doesn’t even know the precise location a product will actually occupy in Feature Subspace; a delivered product’s features rarely match the original specification exactly.
So really, a firm chooses a proposed point Fp in Feature Subspace. Then it attempts to estimate the value of that point:
E(Fp) = V(Fa) + εtot
where V(Fa) is the true value of the actual point in Feature Subspace and εtot is the total error from misestimating what features will actually be delivered, the uncertainty of values for all the non-feature variables, and misunderstandings about reality’s true formula for calculating V.
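A toy numerical sketch of this decomposition may help. Everything here is invented for illustration: the true value function, the firm’s misspecified internal model, and the noise levels for feature slippage and non-feature uncertainty.

```python
import random

random.seed(0)

# Hypothetical true value function over a 1-D feature axis, depending on
# both the delivered feature value and an unobserved market state.
def true_value(f, market_state):
    return 10.0 - (f - 3.0) ** 2 + market_state

f_proposed = 3.0                                # Fp: the point the firm aims for
f_actual = f_proposed + random.gauss(0, 0.5)    # Fa: delivery rarely matches spec
market_state = random.gauss(0, 1.0)             # unknown non-feature variables

# The firm's estimate comes from a misspecified model of V (wrong curvature)
# evaluated at the proposed, not actual, point, with the market state unknown.
def firms_model(f):
    return 10.0 - 0.8 * (f - 3.0) ** 2

estimate = firms_model(f_proposed)              # E(Fp)
realized = true_value(f_actual, market_state)   # V(Fa)
eps_tot = estimate - realized                   # so E(Fp) = V(Fa) + eps_tot
print(f"E(Fp) = {estimate:.2f}, V(Fa) = {realized:.2f}, eps_tot = {eps_tot:.2f}")
```

Here eps_tot bundles together all three error sources from the formula: feature slippage, non-feature uncertainty, and model misspecification.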
Enter the Multi-Armed Bandit
It turns out that this formula looks a lot like an existing problem: the multi-armed bandit (MAB). The idea is that you walk into a casino full of slot machines, some of which you believe are positive expected value, but you don’t know the precise payoff distribution of individual machines. You can only pull a limited number of levers per unit time, so you want to allocate those pulls to maximize your long-term payoff. See this survey paper for an overview.
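The casino story can be simulated in a few lines. This sketch uses the classic epsilon-greedy heuristic (one of many ways to play a MAB); the arm payoffs, noise level, and epsilon are all invented for illustration.

```python
import random

random.seed(1)

# Three slot machines with mean payoffs unknown to the player.
true_means = [0.2, 0.5, 0.8]
pulls = [0] * 3
totals = [0.0] * 3

def pull(arm):
    return random.gauss(true_means[arm], 1.0)

# Epsilon-greedy: mostly exploit the best-looking arm, sometimes explore.
epsilon = 0.1
for t in range(2000):
    if random.random() < epsilon or 0 in pulls:
        arm = random.randrange(3)          # explore (or try an untouched arm)
    else:
        arm = max(range(3), key=lambda a: totals[a] / pulls[a])  # exploit
    totals[arm] += pull(arm)
    pulls[arm] += 1

best = max(range(3), key=lambda a: totals[a] / pulls[a])
print("pulls per arm:", pulls, "| estimated best arm:", best)
```

Even this crude rule captures the essential tension: every exploratory pull has an opportunity cost relative to the arm that currently looks best.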
The MAB problem in general goes back to this paper by Herbert Robbins in 1952. Its application to the economics of the firm first appears in this 1974 paper by Michael Rothschild, exploring the problem of price discovery. Then in 1979, Martin Weitzman explored the problem of a firm funding research projects in this paper [Note: I found this paper in particular a fantastic read and highly recommend it]. I think these guys were on the right track. They just didn’t take the approach far enough. My instinct is that the firm’s entire existence is predicated on trying to solve the “master” MAB problem of what products to make (using the term “product” to encapsulate all the characteristics of what the customer experiences).
Personally, I was reintroduced to the MAB at a talk given by Google researcher Steven Scott based on this paper. While the paper discusses a specific Bayesian heuristic for solving MAB problems, called randomized probability matching, I think the Introduction concisely captures the two key challenges of playing MABs:
- “The choice involves a fundamental trade-off between the utility gain from exploiting arms that appear to be doing well… vs exploring arms that might potentially be optimal… The trade-off has also been referred to as ‘earn vs learn’.”
- “The multi-armed bandit problem is notoriously resistant to analysis, although optimal solutions are available in certain special cases. The optimal solutions are typically hard to compute, rely on artificial discount factors, and fail to generalize to realistic reward distributions.”
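For the curious, randomized probability matching can be sketched in the standard Bernoulli-payoff setting with Beta posteriors. The two arms and their click-through-style success rates below are invented for illustration.

```python
import random

random.seed(2)

# Two arms with unknown success probabilities (invented numbers).
true_rates = [0.04, 0.06]
# Beta(1, 1) priors on each arm's rate: alpha counts successes + 1,
# beta counts failures + 1.
alpha = [1, 1]
beta = [1, 1]

for t in range(5000):
    # Randomized probability matching: draw one plausible rate per arm
    # from its posterior and play the arm whose draw is largest. Arms are
    # thus played in proportion to the probability they are optimal.
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(2)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < true_rates[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward

plays = [alpha[a] + beta[a] - 2 for a in range(2)]
print("plays per arm:", plays)
```

The appeal of the heuristic is exactly the “earn vs learn” balance: uncertain arms get sampled optimistically often enough to resolve their posteriors, while clearly inferior arms fade out of play.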
These two challenges immediately lead to two conclusions with respect to firms:
- Firms must allocate their resources between current production activities and searches for future opportunities.
- Firms will not be able to consistently calculate the optimal allocation.
These conclusions are hardly trivial with respect to firms. The first one tells us that “innovate or die” is a mathematical truth. No firm can expect to survive for long without changing its ways. The second one tells firms to plan for failures. No firm can get everything right. Even trying to achieve the minimum possible number of errors is an intractable problem.
My observation is that when faced with such a daunting challenge, but a substantial reward to overcoming it, humans try to cheat. Therefore, we should expect firms to exploit any gaps between reality and the abstract MAB problem. I can think of four immediate possibilities:
- Changing Payoffs. The basic MAB assumes each arm has the same payoff distribution over time. But, from the perspective of a firm, a given product’s success will depend greatly on when it decides to make that product. Vinyl records were great in the 50s, not so great in the 90s. Cell phone cases were OK in the early 00s, very good in the early 10s. These differences stem from the non-feature variables that evolve over time and affect the value function.
- Competition. The basic MAB assumes the player can pull any arm in the casino and capture all the payoffs from that arm. But firms know that making a successful product means they will attract imitators. So it’s as if they either can’t play arms that are already occupied, or players who choose the same arm have to split the payoff.
- Dependent Payoffs. The basic MAB assumes that the payoffs for each arm are independent. But for products, it seems like very similar feature sets would have very similar payoffs, holding the non-feature variables constant. Sure, there may be crucial dimensions to which V(F) is very sensitive, but there will be some significant correlation among Vs for neighboring points in F.
- Nearby Learning. The basic MAB assumes that the player only observes the results from an arm he plays. But firms get some information from products made by other firms by simple observation or reviewing disclosed results. Firms even get some information from arms nobody plays at all if they are close to played arms, perhaps by getting customer feedback or doing market research. To complete the MAB metaphor, I picture a player being vaguely aware of how well the players around him are doing and being very aware when sirens go off announcing a big jackpot win.
Unfortunately, Changing Payoffs makes things worse, not better for the firm. It’s actually known as the “restless bandit” problem and has been studied. Even approximating solutions for restless bandits is PSPACE-hard (potentially harder than NP), according to this paper. [Note: I think this result also argues strongly against trying to pick winners or winning sectors when investing in startups.]
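To make “restless” concrete, here’s a toy sketch (the drift size and starting means are invented) in which each arm’s mean payoff follows its own random walk, so the best arm at one moment need not stay best:

```python
import random

random.seed(3)

# Two products whose underlying appeal drifts over time: each arm's
# mean payoff follows an independent Gaussian random walk.
means = [1.0, 0.0]

best_at = []
for t in range(200):
    means = [m + random.gauss(0, 0.3) for m in means]
    best_at.append(max(range(2), key=lambda a: means[a]))

# With drift this large, leadership typically changes hands repeatedly,
# so past observations of an arm go stale.
print("arm 0 was best in", best_at.count(0), "of 200 periods")
```

This staleness is what breaks the standard machinery: information gathered about an arm decays, so the player can never stop re-exploring.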
There has also been some work on Competition in the MAB problem. This paper shows that the optimal rules for competition produce total outcomes that are only slightly worse than the single-player version. Again, no help to the firm. [Note: I actually think this is a really interesting result that someone in economics should write a paper on. I think it’s a huge endorsement of free markets.]
Luckily, I think Dependent Payoffs and Nearby Learning show promise. A dependency structure and the ability to detect gradients would allow us to apply a lot of standard optimization tools. Such dependency and detectability seem likely. Let’s say a point F1 defines the true iPhone 5S, while F2 defines an alternative iPhone 5S with slightly different physical dimensions, screen resolution, processor speed, etc. V(F1) is going to be highly correlated with V(F2). Moreover, it’s plausible that Apple’s customer research could have produced a reasonable estimate of both V(F1) and V(F2) once they knew the actual V of the previous generation iPhone 5.
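A minimal sketch of how Dependent Payoffs plus Nearby Learning might be exploited, with everything invented for illustration (the one-dimensional feature line, the value curve, the noise level, and the kernel bandwidth): a kernel-weighted average lets a handful of observed launches inform value estimates across their whole neighborhood.

```python
import math
import random

random.seed(4)

# Hypothetical 1-D Feature Subspace: 21 candidate products on a line.
points = [i * 0.5 for i in range(21)]

def true_V(f):
    # Unknown to the firm; invented shape with a "good patch" near f = 6.
    return 5.0 - (f - 6.0) ** 2 / 4.0

observations = []  # (feature point, noisy observed payoff) pairs

def observe(f):
    v = true_V(f) + random.gauss(0, 0.5)
    observations.append((f, v))
    return v

# Nearby Learning: because neighboring payoffs are correlated, a
# kernel-weighted average of observed payoffs estimates V everywhere.
def estimate_V(f, bandwidth=1.0):
    weights = [math.exp(-((f - x) ** 2) / (2 * bandwidth ** 2))
               for x, _ in observations]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * v for w, (_, v) in zip(weights, observations)) / total

# Launch three scattered products, then pick the most promising region.
for f in (1.0, 5.0, 9.0):
    observe(f)
guess = max(points, key=estimate_V)
print("estimated best feature point:", guess)
```

Three launches, yet the firm now has a defensible value estimate over the entire line, which is exactly the leverage a plain MAB player lacks.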
Another way to look at it is that Dependent Payoffs and Nearby Learning make the “option value” of product innovation significantly higher. If a firm goes off and makes a totally radical product, it’s going to learn exactly how much demand there is for that product. But it will also learn something about how much demand there would have been for a bunch of somewhat similar products. It can use this knowledge in the future to take advantage of any initial traction by incrementally zeroing in on a “good patch” of Market Space.
I’ll take this angle of attack in the next post.