Many recent caching systems aim to lower miss ratios of web caches, but there is no good sense among practitioners of how much further improvement can be achieved. We seek to answer this question by deriving the offline optimal cache miss ratio for Internet request traces.
While caching is a well-studied problem, most existing work focuses on the case where all objects have equal sizes. In contrast, Internet object sizes vary by several orders of magnitude. In this case, the offline optimum is strongly NP complete, and few approximation results are known. The best known result is a 4 approximation, which is insufficient for practical purposes.
We study offline caching under the assumption of stochastic independence (objects are now independently sampled with some probability) and as the number of objects goes to infinity. We relax the offline caching ILP, and show that we can still assign integer values to all but a vanishing fraction of the decision variables. Our key insight is a new representation of caching as a min-cost flow problem, which exposes a precedence relation that enforces integer solutions for almost all decision variables.