The payback math compounds, and so do efficiency techniques. Microsoft's application US20230316042A1 (“Mixture of experts models with sparsified weights,” published 2023-10-05) stacks two of them: mixture-of-experts routing and sparsified weights. It is assigned to Microsoft Technology Licensing, LLC, with inventors including Doug Burger and Eric Chung.

Each technique is a cost lever on its own. Mixture-of-experts activates only a subset of the model's experts per input; weight sparsification zeroes out unneeded weights so the computation can skip them. Together they attack inference cost from two directions: fewer experts engaged, and cheaper math within those experts. The point is a model that is large in capacity but lean in cost per query.
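To make the two levers concrete, here is a minimal sketch in plain Python. It is not the method claimed in the application: top-k softmax gating and magnitude pruning are my assumptions, chosen because they are common ways to do routing and sparsification, and every name (`sparsify`, `sparse_matvec`, `moe_layer`) and size below is hypothetical. It only counts multiply-accumulates inside the experts, to show how the two savings stack.

```python
import math
import random

def sparsify(weights, keep_fraction):
    """Magnitude pruning (an assumed technique): zero all but the largest weights."""
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    keep = max(1, int(len(magnitudes) * keep_fraction))
    threshold = magnitudes[keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

def sparse_matvec(weights, x):
    """Matrix-vector product that skips zeroed weights and counts the multiplies done."""
    out, macs = [], 0
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w != 0.0:
                acc += w * xi
                macs += 1
        out.append(acc)
    return out, macs

def moe_layer(x, gate, experts, top_k):
    """Route x to the top_k highest-scoring experts and mix their outputs."""
    scores = [sum(g * xi for g, xi in zip(row, x)) for row in gate]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the chosen experts only, so their mixing weights sum to 1.
    peak = max(scores[i] for i in chosen)
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    out = [0.0] * len(experts[0])
    macs = 0
    for i, e in zip(chosen, exps):
        y, n = sparse_matvec(experts[i], x)
        macs += n
        for j, v in enumerate(y):
            out[j] += (e / total) * v
    return out, macs, chosen

# Hypothetical sizes, picked only for illustration.
random.seed(0)
DIM, NUM_EXPERTS, TOP_K, KEEP = 16, 8, 2, 0.25

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

x = [random.gauss(0, 1) for _ in range(DIM)]
gate = rand_matrix(NUM_EXPERTS, DIM)
experts = [sparsify(rand_matrix(DIM, DIM), KEEP) for _ in range(NUM_EXPERTS)]

out, macs, chosen = moe_layer(x, gate, experts, TOP_K)

all_dense_macs = NUM_EXPERTS * DIM * DIM   # every expert, every weight
moe_dense_macs = TOP_K * DIM * DIM         # routing only
print("all experts, dense:", all_dense_macs)
print("top-k experts, dense:", moe_dense_macs)
print("top-k experts, sparsified:", macs)
```

The counts ignore the cost of the gate itself and assume the hardware can actually skip zeroed weights, which in practice depends on the sparsity pattern and the kernel. The shape of the result is the point: routing cuts the work by the ratio of experts engaged, and sparsification cuts what remains by the fraction of weights kept.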

“A method is presented for operating a machine learning model including one or more mixture of experts layers. The method comprises receiving one or more input data shards at a routing gate network for a mixture of experts layer comprising a plurality of neural network experts.” (U.S. Patent Application 2023/0316042 A1)

Microsoft, as usual, routes AI revenue through cloud and productivity segments and discloses no technique-level economics. The application is the granular record under the cost story: dated 2023, assigned to Microsoft, and aimed at making large-model serving economically viable.

Published is not granted, so the scope of any eventual claims is unsettled. I attach no number to the savings, because no filing isolates them. What the application documents is that Microsoft was patenting compounding inference-efficiency techniques in 2023, exactly the period when the cost of serving large models became the industry's central business question.

For the infrastructure desk, the frame is stacking. The companies that win on inference economics won't rely on a single trick; they'll layer routing, sparsity, and quantization. A patent that combines two of those in one filing is a primary document showing the layering strategy is deliberate, not incidental.