---
config:
theme: default
---
graph LR
D[Data] --> M[Model]
M --> P[Prediction]
P --> S["System"]
S -->|either| O["Decision or<br> outcome"]
S -->|or| H["Human"]
H --> O
1 Machine learning systems
1.1 Why are successful machine learning systems so difficult to build?
At first, machine learning feels easy once you understand the modeling element: we can take a dataset, train a model, and show a demo that looks impressive. But in production, AI/ML is notorious for being a project graveyard.
For example, a Fall 2025 report summarizes interviews with over 100 executives and senior leaders about the extent to which their organization had successfully adopted task-specific GenAI tools marketed to enterprise customers. 60% said their company had investigated one or more task-specific GenAI tools, but only 5% said such a tool was successfully implemented.1
One interviewee said: “The hype on LinkedIn says everything has changed, but in our operations, nothing fundamental has shifted.” Another (CIO) said: “We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.”
This is a familiar pattern: you can get to an impressive ML prototype quickly, but the last stretch to turn that into a reliable and useful system is much riskier and more expensive. This isn’t a new idea, either: in 2015, the seminal paper2 that established the field of MLOps noted: “developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.”
This is obviously bad! Projects that never ship waste time and money. Projects that ship and then fail in production can do even more harm: they can harm users, create operational chaos, and increase risk for the organization.
1.2 Case study: Zillow Offers
One of the classic first “toy” regression examples is house price prediction. House price prediction also happens to be behind one of the most infamous ML system failures.
In 2018, Zillow, an online real estate marketplace, launched Zillow Offers: a service that allowed Zillow to buy homes directly from owners, renovate them, and resell them.

The business depends on forecasting near-term home prices, making an offer to the homeowner that is high enough to be accepted but low enough to preserve profit margins, then doing any necessary renovations and reselling the house for a profit.

Zillow’s Zestimate is reportedly pretty good. Zillow publicly reports a median absolute percentage error of less than 2% for homes that are actively listed for sale, meaning that for half of on-market homes, the Zestimate is within 2% of the eventual sale price. This number varies by region - for the New York metro area, it’s 2.83% and for San Francisco, it’s 3.08% - but those are the worst results of all major metro areas. If you are willing to tolerate more error, 83.95% of sales are within 5% of their Zestimate (72.07% in New York metro area), 95.39% are within 10% (92.35% in New York), and 98.83% are within 20% (98.72% in New York).
Building on this very good model, the home-buying business was intended to work as follows: Zillow estimates the price at which it can resell a home, net of expected renovation costs. It then makes an offer to the homeowner that preserves a target profit margin after those costs. If the owner accepts, Zillow purchases the home, completes the renovations, and resells it, keeping some profit. The following table shows a hypothetical example, with a median absolute percentage error of 2.45%.
| Home | Zestimate | Renovation | Offer | Resale Price | Error | Zillow Profit |
|---|---|---|---|---|---|---|
| A | 314 | 3.5 | 295 | 320 | -1.9% | 21.5 |
| B | 360 | 3.5 | 339 | 355 | +1.4% | 12.5 |
| C | 404 | 3.5 | 380 | 410 | -1.5% | 26.5 |
| D | 470 | 3.5 | 444 | 465 | +1.1% | 17.5 |
| E | 511 | 3.5 | 482 | 520 | -1.7% | 34.5 |
| F | 558 | 3.5 | 526 | 575 | -3% | 45.5 |
| G | 628 | 3.5 | 593 | 610 | +3% | 13.5 |
| H | 629 | 3.5 | 593 | 660 | -4.7% | 63.5 |
| I | 770 | 3.5 | 731 | 720 | +6.9% | -14.5 |
| J | 835 | 3.5 | 793 | 780 | +7.1% | -16.5 |
| Total | | | | | | 204 |
To see how this played out financially, we can look at Zillow’s Homes segment adjusted EBITDA as reported to their stockholders. We can think of this as a rough measure of operating profit or loss. When it is negative, the business is losing money.
What went wrong? Even if their model prices homes correctly on average - with some overestimates and some underestimates - homeowners are not equally likely to accept offers across those errors. Owners of lower-quality or harder-to-sell homes, where the Zestimate was higher than the true value, are more likely to accept, while owners of higher-quality homes know they can get a better sale price and tend to reject. (Note that the owners have “inside information” about the condition of their homes, which Zillow does not have access to!) So even if the pricing model is correct on average, the expected value of the purchased homes is lower.
For example, here’s our hypothetical example again, but without assuming that all owners accept the offer:
| Home | Zestimate | Renovation | Offer | Resale Price | Error | Zillow Profit |
|---|---|---|---|---|---|---|
| A | 314 | 3.5 | 295 | 320 | -1.9% | 21.5 |
| B | 360 | 3.5 | 339 | 355 | +1.4% | 12.5 |
| C | 404 | 3.5 | 380 | | | |
| D | 470 | 3.5 | 444 | 465 | +1.1% | 17.5 |
| E | 511 | 3.5 | 482 | | | |
| F | 558 | 3.5 | 526 | | | |
| G | 628 | 3.5 | 593 | 610 | +3% | 13.5 |
| H | 629 | 3.5 | 593 | | | |
| I | 770 | 3.5 | 731 | 720 | +6.9% | -14.5 |
| J | 835 | 3.5 | 793 | 780 | +7.1% | -16.5 |
| Total | | | | | | 16.5 |
Zillow Offers also operated in a competitive market with similar home-buying businesses run by Redfin, Opendoor, and Offerpad. This competition introduced a classic Winner’s Curse dynamic: when multiple firms made offers on the same home, the seller would accept the highest bid. As a result, the company whose model made the largest positive error - overestimating the home’s value the most - was the most likely to “win” the home. This systematically biased acquisitions even more toward overpriced homes.3
That’s not the only problem. The Zestimate error numbers we cited earlier were for homes that were already listed for sale. For these homes, Zillow gets updated images, listing price, description, and “days on market” features that are only available for homes that are listed for sale - and that make its Zestimates much more accurate. (Many users report that after listing a home for sale, the Zestimate is updated to align more closely with the listing price.)
The Zestimates for off-market homes typically have much higher error, with a median absolute percentage error around 7% - and Zillow was also making offers on homes that were not listed for sale. What happens in our hypothetical if we have this higher median absolute error?
| Home | Zestimate | Renovation | Offer | Resale Price | Error | Zillow Profit |
|---|---|---|---|---|---|---|
| A | 308 | 3.5 | 289 | 320 | -3.75% | 27.5 |
| B | 340.8 | 3.5 | 320 | 355 | -4% | 31.5 |
| C | 381.3 | 3.5 | 357 | 410 | -7% | 49.5 |
| D | 395.2 | 3.5 | 369 | 465 | -15% | 92.5 |
| E | 374.4 | 3.5 | 345 | 520 | -28% | 171.5 |
| F | 579.3 | 3.5 | 547 | 575 | +0.75% | 24.5 |
| G | 640.5 | 3.5 | 607 | 610 | +5% | -0.5 |
| H | 712.8 | 3.5 | 676 | 660 | +8% | -19.5 |
| I | 849.6 | 3.5 | 810 | 720 | +18% | -93.5 |
| J | 982.8 | 3.5 | 940 | 780 | +26% | -163.5 |
| Total | | | | | | 120 |
Considering the bias in favor of too-high offers being accepted, it becomes much more difficult to stay profitable:
| Home | Zestimate | Renovation | Offer | Resale Price | Error | Zillow Profit |
|---|---|---|---|---|---|---|
| A | 308 | 3.5 | 289 | 320 | -3.75% | 27.5 |
| B | 340.8 | 3.5 | 320 | | | |
| C | 381.3 | 3.5 | 357 | | | |
| D | 395.2 | 3.5 | 369 | | | |
| E | 374.4 | 3.5 | 345 | | | |
| F | 579.3 | 3.5 | 547 | 575 | +0.75% | 24.5 |
| G | 640.5 | 3.5 | 607 | 610 | +5% | -0.5 |
| H | 712.8 | 3.5 | 676 | 660 | +8% | -19.5 |
| I | 849.6 | 3.5 | 810 | 720 | +18% | -93.5 |
| J | 982.8 | 3.5 | 940 | 780 | +26% | -163.5 |
| Total | | | | | | -225 |
That’s not all. In mid-late 2021, the U.S. was experiencing widespread labor shortages, supply-chain disruptions, and material delays (lumber, appliances, fixtures). This means that Zillow was not able to renovate and turn around its housing stock as quickly as before, leading to:
- Additional carrying costs: property taxes, insurance, utilities for the entire time period between purchase and resale.
- More pricing volatility: even if Zillow’s Zestimate was close to the resale value at the time of purchase, it might not be anymore months later, when they were finally ready to sell.
| Home | Zestimate | Renovation | Offer | Resale Price | Error | 6-mo Carrying Cost (10%) | Zillow Profit (net) |
|---|---|---|---|---|---|---|---|
| A | 308 | 3.5 | 289 | 320 | -3.75% | 32 | -4.5 |
| B | 340.8 | 3.5 | 320 | | | | |
| C | 381.3 | 3.5 | 357 | | | | |
| D | 395.2 | 3.5 | 369 | | | | |
| E | 374.4 | 3.5 | 345 | | | | |
| F | 579.3 | 3.5 | 547 | 575 | +0.75% | 57.5 | -33 |
| G | 640.5 | 3.5 | 607 | 610 | +5% | 61 | -61.5 |
| H | 712.8 | 3.5 | 676 | 660 | +8% | 66 | -85.5 |
| I | 849.6 | 3.5 | 810 | 720 | +18% | 72 | -165.5 |
| J | 982.8 | 3.5 | 940 | 780 | +26% | 78 | -241.5 |
| Total | | | | | | | -591.5 |
In Q3 2021, Zillow told its shareholders that it had bought many homes for more than it now expected to sell them for. It had to “write down” the reported value of the homes it owned by $304 million, immediately recognizing those losses in its financial results. Soon afterward, Zillow stopped entering into new home purchase contracts, and eventually shut down Zillow Offers altogether.
This example highlights a common failure mode in applied machine learning: predictive accuracy does not directly translate into economic value.
- Models are usually judged by how accurate their predictions are on average, but those predictions drive real decisions, and the outcomes of those decisions do not track average accuracy. Even a model that is right on average can lead to consistently bad results.
- Models tend to perform worst when conditions are changing or information is limited, exposing the system to extra risk that may not have been “baked in” to business logic.
- Small prediction errors can grow into large and unpredictable losses.
You might be thinking that the failure of Zillow Offers is a business error, not an engineering error, and you’d be partly right. There was definitely a business error, and it is likely that Zillow Offers was pushed to grow volume and take on inventory to show investors it was “working”. But,
- this didn’t help the Zillow Offers engineers avoid the layoffs, and
- engineering decisions determined what the model was optimized for, how its outputs turned into bids, how uncertainty was handled - and there are engineering mitigations (that the business side wouldn’t even know to ask for!) that could have helped.
I didn’t work at Zillow so I can’t say which of these were implemented or not, but some possible mitigations would be:
Predict and use price bands. The system could output multiple price quantiles (e.g., 10th, 50th, 90th percentile) and base offers on the lower quantiles to limit exposure.
In addition, the system could suppress bids when the spread between upper and lower quantiles exceeds a threshold, since wide spreads indicate high uncertainty and more risk.
This would result in fewer offers being accepted, but hopefully, more profitable transactions.
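As a sketch of this idea: assuming the model can output resale-price quantiles (all numbers, margins, and thresholds below are made up for illustration), the bidding logic might look like:

```python
def make_offer(q10, q50, q90, margin=0.06, renovation=3.5, max_rel_spread=0.15):
    """Bid off the conservative (10th percentile) resale estimate;
    suppress the bid entirely when the quantile spread signals
    too much uncertainty."""
    if (q90 - q10) / q50 > max_rel_spread:
        return None  # wide band -> high uncertainty -> don't bid
    # preserve the target margin after expected renovation costs
    return q10 * (1 - margin) - renovation

print(make_offer(q10=300, q50=314, q90=325))  # narrow band: makes an offer
print(make_offer(q10=250, q50=314, q90=380))  # wide band: None, no bid
```

The design choice here is that the system declines to act when the model is unsure, rather than always producing a bid.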
Explicitly model acceptance bias. One way to handle the acceptance bias issue is to explicitly separate two questions. First: for a given home and offer, how likely is the seller to accept? Second: among the homes that would be accepted at that offer, what profit do we expect after resale and realized costs? With enough operational data, we can train an acceptance model directly from accepted vs rejected offers, and train a profit (or resale value) model on the homes we actually acquire. We then choose offers by trading off acceptance probability against expected profit conditional on acceptance. This is a more system-aware objective: it optimizes for what actually happens in the business, not for a hypothetical world where every offer is accepted.
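A toy sketch of this tradeoff, with a hypothetical logistic acceptance curve standing in for a learned acceptance model (all parameters are made up):

```python
import math

def p_accept(offer, reservation=300, scale=5.0):
    """Toy acceptance model: a logistic curve around the seller's
    (unobserved) reservation price. In practice this would be trained
    on accepted vs rejected offers."""
    return 1 / (1 + math.exp(-(offer - reservation) / scale))

def expected_profit(offer, resale=320, costs=3.5):
    """Toy profit model conditional on acceptance."""
    return resale - offer - costs

def best_offer(offers):
    # Maximize P(accept | offer) * E[profit | accept, offer]
    return max(offers, key=lambda o: p_accept(o) * expected_profit(o))

print(best_offer(range(280, 321)))
```

Note how the optimal offer is neither the lowest (too unlikely to be accepted) nor the highest (too little profit if accepted).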
Monitor the business metrics that matter + gate systems on them. For each purchased home, we should track two versions of profit:
- Expected profit at decision time: what we thought we would make when we set the offer, based on the model’s predicted resale value and modeled costs (renovation plus expected carrying costs).
- Realized profit: what we actually made after resale, based on the realized sale price and realized costs.
We want to monitor the profit gap: realized profit minus expected profit. If we aggregate that gap over a cohort (for example, a given city and month of purchases), a well-calibrated system should have an average profit gap near zero. If the average profit gap is persistently negative, then we are systematically overestimating profitability, even if prediction metrics look good. In that case, we should automatically tighten bid limits or pause bidding for that market when the profit gap passes some threshold.
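A minimal sketch of such a gate, with an illustrative threshold:

```python
def profit_gap_gate(records, threshold=-5.0):
    """records: list of (expected_profit, realized_profit) pairs for one
    cohort, e.g. all homes bought in one city in one month.
    Returns True if bidding should continue, False if it should pause."""
    gaps = [realized - expected for expected, realized in records]
    avg_gap = sum(gaps) / len(gaps)
    return avg_gap > threshold

healthy = [(20, 22), (15, 13), (30, 31)]   # gaps roughly cancel out
sick = [(20, 5), (15, -10), (30, 2)]       # persistently overestimating

print(profit_gap_gate(healthy))  # keep bidding
print(profit_gap_gate(sick))     # pause this market
```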
Monitor + gate on drift in features and predictions. When the feedback loop is slow (sales take months!), you can still lose a lot of money while waiting for a lagging indicator like profit. The system can also be configured to pause or slow down buying when the distribution of input features and model predictions is changing relative to the data the model was trained on. Because these forms of drift often appear before realized losses, the system should treat them as early warning signals.
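One simple drift signal is the Population Stability Index (PSI), which compares the binned distribution of a feature (or of model predictions) in production against the training distribution. A sketch, using the common rule-of-thumb alert level of 0.2 (the bins and histograms below are made up):

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """PSI between a baseline histogram (training data) and a current
    histogram (serving data), both expressed as bin fractions."""
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.25, 0.25, 0.25, 0.25]   # feature histogram at training time
stable = [0.24, 0.26, 0.25, 0.25]       # looks like training data
shifted = [0.05, 0.15, 0.30, 0.50]      # mass has moved to high bins

print(psi(train_dist, stable) < 0.2)    # no alert
print(psi(train_dist, shifted) > 0.2)   # alert: pause or slow down buying
```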
Building high-quality ML systems requires more than a model with good predictions. The model and the system surrounding it need to be aligned with incentives, decision rules, uncertainty, time, and many other elements.
1.3 Prototype vs. production
In this unit, we will explore the transition from ML prototype to production deployment, and why it can be so challenging to navigate. Then, in later units, we will learn tools and techniques for bringing an ML product to production in a way that avoids that “graveyard”. The tools and frameworks we introduce are not boxes to check; they don’t magically turn a prototype into a good product. What they do is make the hard parts of production ML visible, testable, and repeatable, so we can
- notice problems early,
- debug them when they happen, and
- operate the system reliably over time.
When learning about these tools, we will try to make those problems visible first so we understand why the tool exists and how it helps. We will go component-by-component: cloud computing basics, platform and DevOps, data systems, training infrastructure, model serving, evaluation and monitoring, and safety and governance.
1.3.1 Models vs. systems
In a prototype, it is natural to treat the model as the product. In production, the model is just one component inside a larger system.
For example, imagine we want to detect fraudulent transactions. A prototype might be a notebook that trains a model on a labeled dataset. A production system has to:
- ingest transactions from the payments system in real time
- compute features consistently for training and serving
- store labels that arrive later (chargebacks) and link them back to the original transactions
- choose and justify a decision threshold (when to block, when to challenge, when to allow)
- monitor for new fraud patterns and for harm to legitimate users
- roll back when a model causes unacceptable false positives
The model code is a small fraction of what has to be built. A lot of the work is “plumbing”: data extraction, data cleaning, feature computation, integration, and operations.
Besides the implications for development and maintenance effort, this is also important because of the dependencies and feedback loops that are typical in ML systems. In traditional software, dependencies are usually relatively easy to trace (for example, we can see what libraries are imported). In ML systems, some of the most important dependencies are data dependencies: a model depends on upstream data and feature computation choices that live outside the model code, and sometimes even outside the system itself. When those inputs change, model behavior can change even if we never touch the model. Even worse, the system can create feedback loops, where predictions change user behavior, and that becomes the new data we later train on.
Furthermore, once we start composing systems out of multiple models, these hidden dependencies are not just between a model and its data, but also between models inside the same product. In a prototype, we’re often working with a monolithic model: one set of features and target variable, one training run, one model artifact, one set of metrics. But in real ML-enabled products, we often split the work into stages, with different models responsible for different steps. Sometimes this is driven by operational constraints. If the most accurate model is too slow or too expensive to run on every request, we might use a fast stage to narrow the space and a slower stage to refine the result. For example, in recommendation systems, a common pattern is candidate generation plus ranking: a fast model selects a few hundred candidates from millions, and a slower model ranks those candidates. In other settings, we might add separate models for policy filtering (for example, unsafe content) or for routing borderline cases to a human reviewer.
Companies like Netflix and Uber run hundreds or thousands of models in production, many of them interconnected. Companies whose product is not primarily technology might “only” have tens of models.
1.3.2 Predictions vs. decisions
Models produce predictions, but systems produce decisions or outcomes. A prediction is a number (a probability, a score, a class label). A decision is an action we take because we saw that number.
For example, a credit model might output “probability of default in the next 12 months.” Then, an automated system or a human who sees recommendations from the system, will act based on this number, for example:
- approve or deny the loan
- ask for more documentation
- set an interest rate
- flag the application for extra human review
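The mapping from prediction to decision is often explicit threshold logic. A sketch with illustrative thresholds (not from any real credit policy):

```python
def loan_decision(p_default):
    """Map a predicted 12-month default probability to an action.
    All thresholds here are hypothetical."""
    if p_default < 0.05:
        return "approve"
    if p_default < 0.15:
        return "request more documentation"
    if p_default < 0.30:
        return "flag for human review"
    return "deny"  # policy may forbid automatic denial; route to a human instead

print(loan_decision(0.03))
print(loan_decision(0.22))
```

Note that the model is unchanged across all four branches; the decision logic is where policy enters the system.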
The decision is where most of the risk lives. This is also where we connect the ML system to
- policy: what actions are allowed or required (for example, “we must not deny a loan automatically”)
- product design: how the output is presented to users (for example: show a confidence score, offer multiple recommendations for next steps)
- operations: what people do around the system (for example, who reviews flagged cases)
This distinction between “prediction” and “decision” is also why we can think of so many bad uses of ML that are not necessarily “bad predictions”. For example, imagine an AI detector that is advertised as “99% accurate.” That statement usually assumes a specific test set and a specific threshold that turns a score into a yes or no decision. In real use, the model might output something like “40% AI-generated” for a student essay (prediction). If a professor treats “40%” as sufficient to invoke academic integrity guidelines (decision), then harm can come from the threshold and policy, even if the model itself is well calibrated.
1.3.3 Offline metrics vs. operational and business metrics
Offline metrics are the metrics we compute during development, usually on a static labeled dataset (for example, accuracy or mean squared error). These are useful as a first indicator of model “health”, but they are not the outcomes we actually care about.
In a production system, we need additional categories of metrics:
- Operational metrics: is the system fast enough, cheap enough, and maintainable enough to operate day to day?
- Business metrics: is the system making money, saving money, or delivering user value relative to the baseline?
For example, we might build a model that gives better recommendations on the side of a product page, so offline ranking metrics improve. But the model might learn to recommend content that users click impulsively but regret later, increasing short-term engagement while decreasing long-term retention (business metric: retention). Or, running the model might add 100 ms of page load time (operational metric: latency), causing more users to abandon the page before adding anything to cart (business metric: add-to-cart conversion rate). Or, the model might be so expensive to serve that the incremental profit from extra purchases is smaller than the extra compute bill (business metric: profit margin).
Performance on these metrics can change what “best model” means: a slightly less accurate model that is much faster or cheaper might be a better choice than a more accurate model.
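In other words, model selection becomes “most accurate model that fits the operational budget,” not “most accurate model.” A sketch with made-up candidates:

```python
candidates = [
    {"name": "big",   "accuracy": 0.92, "p95_latency_ms": 180},
    {"name": "small", "accuracy": 0.90, "p95_latency_ms": 20},
]

def pick_model(candidates, latency_budget_ms=50):
    """Most accurate model among those that fit the latency budget."""
    eligible = [c for c in candidates if c["p95_latency_ms"] <= latency_budget_ms]
    return max(eligible, key=lambda c: c["accuracy"])["name"]

print(pick_model(candidates))  # "small" wins despite lower accuracy
```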
1.3.4 Static labeled data vs. dynamic data without labels
In a prototype, we often get a clean dataset with labels attached. In production, the data is live and messy: inputs arrive continuously from production systems and include missing fields, duplicates, edge cases, incomplete records, and other “pathologies” that we need to handle.
Production data is also dynamic. Sometimes the data stream changes in a very literal, operational way. For example, imagine a model that uses temperature from a weather API as an input feature. If the API changes its default units (Fahrenheit to Celsius), we can silently feed wrong numbers into the model. Or if the API stops responding (for example, billing fails) and our feature code fills missing temperature with NaN, the model might behave unpredictably. Other times, data changes when people’s behavior changes, when products change, when policies change, or when the environment shifts. This is often described as drift.
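A basic defense is to validate upstream features before they reach the model, failing loudly rather than silently feeding bad values downstream. A sketch (the plausible range is illustrative):

```python
def validate_temperature_f(value, low=-40.0, high=130.0):
    """Return the value if plausible, else raise instead of silently
    feeding NaN or wrong-unit numbers to the model."""
    if value is None or value != value:  # None or NaN
        raise ValueError("temperature missing from upstream API")
    if not (low <= value <= high):
        raise ValueError(f"temperature {value} outside plausible range")
    return value

validate_temperature_f(85.0)     # fine: 85 °F
# validate_temperature_f(29.4)   # 85 °F expressed in °C still passes a naive
                                 # range check - range checks catch gross
                                 # errors; distribution monitoring catches the rest
```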
Labels or feedback signals are often especially challenging in a production system. Sometimes, our problem has a natural ground truth. For example, if we predict how long a customer will wait on hold, we can later measure the actual wait time. Other times, we can try to collect feedback from user actions. For example, a spam filter can add a “report not spam” button (this is a type of explicit feedback), or infer a correction when a user moves an email from the spam folder to the inbox (this is a type of implicit feedback). These labels can be sparse, biased (only some users respond), or low-quality (users do not always know the true label).
Some common “pathologies” include:
- Labels arrive late: the true outcome shows up weeks or months after the prediction. Example: a loan default label might only be known after many months.
- Labels are noisy: humans disagree or policies change, so labels are inconsistent. Example: content moderation labels vary by reviewer and by policy interpretation.
- Some outcomes are never observed. Example: a medical triage model may never learn the true diagnosis for patients who leave without being evaluated.
We also have to consider the possibility of a feedback loop: the ML system itself can change the outcome we are trying to predict, and this in turn becomes part of the signal that is used as input to the model in the next training run. For example, suppose we predict which students are at risk of failing an exam in order to give them extra tutoring. At the end of the year, some students who got extra help pass. Was the prediction wrong, or did the intervention change the outcome?
One way teams handle this is by intentionally holding out a small fraction of cases - a holdout set that does not receive the intervention - so we can estimate the baseline outcome. But this comes at a cost to the user experience of the held-out cases.
1.3.5 Train once vs. continuous retraining
If the world changes and the data changes, then a model trained once will eventually become stale. Some systems can run with long update cycles on the order of months, but others need to be updated much more frequently.
Retraining is not just “run training again,” though, because when we retrain a model:
- we need a clear trigger for when to retrain (for example, on a schedule, when performance drops, or when drift is detected)
- we need to know what data the model trained on each time, and be able to reproduce it
- we need to evaluate the new model on the right datasets, including new edge cases we learned about in production
- we need a safe rollout and a rollback plan
The last item is especially important. When a model affects real users, mistakes become expensive. So when models are updated frequently, we also need a very robust process for evaluating and monitoring those models as we gradually roll them out to the user base, and a mechanism for reverting to an older “safe” model if something goes wrong.
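The retraining triggers mentioned above can be combined into simple logic; a sketch with illustrative thresholds:

```python
from datetime import date, timedelta

def should_retrain(last_trained, today, recent_metric, baseline_metric,
                   drift_score, max_age_days=30, max_metric_drop=0.02,
                   max_drift=0.2):
    """Combine the three trigger types: schedule, performance drop, drift.
    All thresholds are made-up examples."""
    if today - last_trained > timedelta(days=max_age_days):
        return True   # scheduled refresh: model is stale
    if baseline_metric - recent_metric > max_metric_drop:
        return True   # performance degraded in production
    if drift_score > max_drift:
        return True   # inputs drifted away from training data
    return False

print(should_retrain(date(2024, 1, 1), date(2024, 3, 1),
                     recent_metric=0.91, baseline_metric=0.92,
                     drift_score=0.05))  # stale model triggers retraining
```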
1.3.6 Code vs. pipelines
A prototype is often a script or a notebook that produces a single model “artifact”. A production ML system needs a pipeline: data ingestion, feature computation, training, evaluation, packaging, deployment, and monitoring.
The reason is practical: once a model is in production, it must be retrained regularly as data and conditions change, and this cannot depend on manual, ad-hoc human effort each time. If retraining requires a person to rerun notebooks, copy files, or tweak settings by hand, the process will be slow, error-prone, and impossible to scale or operate reliably.
This shift in automation level changes what the ML team delivers. In a prototype, the deliverable is “a trained model.” In production, the deliverable is source code and configuration that define an automated pipeline that can repeatedly produce, evaluate, and deploy new models from changing data without requiring human intervention at every step.
This is also where the system starts to fail in ways that feel “boring” but are extremely common. The most common failure modes for ML systems in production are not necessarily even ML-related.
For example, in one outage of a major ML pipeline at Google, a data source was copied to a new location to save resources, and the pipeline did not have permission to read the new location. The result was that the entire pipeline lost the ability to process new data. This was not an ML bug, but it took down the whole ML capability.4
1.3.7 Individual work vs. team-owned systems
A prototype is owned by one person, but a production system will almost always be owned by a team.
This forces us to move from personal artifacts to shared infrastructure. When one person owns a notebook, it is easy to keep the context in their head, keep local datasets, and rerun things locally. When a team owns a system, we need shared infrastructure: documentation, version control, reproducible environments, shared datasets with defined access, and shared observability.
This is partly about scale and convenience, but it is mostly about making change safe. If we cannot reproduce a training run, we cannot debug it. If we cannot trace where a feature came from, we cannot fix it when upstream data changes. If we cannot roll back a model quickly, we cannot recover from failures.
1.3.8 Building for one vs. designing for scale
At small scale, we can download a dataset to a laptop, train a model on one GPU, and manually inspect edge cases. At large scale, that’s just not going to work.
For example, imagine our training data is 100 TB of images and logs. We cannot keep it on a single developer machine, and we cannot even fit it on a single conventional disk. Now training depends on distributed storage and data pipelines.
Scale can break the training loop itself. Suppose we upgrade from a small model to a larger model that does not fit in a single GPU’s VRAM. We need tricks to reduce the memory required for training, or we may need to distribute training across multiple GPUs.
With large scale systems, we are likely training many different models, so we need new tools to manage and schedule all these training jobs, usually on shared infrastructure.
Serving also changes at scale. Instead of one model running on one machine, we might have dozens or hundreds of serving instances behind a load balancer. Each prediction often depends on additional context and data that lives somewhere else, and needs to be “delivered” to the right serving instance.
Finally, scale makes reliability and cost visible in day-to-day terms. If 0.1% of requests have a malformed field, that might be one weird example in a notebook, but it can be thousands of failures per day in production. If a model adds only a few cents of compute per request, that can still become millions of dollars per month at high traffic.
1.4 Case study: Booking.com
Let’s look at another case study - a positive one this time. Booking.com, the world’s largest online travel marketplace, published a report in 2019 explaining how it uses 150 machine learning models in production, affecting many parts of the user experience.5
System of models plus decision logic. The ML models shape what users see and how information is presented through ranking, filtering, and interface choices. For example, different models learn relationships such as:
- Traveler preference models: infer latent preferences (price sensitivity, importance of luxury, destination flexibility) from browsing and booking history.
- Context models: predict the trip type (business, family, weekend, long stay) from session-level signals.
- Ranking models: estimate the probability that a user will click or book a specific property.
- Content curation models: summarize reviews or descriptions.
- Content augmentation models: estimate whether a property is “good value” or whether prices are rising or falling.

A model might learn a specific conditional relationship, such as:
\[ P(\text{book} \mid \text{user}, \text{property}, \text{context}) \]
(where \(\text{user}, \text{property}, \text{context}\) can themselves be embeddings learned by a model!) and feed its output into a larger system that decides:
- which properties to show,
- in what order,
- with what supporting information,
- and in what visual form.
Offline metrics vs. business metrics. Booking found that improving offline metrics (such as log loss or AUC) often did not produce measurable business gains. A model could be better on offline metrics while having no effect on conversion (the important business metric), or even harming it.

This happens for several reasons:
- Sometimes model improvements occur where the old and new models already agree, so rankings do not change. (For example: both models rank a property highly, 95% vs 93% - in either case, it is listed first.)
- Models are trained on proxy labels (clicks) that are imperfect substitutes for business goals (completed bookings). (For example: showing too many similar properties might make the user click on them, then get frustrated by trying to decide which one to book, and give up.)
- Improvements may only affect a small segment of users.
- Latency and operational costs can offset gains from higher accuracy.
Because of this, they treated offline metrics as filters, not success criteria. A model with better offline performance was only considered a candidate. The final evaluation was always done using randomized controlled experiments that measured business outcomes such as bookings and revenue.
It would not be possible to assign individual bookings to specific model predictions, since many models influence the same user journey (ranking, UI, content, and context models all interact). Instead, they attribute outcomes through experiment design. For each model, they run randomized controlled trials in which one group of users sees the current version of the system and another sees the system with that model changed. The model’s impact is measured as the difference in business outcomes between these groups:
\[ \Delta = \mathbb{E}[\text{bookings} \mid \text{new model}] - \mathbb{E}[\text{bookings} \mid \text{old model}] \]
This isolates the effect of that model’s decisions, even though no single prediction can be directly linked to a booking.
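A sketch of estimating \(\Delta\) from such an experiment, using toy per-user booking counts and a rough normal-approximation confidence interval (not Booking.com's actual methodology):

```python
import statistics as st

def ab_delta(control, treatment):
    """Difference in mean bookings per user, with a rough 95% CI
    based on the normal approximation."""
    d = st.mean(treatment) - st.mean(control)
    se = (st.variance(control) / len(control)
          + st.variance(treatment) / len(treatment)) ** 0.5
    return d, (d - 1.96 * se, d + 1.96 * se)

control = [0, 1, 0, 0, 1, 0, 1, 0]     # bookings per user, old model
treatment = [1, 1, 0, 1, 1, 0, 1, 0]   # bookings per user, new model

delta, ci = ab_delta(control, treatment)
print(delta, ci)
```

With samples this small the interval is wide; real experiments need enough traffic for the interval to exclude zero before declaring a winner.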
However, the nature of their system meant that they had to design these experiments very carefully, to make sure that the “signal” was not lost in the noise.
Selective triggering to avoid dilution. A common practical problem is dilution: the model might only be invoked for some requests (for example, only when enough features are available), or the model change might only affect a small subset of traffic. If we randomize over all users, the measured effect can look like noise because most of the “experiment sample” never experiences a difference. One way to address this is to keep the user-level randomization, but only apply the treatment when the model is available and actually triggered for that request. Then we analyze impact on the triggered subset.
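To make dilution concrete, here is a small simulation sketch (all numbers are made up for illustration): the treatment lifts booking probability from 10% to 12%, but only for the 10% of users where the model triggers. Measuring over all users nearly washes the effect out; measuring over the triggered subset recovers it.

```python
import random

random.seed(0)

def simulate(n_users=200_000, trigger_rate=0.10, p_base=0.10, p_treated=0.12):
    """Simulate an A/B test where the model only triggers for some users.

    Hypothetical numbers: treatment lifts booking probability from 10%
    to 12%, but only among the ~10% of users where the model triggers.
    """
    results = {"all": {"A": [], "B": []}, "triggered": {"A": [], "B": []}}
    for _ in range(n_users):
        arm = random.choice("AB")           # user-level randomization
        triggered = random.random() < trigger_rate
        p = p_treated if (arm == "B" and triggered) else p_base
        booked = 1 if random.random() < p else 0
        results["all"][arm].append(booked)
        if triggered:
            results["triggered"][arm].append(booked)
    return results

def lift(group):
    mean = lambda xs: sum(xs) / len(xs)
    return mean(group["B"]) - mean(group["A"])

r = simulate()
# The true lift among triggered users is ~0.02, but diluted over all
# users it shrinks to ~0.002 and is much harder to detect.
print(f"measured lift over all users:  {lift(r['all']):+.4f}")
print(f"measured lift over triggered:  {lift(r['triggered']):+.4f}")
```

The analysis over the triggered subset recovers a much larger (and more detectable) effect from the same experiment.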
Triggering on model output. Another pattern is that the system only acts when the model output meets some criteria (for example, a score crosses a threshold). In that case, we can design experiments so that one treatment group applies the model-driven action when triggered, while another group invokes the model but does not apply the action. This lets us separate “the model ran” from “the model changed behavior” when estimating business impact.
Comparing models where they disagree. If two models produce identical outputs for most traffic, we won’t get much of a signal from comparing their outputs most of the time. One way to get a higher-signal comparison is to trigger the experiment only on requests where the models disagree, and then route one treatment group to model A’s output and another to model B’s output. This focuses the experiment on the slice of traffic where model choice can actually change the user experience.
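A minimal sketch of this routing logic (the hash-based assignment and toy models are illustrative assumptions, not any specific production design): requests where the models agree are served normally and excluded from the experiment; only disagreements are enrolled and routed by arm.

```python
import hashlib

def assign_arm(user_id: str) -> str:
    """Deterministic 50/50 user-level assignment by hashing the user id."""
    h = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "A" if h % 2 == 0 else "B"

def serve(user_id, request, model_a, model_b, experiment_log):
    """Enroll a request in the experiment only if the two models disagree;
    agreeing requests get the shared output and are not logged."""
    out_a, out_b = model_a(request), model_b(request)
    if out_a == out_b:
        return out_a  # no possible signal here, so don't dilute the experiment
    arm = assign_arm(user_id)
    experiment_log.append((user_id, arm))
    return out_a if arm == "A" else out_b

# Toy models that agree on even inputs and disagree on odd ones:
model_a = lambda x: x % 2
model_b = lambda x: 0
log = []
outputs = [serve(f"user{i}", i, model_a, model_b, log) for i in range(6)]
print(len(log))  # only the odd (disagreeing) requests are enrolled
```

Hashing the user id keeps assignment stable across requests, so the same user always sees the same model wherever disagreements occur.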
These patterns are still randomized experiments, but they are designed around the fact that ML systems often have conditional logic and many interacting components. They are necessary to measure business impact where the model actually makes a difference, rather than averaging the effect away across a large volume of unaffected traffic.
In addition to comparing models directly, they also did experiments to separately isolate the effect of operational metrics. For example, they ran an experiment in which they artificially increased the latency of a model used in ranking and found that a roughly 30% increase in latency reduced conversion by about 0.5%. Even if a slower model was more accurate offline, the added delay could make the system worse on balance.

This result led them to implement many techniques to reduce the latency of their ML system, in addition to preferring models with lower latency!
Many Booking models operate in regimes where true labels are delayed or noisy. For example, whether a user finds a review helpful or whether they cancel a booking may only be known weeks later. In some cases, the system never observes a clean label at all.
To deal with this, Booking monitored not only business metrics but also the distribution of model outputs in production. For binary classifiers, they found that healthy models often produce smooth, bimodal output distributions, with mass near 0 and 1. When distributions collapsed toward 0.5, became jagged, or shifted abruptly, this was an early warning that the model was operating outside its trained regime.

This helped address the issue of limited and delayed feedback - if there is a major issue, there is an early signal.
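A simple version of this kind of output-distribution check can be sketched as follows (the 0.3–0.7 "ambiguous band" and the 20% alert threshold are illustrative choices, not values from the Booking.com paper):

```python
def fraction_uncertain(scores, low=0.3, high=0.7):
    """Fraction of binary-classifier outputs in the ambiguous middle band.

    A healthy bimodal distribution (mass near 0 and 1) keeps this small;
    a collapse toward 0.5 pushes it up.
    """
    mid = [s for s in scores if low < s < high]
    return len(mid) / len(scores)

def check_health(scores, alert_threshold=0.20):
    """Return ("ALERT", frac) if too much probability mass sits near 0.5."""
    frac = fraction_uncertain(scores)
    return ("ALERT" if frac > alert_threshold else "OK", frac)

healthy = [0.02, 0.05, 0.9, 0.95, 0.97, 0.1, 0.88, 0.93]
collapsed = [0.45, 0.5, 0.52, 0.48, 0.55, 0.4, 0.6, 0.51]
print(check_health(healthy))    # ('OK', 0.0)
print(check_health(collapsed))  # ('ALERT', 1.0)
```

A real monitor would track this statistic over sliding windows and alert on abrupt shifts, but even this crude check fires long before delayed labels arrive.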
This case illustrates what it looks like to build a production ML system where models are components, predictions are inputs to decisions, and success is measured by real-world outcomes. Note that the goal is not to build a perfect model once, but to operate a system that can adapt over time.
1.5 Designing ML products and features
All of these ways in which an ML prototype and an ML system differ stem from one fundamental distinction: the ML system, unlike the prototype, is meant to create real value for people in the real world, as part of a product or service. So, the first part of the design of an ML system is the design of that product or service. Once the product design clarifies “who, what, why”, we can attempt a system design that answers “how”.
1.5.1 The product informs the system
When we design just a model, it can be OK to be a little bit non-specific about the product. In many cases, the same modeling choices can apply across a range of applications. But we have to be specific when we design the overall ML system, because the design will depend on the context of the entire system.
For example, consider “recommending books to a user”. We might want book recommendations in these two very different cases -
- making recommendations to a user of a large online audiobook service
- recommending books to users of a public library
The modeling problem might look similar in both cases - we get some input about the user’s past reading activity and we want to identify other books the user will like - but the system design is very different:
| Question | Online audiobook service | Public library |
|---|---|---|
| How fast do things change? | Fast: new sessions, clicks, and purchases arrive continuously, and the home page updates per request. | Slower: we usually get new information when a user visits the library and checks out books. |
| What data do we have available? | Fine-grained interaction logs (search, clicks, listens, refunds), plus device and session context. | Borrow history and holds, plus some context (for example, preferred branch location). |
| Operational requirements? | Low latency and high availability: recommendations must return quickly on every page load. | Fits library workflow: e.g. batch-generated recommendations for a weekly email, recommendations printed on user’s checkout receipt. |
| What metrics matter? | Business metrics like conversion to purchase, completed listens, retention, and refunds. | Patron metrics like checkouts, holds, renewals, and long-term engagement with the library. |
| What is the decision? | What to show on a home page. | What to email, or what to suggest during a checkout interaction. |
| What are the privacy and risk constraints? | User history can be sensitive. | Policies and legal constraints may limit use of patron data. |
As illustrated by this example, while “recommend books” is a modeling task, the system design problem should be “recommend books for this product, for these users, with these constraints.”
With this in mind, let’s spend a little bit of time talking about ML product design, which sets that context. This part of the ML system lifecycle is usually led by a product manager (possibly with other non-engineering team members), but ML engineers are involved in order to advise on what is feasible, measurable, and safe.
1.5.2 What can be an ML product?
First of all, let’s broaden our perception of what can be an ML product. We will describe a few different ways in which a product or feature can use ML.6
Critical vs. complementary. Our first instinct is probably to think about a totally new product that is built around an ML model, like:
- An AI tutor that teaches a subject, generates practice problems, and adapts lessons based on mistakes
- A radiology decision support product that reads chest X-rays and returns findings
- A claims processing product that reads medical bills and decides approve/deny/request-more-info
But, some of the most successful uses of ML come when ML features are integrated into existing non-ML products, to add value to something that is already used by an existing audience. For example:
- A popular open-source media player adds live AI-generated speech-to-text subtitles in any language (this has been in progress in VLC for a couple of years…)
- A photo gallery app adds automatic duplicate cleanup and “best shot” selection
- A collaborative document editing tool adds grammar/style suggestions powered by AI
In the first group of products, the ML feature is critical - the product or service doesn’t exist without ML. In the second group of products, the ML feature is complementary, so that the underlying product or service is still valuable.

Critical ML feature: the product is an AI agent handling customer support requests; without the model, this specific product doesn’t exist

Complementary ML feature. the photo gallery is valuable without ML; the model adds an optional organization capability (grouping similar photos and selecting the best shot for the “top” of the stack).
Automation vs. augmentation. Another key distinction is whether the ML system is meant to replace a human step or support a human step. In an automation product, the system does the work and the human mostly sees the outcome. For example:
- An insurance workflow automatically approves simple claims and routes suspicious ones to investigation
- A bank automatically blocks some transactions that match fraud patterns and challenges others with extra verification
- A content moderation system automatically removes clear policy violations at high volume, without waiting for a human reviewer
In an augmentation product, the system suggests, drafts, or highlights, but a person remains responsible for the final decision. For example:
- A radiology tool highlights regions of a chest X-ray to double-check, while the radiologist writes the report
- A customer support assistant drafts a reply and pulls relevant policy snippets, while the agent edits and sends
- An accountant uses an invoice extraction tool that fills in fields in a form, which the accountant reviews and edits before saving
Automation can be faster and cheaper, but it usually needs stricter guardrails and safer fallbacks. Augmentation can be safer in high-stakes settings, but it depends on the design: people need to understand uncertainty and how to correct mistakes.

Automation. the system takes an action (route to spam / warn) without asking first; the user mainly sees the outcome and has a recovery path (e.g., “Not spam”).

Augmentation. the system proposes an in-context draft (a suggested completion), but the human stays in control by accepting, editing, or ignoring it.
Visible vs. invisible. We should also think about both visible and invisible ML features. In a visible ML feature, users directly see some version of the ML model output. For example:
- A streaming app shows “Recommended for you” rows
- A keyboard shows next-word suggestions as we type
- A photo app shows auto-generated tags or captions we can accept or edit
In an invisible ML feature, the system silently changes behavior to improve an experience, but users may not even be aware that ML is used. For example:
- A camera app automatically improves photos (noise reduction, HDR, low-light enhancement) without asking
- A video conferencing app removes background noise and echo automatically
- A map app silently corrects GPS drift and snaps location to the road to improve navigation

Visible ML. the model output is user-facing (e.g., a recommendation row like “compatible items”), so users can directly judge and interact with it.

Invisible ML. the model changes system behavior (audio cleanup) and the value is in the improved experience, not in a prediction surfaced to users.
Proactive vs. reactive. We should also think about whether an ML feature is reactive or proactive. In a reactive feature, the system waits for us to ask a question or take an action, and then uses ML to help in that moment (for example, when we type, search, or upload something). For example:
- Search autocomplete suggests queries as we type
- A support assistant drafts a reply when a customer service agent opens a ticket
- A shopping site shows “similar items” after we click into a product page
In a proactive feature, the system initiates: it surfaces suggestions, alerts, or actions without an explicit request. For example:
- A video streaming app sends a push alert telling the user that a TV show they may like starts next week
- A phone shows “time to leave” notifications based on calendar events and current traffic
- A grocery delivery app notifies the user that they may be low on milk based on typical consumption and time since their last purchase, and offers to add it to their cart

Reactive ML. the system responds to a user-initiated action (typing a query), and the ML output appears in that interaction.

Proactive ML. the system initiates an alert from context (calendar + traffic) without an explicit request in that moment.
These different types of ML features also have different requirements and challenges in system design. The critical or complementary “axis” makes a big difference especially with respect to how reliable the ML feature needs to be. Visible features let users form an opinion and provide feedback through interaction, but invisible features can be harder to evaluate because users may not even know they exist. Proactive outputs can feel intrusive or distracting if they are low quality, so they usually require higher precision, better context, and clearer ways to dismiss or control them.
1.5.3 Monetization
The first stage of ML product or feature design is also a good time to think about the economics of the feature. In a prototype, we are not really concerned with monetization, meaning how the system creates revenue or reduces costs. But in a production system, we are very concerned with this.
If our product is a new system built around the ML model, we have a few potential approaches to monetization -
- we might try usage-based pricing.
- or we might use a subscription model with tiers, in which we might offer an AI feature as a premium add-on or cap how much usage is included.

Usage-based pricing. customers pay per unit of usage (e.g. requests).

Subscription tiers. plans bundle access (often with included usage).
These decisions also have implications for system design - when we choose usage-based pricing versus subscription tiers, we create different incentives and constraints that shape the entire system:
Usage-based pricing creates pressure on the user to minimize the number of requests:
- users are incentivized to batch requests, cache results, and reuse computations.
- they may appreciate rate limiting or quotas to prevent abuse or runaway bills.
- the system must accurately meter and log every request for billing.
- pricing transparency becomes critical: users need to understand what they will be charged.
Subscription tiers create different tradeoffs:
- users are less concerned with minimizing requests, which can increase our serving costs.
- we have to optimize for user satisfaction and retention: keeping subscribers active and happy.
- we need to track usage to enforce tier limits, but not for billing.
- the system needs monitoring to detect when a tier boundary is too restrictive or too generous, since dissatisfied users will leave.
Hybrid models are also common: a free tier or “freemium” model (with limited usage), usage-based overage charges, or tiered subscriptions with per-unit costs above a threshold.
When an ML feature is only complementary to the overall product, it still has to justify itself. In indirect monetization, we might never put a price on the AI feature at all, but we still need to show that it e.g. increases subscriptions, drives more purchases, reduces time spent by customer support (and therefore cost), or keeps users engaged so that they view more ads.
Of course, in either case, we need to make sure the unit economics work out, considering the cost of the product or feature.
Pick a clear unit (for example: per request, per document, per user per month) and estimate what contributes to profit per unit vs cost per unit. Some examples of profit and cost elements -
| Profit per unit (value) | Cost per unit |
|---|---|
| Price charged directly to users | Compute costs |
| More people buy products (conversion) | Data storage |
| More people keep their subscriptions (retention) | Human review time |
| Less staff time needed (cost saved) | Refunds / chargebacks / mistakes |
| | Build cost, amortized (engineering time / units over time) |
| | Maintenance cost, amortized (on-call, fixes, retraining) |

Margin per unit = (profit per unit) - (cost per unit)
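As a worked sketch of this arithmetic (all dollar figures and the invoice-extraction scenario are hypothetical):

```python
def margin_per_unit(value, costs, build_cost=0.0, maintenance_cost=0.0,
                    expected_units=1):
    """Margin per unit = value per unit - (direct + amortized cost per unit).

    `value` and `costs` are lists of per-unit dollar amounts; one-time build
    and recurring maintenance costs are amortized over `expected_units`.
    """
    amortized = (build_cost + maintenance_cost) / expected_units
    return sum(value) - (sum(costs) + amortized)

# Hypothetical per-document invoice-extraction feature:
m = margin_per_unit(
    value=[0.50],              # staff time saved per document
    costs=[0.02, 0.01],        # compute + storage per document
    build_cost=50_000,         # one-time engineering cost
    maintenance_cost=10_000,   # on-call, fixes, retraining over the window
    expected_units=1_000_000,  # documents over the amortization window
)
print(f"margin per document: ${m:.3f}")  # margin per document: $0.410
```

Note how sensitive the result is to `expected_units`: at 100,000 documents instead of a million, the amortized build cost alone would consume most of the margin.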
1.5.4 Differentiation
For a product that is offered directly to consumers, it is also important to think about differentiation - how the product or service will appeal to customers to choose it over the competitors. Many AI products look similar on the surface, especially when they are built on top of the same foundation models. A useful way to think about differentiation is to separate three broad sources of advantage in AI products: technology, data, and distribution.7
Technology advantage means you can do something meaningfully better on the same task: higher quality, lower latency, lower cost, better reliability, better safety, better UX integration, or capabilities competitors cannot easily replicate. This can come from model improvements (fine-tuning, distillation, routing, tool use), system improvements (caching, batching, fallbacks), or domain-specific product design that makes the output more useful.
Data advantage means you have access to signals others cannot easily get: proprietary labeled examples, domain context, high-quality corrections, and usage feedback loops. Even when user data cannot be used to train model weights directly, it can still improve the system (for example, better evaluation sets, prompt templates, routing rules, and guardrails) over time.
Distribution advantage means you can reach users and fit into workflows more effectively: an existing installed base, default placement in a platform, trusted brand, or deep integration into the place the work already happens (email, docs, IDEs).

Technology advantage. better model + better agent/system behavior (tool use, planning, edit loops) can make the same task meaningfully faster or higher quality, even without proprietary data or built-in distribution.

Data advantage. a legal AI assistant can be differentiated by licensed/proprietary data plus decades of editorial enrichment (citations, taxonomy), which competitors can’t easily replicate.

Distribution advantage. the AI feature ships inside an existing workflow (Docs), so it can reach users by default without requiring them to adopt a separate product.
Some new AI products today are “thin wrappers” around a foundation model: the application does not update model weights at all, and the core logic is a prompt plus a system that calls an LLM and returns the output. This can be a great way to prototype quickly. But a low barrier to entry cuts both ways: if it is easy for us to build, it is also easy for competitors to copy, or for a bigger product to absorb it as a feature and effectively capture the entire market. That is why, for many teams building these “wrappers”, the most realistic long-term advantage is often a data advantage: getting to market early and collecting usage signals (corrections, failures, and patterns of use) that help you iterate the system and product over time.
Building on top of foundation models also means we are effectively providing a layer on top of a rapidly improving base. If the underlying model expands in capabilities, it can make our application obsolete. For example, a PDF parsing product might be compelling if it is built on the assumption that general-purpose chat models cannot reliably parse PDFs or cannot do it at scale, but it becomes harder to defend if models improve and that assumption stops being true. (Even then, we can build around open models and self-hosted deployments and possibly still appeal to users who need to host models in-house for privacy, compliance, or cost reasons.)

Wrapper risk: a standalone app (like Otter.ai) can lose differentiation when the incumbent meeting platform bakes transcription/summaries into the product.

Distribution advantage: the incumbent (Zoom) can ship an AI summary feature to a massive audience without users adopting a separate tool.
1.5.5 Specifying the design
With all these product design considerations in mind, let’s lay out the product or feature details that we will want to specify in order to then move on to system design. To design a useful ML product or feature, we should frame this in a human-centered way: who will use it, what are their needs, and how can we support those needs? 8
Background. Identify the user, the goal, and the current workflow. Look for the intersection of user needs and ML strengths (where rules are brittle and manual work does not scale).
- users: who uses it, and in what context?
- goals: what outcome do they want?
- pains: what is currently hard, slow, or error-prone?
- workflow: what happens step by step, and where does a decision happen?
Value proposition. State what we are building, why it helps, what the baseline would look like without ML, and why ML is potentially justified (instead of a rules-only or manual solution).
- product: what are we shipping?
- why it helps: what pain does it remove, and what gets better for the user?
- baseline: what happens today without the ML feature? What would a non-ML version of the feature look like?
- why ML: what is hard to do with fixed rules or manual review?
- unique value: what does ML let us do better (or cheaper) than the baseline?
Solution. Specify the behavior, constraints, and what we optimize.
- behavior: what does the feature do on the “happy path”?
- role: what does the system do vs what does the human do (preview, edit, undo, escalate)?
- constraints: cost, latency, privacy, safety.
- uncertainty: what happens when the model is unsure?
- reward function: what do we optimize for? what types of errors can occur, and what are their costs?
Feedback. Define how the system learns from real use, including explicit feedback, implicit feedback, and how corrections flow back into training and evaluation.
- explicit feedback: what explicit signals can we get from users?
- implicit feedback: what implicit behavior signals do we log that help us understand real use?
- corrections: what changes immediately when feedback is received?
Feedback can come in many forms! Here are some products to show how feedback can be explicit or implicit, positive or negative, immediate or delayed.
- Explicit +/- label (translation): Google Translate thumbs up/down. Explicit, immediate, signed feedback (+/-) on a model output. Often used to route examples for review and to improve evaluation/training over time.
- Explicit label correction (discriminative): Gmail Spam/Not spam. Explicit correction of a classifier decision (negative vs positive label). High-value training signal, but biased toward users who bother to correct.
- Explicit accept/reject (generative): Gmail Smart Compose (also has an implicit productivity signal: time spent composing emails). Explicit interaction with generated text (accept, ignore, edit). Acceptance rate is a natural feedback signal for generative features.
- Explicit same/different label (clustering): Google Photos face grouping. Explicit verification label for a clustering/identity task (same person vs different), collected only on ambiguous cases.
- Positive label (ranking): Add to cart. Stronger signal than clicks, but still context-dependent and biased by what was shown.
- Recovery signal (generative): undo after rewrite. Implicit negative/correction signal: undo/revert indicates the model output harmed the task. Also makes mistakes recoverable.
- Delayed negative label (shopping): returns/refunds. Delayed negative feedback: the label arrives after the decision (returns, refunds). Signal for long-term quality but slow to learn from and easy to misattribute.
- Implicit engagement signal (feed ranking): stop/slow scrolling. Implicit feedback used as a weak label (watch time, replays, scroll speed).
Success. Define success metrics and failure modes. Include second-order effects (how people might adapt once the feature exists).
- success: what would we measure to decide it worked? (note that proxy metrics can lie. For example, “more clicks” can go up because headlines got more sensational, even if trust and long-term retention go down.)
- failure: what goes wrong?
- guardrails: what thresholds cause the system to slow down, fall back, or stop?
- “perfect optimization” test: if we optimized our reward function perfectly, what bad behavior would it encourage? (For example, if a marketplace ranks listings by predicted conversion, sellers may change titles and photos to trick the ranking model, increasing returns and complaints.)
Feasibility. Check whether we can build and measure this safely with the resources we have.
- data: do we have labeled examples that match production, including the hard cases?
- measurement: can we measure success and detect failure in production?
- cost: can we afford training and inference at our scale?
1.6 Hypothetical example: GourmetGram
In the rest of this course, we will return to a running hypothetical product called GourmetGram: an existing online community where people share food photos (currently, without ML). Now, suppose we want to implement an automatic category tagging feature. Let’s write out the product design using the structure described above.
Background (user, goals, pains, workflow).
- users: posters (uploaders), viewers (browsers), and moderators.
- goals: browse by category (for example, ramen, vegan desserts) and keep the feed on-topic.
- pains: categories are optional and manual today, so many posts end up untagged; free-text hashtags are inconsistent (“vegan”, “plantbased”, “vegann”), making category pages/search noisy; some users game tags for reach; moderators spend time on obvious off-topic uploads that could be caught earlier.
- workflow (today): upload photo -> optionally pick a category / add hashtags -> post enters feed/search -> viewers browse feed + category pages -> moderators handle reports and remove off-topic content.
Value proposition (what we ship and why it helps).
We will add one ML-backed feature: automatic category tagging at upload time. For each uploaded food photo, the system suggests a category from an approved list (for example, pizza, sushi, salad, cake). This reduces missing/inconsistent tagging, improves category browsing and search filters, and helps route likely non-food uploads to moderation review.
A rules-based tagging approach (for example, keyword matching on captions/hashtags) is not suitable. The signal is mostly in the image, which doesn’t lend itself to a rules-based approach.
Solution (behavior, constraints, uncertainty).
- behavior: predict a single category for each uploaded image.
- constraints: keep latency and cost low enough to run on every upload.
- uncertainty: if confidence is below a threshold, require the user to pick a category (with suggestions).
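The uncertainty behavior above can be sketched as a small decision function (the 0.80 threshold and the number of suggestions are illustrative choices, not tuned values):

```python
def tag_or_ask(category_probs: dict, threshold: float = 0.80):
    """Auto-apply the top category only when the model is confident;
    otherwise fall back to asking the user, seeded with top suggestions.
    """
    top = sorted(category_probs.items(), key=lambda kv: kv[1], reverse=True)
    best_cat, best_p = top[0]
    if best_p >= threshold:
        return {"action": "auto_apply", "category": best_cat}
    # Low confidence: require the user to pick, but suggest likely options
    return {"action": "ask_user", "suggestions": [c for c, _ in top[:3]]}

print(tag_or_ask({"sushi": 0.95, "salad": 0.03, "pizza": 0.02}))
print(tag_or_ask({"sushi": 0.40, "salad": 0.35, "pizza": 0.25}))
```

In practice the threshold would be chosen from validation data by trading off auto-apply coverage against mis-tag rate.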
Feedback (what signals we learn from).
- explicit: user changes the suggested category; moderator marks “not food” / “off-topic” or removes the post.
- implicit (downstream): engagement after the category is applied (do viewers click into the category page, keep browsing within it, save/share, or quickly bounce?). These signals are only weak evidence about correctness: low engagement might mean the tag is wrong, but it can also mean the photo is low quality or the category itself is unpopular; high engagement might mean the tag is right, or just that the photo is interesting.
- corrections: user/moderator corrections update the displayed category immediately and are stored for retraining and audits.
Success (metrics, failure modes, guardrails).
- success: higher share of posts with a valid category; more usage of category pages/filters; fewer moderator actions on off-topic uploads; lower rate of post-launch category edits.
- failure: confidently mis-tagging (misleading category pages); over-reliance on automation causing users to stop tagging carefully.
- guardrails: if low-confidence rate rises or off-topic removals rise, stop auto-applying and switch to “suggest + require user pick”.
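A minimal sketch of such a guardrail, assuming we track low-confidence predictions over a sliding window (the window size and 30% limit are illustrative, not from a real deployment):

```python
from collections import deque

class Guardrail:
    """Switch from auto-apply to suggest-only mode when the low-confidence
    rate over a sliding window of recent predictions exceeds a limit."""

    def __init__(self, window=1000, max_low_conf_rate=0.30):
        self.recent = deque(maxlen=window)  # True = low-confidence prediction
        self.max_rate = max_low_conf_rate

    def record(self, low_confidence: bool):
        self.recent.append(low_confidence)

    def mode(self) -> str:
        if not self.recent:
            return "auto_apply"
        rate = sum(self.recent) / len(self.recent)
        return "suggest_only" if rate > self.max_rate else "auto_apply"

g = Guardrail(window=10)
for _ in range(10):
    g.record(False)
print(g.mode())  # auto_apply
for _ in range(5):
    g.record(True)   # a burst of low-confidence predictions
print(g.mode())  # suggest_only
```

The same pattern extends to the off-topic-removal signal: any monitored rate crossing its threshold flips the system into the safer fallback mode until the rate recovers.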
Feasibility. Since we already operate this system, we have data available on which to train. We can use moderator-reviewed samples with “approved” category labels. A typical sample might look like this:
{
  "post_id": 184225,
  "created_on": "2026-01-07 19:14:22",
  "user_id": 91273,
  "image_id": "img_6f3d2a9c",
  "caption": "Made these spicy tuna rolls at home for the first time",
  "category": "sushi"
}

1.7 Review
The most important mindset shift in designing ML systems is to treat the model as just a part of a decision-making system. The model produces a prediction, but the system turns that prediction into an outcome. “Good predictions” can still lead to bad outcomes.
This also changes how we evaluate and operate ML. Offline metrics are useful, but they do not capture operational constraints (cost, latency, on-call burden) or the business outcomes we care about (retention, conversion, profit). Production data is live and messy, labels are delayed or biased, and pipelines fail for boring reasons. More than any accuracy number, the case studies in this chapter show that end-to-end impact, monitoring, and careful rollouts matter for success.
In the next units, we will introduce tools and frameworks for the ML product lifecycle, that address some of the specific problems we have mentioned (scale, versioning, monitoring, etc.).
1.8 Key terms
- monolithic model: one-model prototype design.
- operational metrics: whether the system is runnable day to day.
- business metrics: whether the system delivers value relative to the baseline.
- drift: when the world changes so the model’s assumptions stop holding.
- natural ground truth: a label we can measure directly after the fact.
- explicit feedback: direct corrections or ratings.
- implicit feedback: behavior signals used as labels or training data.
- feedback loop: when interventions change the data we later learn from.
- holdout set: a control slice to estimate the baseline outcome.
- lagging indicator: a metric that often updates too late to prevent the problem.
- rollout: a safer way to launch changes.
- rollback: going back to a safe version.
- pipeline: the repeatable steps that produce and deploy models.
- automation: delegating work to the system.
- augmentation: helping a person while keeping them in control.
- usage-based pricing: pricing tied to usage volume.
- tiers: feature packaging that limits and prices access.
- indirect monetization: value via an existing revenue channel.
- unit economics: whether per-unit value exceeds per-unit cost.
Aditya Challapally, Chris Pease, Ramesh Raskar, Pradyumna Chari. July 2025. “State of AI in Business 2025 Report.” https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf↩︎
D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2 (NeurIPS’15), Vol. 2. MIT Press, Cambridge, MA, USA, 2503–2511. https://papers.nips.cc/paper_files/paper/2015/hash/86df7dcfd896fcaf2674f757a2463eba-Abstract.html↩︎
Florent Buisson. “Is Zillow ‘Cursed?’ A Behavioral Economics Perspective.” November 23, 2021. https://medium.com/data-science/is-zillow-cursed-a-behavioral-economics-perspective-5b5165bb085b↩︎
Daniel Papasian and Todd Underwood. 2020. How ML Breaks: A Decade of Outages for One Large ML Pipeline. In Proceedings of the 2020 USENIX Conference on Operational Machine Learning (OpML ’20). USENIX Association. https://www.usenix.org/conference/opml20/presentation/papasian↩︎
Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1743–1751. https://doi.org/10.1145/3292500.3330744↩︎
Apple Developer Documentation. 2023. Human Interface Guidelines: “Machine learning.” https://developer.apple.com/design/human-interface-guidelines/machine-learning↩︎
Chip Huyen. 2025. AI Engineering: Building Applications with Foundation Models. O’Reilly Media. ISBN: 978-1098166304↩︎
Google PAIR. People + AI Guidebook. “User Needs + Defining Success” (Chapter 1). Published May 8, 2019. Third edition updated April 2025. https://pair.withgoogle.com/guidebook/↩︎