Case study · Predictive modeling

ProphetCMA — Real Estate Price Prediction

A machine-learning approach to automating Comparative Market Analysis (CMA), built around ensemble modeling and honest error reporting.

MAE: 6,776

RMSE: 11,166

Blend: 15 / 85 (XGBoost / GBM)

Stack: R (caret + xgboost)

Problem

Pricing homes from messy comps

Comparative Market Analysis is the manual process real-estate agents use to price a home from recent comparable sales. It’s slow, expert-dependent, and uneven across price tiers. The goal: an ML model that estimates fair market price from objective property features (square footage, beds/baths, acreage, property type) so the human stays in the loop on judgment, not arithmetic.

Approach

Compare, blend, then audit the errors

I trained and compared five supervised models on a cleaned dataset of residential properties: Linear Regression, k-Nearest Neighbors, Random Forest, Gradient Boosting Machines (GBM), and XGBoost. Each was scored under cross-validation on Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).
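For readers less familiar with the two metrics, here is a minimal sketch of how they are computed. The project itself ran in R; Python is used here only for illustration, and the prices are hypothetical values chosen to make the arithmetic easy to follow, not data from the project.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in price units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: squares each miss first, so big misses count more."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Hypothetical sale prices and model predictions (illustration only).
actual = [140_000, 150_000, 170_000, 240_000]
predicted = [141_000, 149_000, 163_000, 247_000]

print(mae(actual, predicted))   # two misses of 1,000 and two of 7,000 average to 4000.0
print(rmse(actual, predicted))  # 5000.0, pulled above the MAE by the two larger misses
```

RMSE can never be smaller than MAE on the same predictions, so a gap between the two is a rough signal that a minority of homes are being missed by a wide margin.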

The winning model was a weighted blend of the two boosted models’ predictions, 15% XGBoost and 85% GBM, with the weights chosen by grid search. The blend smoothed XGBoost’s aggressive splits while keeping its sensitivity to feature interactions.
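A grid search over blend weights can be sketched roughly as follows. This is not the project's R code: the function names, the 0.05 step size, and the choice of MAE as the selection metric are all assumptions, and the toy predictions are constructed so that the optimum lands at a 15% XGBoost weight.

```python
def mae(actual, predicted):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def blend(w_xgb, xgb_preds, gbm_preds):
    """Weighted average of the two models' predictions; GBM gets 1 - w_xgb."""
    return [w_xgb * x + (1 - w_xgb) * g for x, g in zip(xgb_preds, gbm_preds)]

def best_blend_weight(actual, xgb_preds, gbm_preds, n_steps=20):
    """Try XGBoost weights 0, 1/n_steps, ..., 1 and return (weight, MAE) of the best."""
    candidates = [i / n_steps for i in range(n_steps + 1)]
    scored = [(mae(actual, blend(w, xgb_preds, gbm_preds)), w) for w in candidates]
    best_mae, best_w = min(scored)
    return best_w, best_mae

# Toy held-out data (hypothetical): GBM runs 3,000 low, XGBoost runs 17,000 high.
actual = [150_000, 180_000, 210_000]
gbm_preds = [147_000, 177_000, 207_000]
xgb_preds = [167_000, 197_000, 227_000]

weight, score = best_blend_weight(actual, xgb_preds, gbm_preds)
print(f"XGBoost weight: {weight:.2f}, GBM weight: {1 - weight:.2f}, MAE: {score:.2f}")
```

The weights should be scored on predictions for homes neither model saw during training (for example, out-of-fold predictions). Otherwise the search rewards whichever model memorized the training set best.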

Result

Good signal, with clear boundaries

Figure: MAE comparison chart across model variants

  • Optimized Blend MAE: 6,776.11
  • Optimized Blend RMSE: 11,166.11
  • Best performance concentrated in the $130k–$185k band where comp density is highest.
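One way to check a claim like the last bullet is to split held-out errors by price band. The sketch below is illustrative Python rather than the project's R code: the band cut points follow the $130k and $185k boundaries quoted above, while the function names and sample prices are hypothetical.

```python
def band_of(price):
    """Assign a sale price to a tier using the $130k and $185k cut points."""
    if price < 130_000:
        return "below $130k"
    if price <= 185_000:
        return "$130k-$185k"
    return "above $185k"

def audit_by_band(actual, predicted):
    """Per band: count, MAE, and mean signed error (positive means under-prediction)."""
    groups = {}
    for a, p in zip(actual, predicted):
        groups.setdefault(band_of(a), []).append(a - p)
    report = {}
    for band, errors in groups.items():
        n = len(errors)
        report[band] = {
            "n": n,
            "mae": sum(abs(e) for e in errors) / n,
            "bias": sum(errors) / n,
        }
    return report

# Hypothetical held-out sale prices and predictions (illustration only).
actual = [120_000, 140_000, 160_000, 180_000, 250_000, 300_000]
predicted = [128_000, 142_000, 159_000, 177_000, 230_000, 270_000]

report = audit_by_band(actual, predicted)
for band, stats in report.items():
    print(band, stats)
```

Reporting the count next to each MAE matters here: a low error in a band with few sales says much less than the same error in the dense middle band.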

Limitations

Where the model got honest

  • Clustering as preprocessing hurt accuracy. Grouping by price-derived features left the groups above $185k with too few examples, which produced erratic predictions for higher-end homes.
  • Overfitting concentrated in the middle band. Strong early scores were partly an artifact of dense data in the $130k–$185k range; generalization above and below that range was weaker.
  • Feature gaps mattered. The dataset lacked year built, renovation history, garages, pools, and condition grades, all of which materially move price. As a result, the model under-predicted homes with significant unmeasured upside.

Next

How I would push it further

  • Source a richer dataset that includes year built, condition, and renovation history.
  • Replace blanket clustering with hierarchical models per price tier.
  • Add SHAP-based explanations so the agent sees why a price came out where it did.