Geo-selection and national-level data

Geo selection

When you are selecting geos, consider the following guidance:

  • Drop the smallest geos by total KPI first. Smaller geos contribute less to ROI, yet they can still strongly influence model fit, particularly when all geos share a single residual variance (unique_sigma_for_each_geo = False in ModelSpec).

  • For US advertisers using designated market area (DMA) as the geographical unit, a rough guideline is to model the top 50-100 DMAs by population size. This generally includes the vast majority of the KPI units, while excluding most of the noisier small DMAs that might impact model fit and convergence.

  • When each geo has its own residual variance (unique_sigma_for_each_geo = True in ModelSpec), noisier geos have less impact on model fit. However, this option can make convergence difficult for some datasets because it adds considerable flexibility to the model. If MCMC sampling does converge under this option, it might be worth plotting the geo population size against the mean residual standard deviation (the sigma parameter). In most cases, you would expect to see a fairly monotone pattern. If you don't see this pattern, it might be better to set unique_sigma_for_each_geo = False and use a smaller subset of geos.
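
As a rough illustration of the first two guidelines, the sketch below ranks geos by total KPI, keeps the largest ones, and reports the share of KPI they cover. The function name select_top_geos, the geo labels, and the KPI values are hypothetical and are not part of any library API; the same approach works if you rank by population instead.

```python
def select_top_geos(kpi_by_geo: dict[str, float], max_geos: int) -> tuple[list[str], float]:
    """Keep the `max_geos` largest geos by total KPI.

    Returns the kept geo names (largest first) and the fraction of
    total KPI that those geos account for.
    """
    # Rank geos from largest to smallest total KPI.
    ranked = sorted(kpi_by_geo, key=kpi_by_geo.get, reverse=True)
    kept = ranked[:max_geos]
    total = sum(kpi_by_geo.values())
    coverage = sum(kpi_by_geo[geo] for geo in kept) / total
    return kept, coverage


# Hypothetical total KPI per geo over the modeling window.
kpi_by_geo = {"geo_a": 500.0, "geo_b": 300.0, "geo_c": 150.0, "geo_d": 40.0, "geo_e": 10.0}

kept, coverage = select_top_geos(kpi_by_geo, max_geos=3)
print(kept, round(coverage, 3))  # ['geo_a', 'geo_b', 'geo_c'] 0.95
```

Checking the coverage value is a quick way to confirm that the dropped geos account for only a small share of KPI units before you refit the model.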

If you want to make sure the model represents 100% of your KPI units, you can aggregate smaller geos into larger regions. However, this option comes with several caveats:

  • Geo-level modeling has a significant advantage over national-level modeling, and this advantage grows with the number of geographically separated treatment units. Aggregating geos reduces that number. For more information, see National-level versus geo-level modeling.

  • Different ways of grouping geos into regions can lead to different MMM results.

  • Media execution variables, such as impressions or cost, can usually be summed across geos. However, some control variables, such as temperature, can be less straightforward to aggregate.
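
The aggregation caveat about control variables can be sketched as follows for a single time period. Impressions are summed across the geos in a region, while temperature is combined with a population-weighted mean. The weighting choice is an assumption made for illustration only, not a recommended method, and the function name aggregate_to_regions, the field names, and the data are hypothetical.

```python
from collections import defaultdict


def aggregate_to_regions(rows: list[dict], region_of: dict[str, str]) -> dict[str, dict]:
    """Aggregate per-geo rows (one time period) into regions.

    Geos missing from `region_of` are kept as their own region.
    Impressions and population are summed; temperature is averaged
    with population weights (an illustrative assumption).
    """
    acc = defaultdict(lambda: {"impressions": 0.0, "population": 0.0, "temp_x_pop": 0.0})
    for row in rows:
        region = region_of.get(row["geo"], row["geo"])
        a = acc[region]
        a["impressions"] += row["impressions"]
        a["population"] += row["population"]
        a["temp_x_pop"] += row["temperature"] * row["population"]
    return {
        region: {
            "impressions": a["impressions"],
            "population": a["population"],
            "temperature": a["temp_x_pop"] / a["population"],
        }
        for region, a in acc.items()
    }


# Hypothetical data for one week.
rows = [
    {"geo": "geo_a", "impressions": 8000.0, "population": 1000.0, "temperature": 15.0},
    {"geo": "geo_d", "impressions": 1000.0, "population": 100.0, "temperature": 10.0},
    {"geo": "geo_e", "impressions": 500.0, "population": 300.0, "temperature": 20.0},
]
region_of = {"geo_d": "small_geos", "geo_e": "small_geos"}

result = aggregate_to_regions(rows, region_of)
print(result["small_geos"])
# {'impressions': 1500.0, 'population': 400.0, 'temperature': 17.5}
```

A simple unweighted mean would give 15.0 for the same region, which shows how the choice of aggregation rule for a control variable can change the model inputs.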

National-level media in a geo-level model

When most media are available at the geo level, but one or two are only available at the national level, we recommend imputing the national-level media at the geo level and running a geo-level model. One naive imputation method is to approximate the geo-level media variable from its national-level value, using the proportion of the population in the geo relative to the total population. Although it is preferable to have accurate geo-level data so that imputation isn't necessary, imputation can still yield useful information about the model parameters. For more information, see section 4.4 of Geo-level Bayesian Hierarchical Media Mix Modeling.
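
The naive population-share imputation described above can be sketched as follows: each geo receives the national value multiplied by its share of the total population. The function name impute_geo_media, the geo labels, and the numbers are hypothetical and are not part of any library API.

```python
def impute_geo_media(
    national_media: list[float], population_by_geo: dict[str, float]
) -> dict[str, list[float]]:
    """Split a national media time series across geos by population share.

    For each time period, geo g receives
    national_value * population[g] / total_population.
    """
    total_population = sum(population_by_geo.values())
    return {
        geo: [value * population / total_population for value in national_media]
        for geo, population in population_by_geo.items()
    }


# Hypothetical national impressions for two weeks, and geo populations.
national_media = [1000.0, 2000.0]
population_by_geo = {"geo_a": 600.0, "geo_b": 300.0, "geo_c": 100.0}

geo_media = impute_geo_media(national_media, population_by_geo)
print(geo_media["geo_a"])  # [600.0, 1200.0]
```

By construction, the imputed geo-level values sum back to the national value in every time period. The method assumes media exposure per capita is the same in every geo, which is why accurate geo-level data is preferable when it exists.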