generate_maxdiff_data
- pymc_marketing.customer_choice.synthetic_data.generate_maxdiff_data(n_respondents=200, n_items=20, n_tasks_per_resp=12, subset_size=4, true_utilities=None, sigma_respondent=0.6, item_correlation=None, items=None, random_seed=None)
Generate synthetic MaxDiff (best-worst scaling) data.
Simulates a MaxDiff survey where each respondent sees n_tasks_per_resp tasks, each showing a random subset of subset_size items drawn uniformly from the full pool of n_items. The respondent picks the best and worst items from the subset according to the Louviere sequential best-worst model.
- Parameters:
- n_respondents
int, default 200 Number of respondents.
- n_items
int, default 20 Full item pool size.
- n_tasks_per_resp
int, default 12 Tasks shown per respondent.
- subset_size
int, default 4 Items shown per task (must be <= n_items).
- true_utilities
np.ndarray, optional Ground-truth item utilities of length n_items. If None, drawn from Normal(0, 1). The last item’s utility is shifted to 0 to match the default identification constraint.
- sigma_respondent
float, default 0.6 Scale of per-respondent item-level deviations (standard deviation). Set to 0 for a homogeneous-preferences population.
- item_correlation
np.ndarray, optional Shape (n_items, n_items) correlation matrix for the per-respondent utility deviations. Must be symmetric, positive semi-definite, with ones on the diagonal. When supplied, respondent deviations are drawn from MVNormal(0, diag(σ) @ item_correlation @ diag(σ)); otherwise deviations are drawn independently (diagonal covariance). Use this to generate correlated ground truth for validating MaxDiffMixedLogit(full_covariance=True) recovery.
- items
list[str], optional Item names (length n_items). Defaults to ["item_0", ...].
- random_seed
np.random.Generator or int, optional Random state for reproducibility.
- Returns:
- task_df
pd.DataFrame Long-format data with columns respondent_id, task_id, item_id, is_best, is_worst. One row per shown item per task.
- ground_truth
dict Dictionary with keys {"utilities", "respondent_utilities", "sigma_respondent", "item_correlation", "items"}. utilities is the population-level ground truth (reference item at 0); respondent_utilities holds per-respondent values used for simulation; item_correlation is the (n_items, n_items) correlation matrix used, which is np.eye(n_items) when item_correlation was not supplied.
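The best/worst selection and the long-format layout of task_df can be illustrated with a small NumPy-only sketch. It is not the library's implementation: it assumes the sequential best-worst model means "best" is a multinomial-logit draw over the shown items and "worst" is a multinomial-logit draw on negated utilities over the remaining items. The helper simulate_task and the variable names are hypothetical, and the real generator's dtypes and internals may differ.

```python
import numpy as np


def simulate_task(utilities, shown, rng):
    """Pick best, then worst, among the shown items (hypothetical helper)."""
    u = utilities[shown]
    # Best: multinomial logit over all shown items.
    p_best = np.exp(u - u.max())
    p_best /= p_best.sum()
    best_pos = rng.choice(len(shown), p=p_best)
    # Worst: multinomial logit on negated utilities, excluding the best item.
    remaining = np.delete(np.arange(len(shown)), best_pos)
    neg_u = -u[remaining]
    p_worst = np.exp(neg_u - neg_u.max())
    p_worst /= p_worst.sum()
    worst_pos = rng.choice(remaining, p=p_worst)
    return best_pos, worst_pos


rng = np.random.default_rng(0)
n_items, subset_size, n_tasks = 20, 4, 12
utilities = rng.normal(size=n_items)  # assumed utilities for one respondent

# One row per shown item per task, mirroring the documented columns.
rows = []
for task_id in range(n_tasks):
    shown = rng.choice(n_items, size=subset_size, replace=False)
    best_pos, worst_pos = simulate_task(utilities, shown, rng)
    for pos, item_id in enumerate(shown):
        rows.append(
            {
                "respondent_id": 0,
                "task_id": task_id,
                "item_id": int(item_id),
                "is_best": int(pos == best_pos),
                "is_worst": int(pos == worst_pos),
            }
        )
```

Each task contributes subset_size rows with exactly one is_best flag and one is_worst flag, and the two flags never fall on the same item.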
Notes
Subsets are drawn uniformly without replacement. Real MaxDiff studies use balanced designs (BIBD) for efficiency; this generator trades that for simplicity and is adequate for parameter-recovery testing.
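To see what uniform sampling without replacement implies for item exposure, the following sketch draws one respondent's design with NumPy and counts how often each item appears. It only mirrors the sampling scheme described above under the default sizes; the variable names are illustrative and it does not call the library.

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, n_tasks_per_resp, subset_size = 20, 12, 4

# One respondent's design: each task is a uniform draw without replacement.
design = np.stack(
    [
        rng.choice(n_items, size=subset_size, replace=False)
        for _ in range(n_tasks_per_resp)
    ]
)

# Number of times each item was shown to this respondent.
appearances = np.bincount(design.ravel(), minlength=n_items)

# A balanced design would target 12 * 4 / 20 = 2.4 showings per item;
# uniform draws give no such guarantee for any single respondent.
expected_per_item = n_tasks_per_resp * subset_size / n_items
print(appearances, expected_per_item)
```

Exposure evens out across many respondents, which is why this shortcut is reasonable for parameter-recovery checks even though individual designs are unbalanced.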
To verify that MaxDiffMixedLogit(full_covariance=True) recovers the latent correlation structure, generate data with a non-identity item_correlation and compare the posterior mean of corr_matrix against ground_truth["item_correlation"].
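A non-identity item_correlation has to be symmetric, positive semi-definite, and have ones on the diagonal. The sketch below builds one such matrix (an exchangeable structure with a common correlation rho, chosen here purely for illustration), forms the covariance diag(σ) @ item_correlation @ diag(σ), and checks that simulated deviations reproduce the target correlations. It assumes the scalar sigma_respondent applies equally to every item and uses only NumPy; it does not call the library or fit a model.

```python
import numpy as np

rng = np.random.default_rng(7)
n_items, n_respondents, sigma_respondent = 5, 5000, 0.6

# Exchangeable correlation matrix: rho off the diagonal, ones on it.
# This is positive definite whenever -1 / (n_items - 1) < rho < 1.
rho = 0.5
item_correlation = np.full((n_items, n_items), rho)
np.fill_diagonal(item_correlation, 1.0)

# Validity checks matching the documented requirements.
assert np.allclose(item_correlation, item_correlation.T)
assert np.allclose(np.diag(item_correlation), 1.0)
assert np.linalg.eigvalsh(item_correlation).min() >= -1e-10

# Covariance diag(sigma) @ R @ diag(sigma), assuming one common scale.
sigma = np.full(n_items, sigma_respondent)
cov = np.diag(sigma) @ item_correlation @ np.diag(sigma)

# Per-respondent deviations and their empirical correlation.
deviations = rng.multivariate_normal(np.zeros(n_items), cov, size=n_respondents)
empirical_corr = np.corrcoef(deviations, rowvar=False)
max_abs_error = np.abs(empirical_corr - item_correlation).max()
print(round(float(max_abs_error), 3))
```

The same element-wise comparison applies to a fitted model, with the posterior mean of corr_matrix taking the place of empirical_corr and ground_truth["item_correlation"] as the target.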