
Summary: This challenge supports early-stage development and validation of the SurveyResponder Python package—a tool for generating and analyzing synthetic survey data using large language models (LLMs). Participants will assess whether LLMs can produce realistic and demographically fair survey responses by comparing variation across models and personas. The focus is on designing and evaluating testing strategies, identifying bias, and comparing LLM outputs to known human response patterns. Work from this challenge may contribute to more trustworthy use of LLM-generated survey data in research.
Method areas: Inferential statistics (e.g., t-tests, ANOVA), response variability analysis, psychometric validation, LLM-based text generation.
Prerequisites: Familiarity with LLMs (e.g., via Ollama or AnywhereLLM) and basic survey research or psychometric principles is recommended. Experience with pandas and seaborn/matplotlib will be helpful for analysis and visualization.
- Description & Goal
- Data
- Prerequisites
- Resources for Getting Started
- Launch Date & Data Release
- Contact
This project explores the early-stage development of the SurveyResponder Python package—a tool designed to assist researchers, developers, and psychometricians with generating, scoring, and evaluating synthetic survey responses. The aim is to validate and assess the tool’s performance across various large language models (LLMs), with a focus on identifying potential bias, differences in output variation, and alignment with human-like response patterns.
By comparing output from multiple LLMs, participants will help determine how reliably the tool simulates survey responses. The project implements a key step in the tool's development: designing and evaluating a testing approach. Ultimately, this work may contribute to more accurate, trustworthy use of LLM-generated survey data in research and development.
Suggested Research Questions
- Do LLMs exhibit demographic bias when given personas defined by race, gender, and other characteristics?
- How do different LLMs vary in their response patterns (e.g., higher or lower standard deviation)?
- Which LLMs generate responses most consistent with human data?
Method Areas
- Rudimentary inferential statistical evaluation (e.g., comparing variation, standard deviation, and accuracy using t-tests or ANOVA; see the sketch after this list)
- Natural language generation using large language models (LLMs)
- Psychometric validation and bias testing
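As a concrete starting point, the sketch below shows how these comparisons (standard deviation by model, a two-group t-test for demographic bias, and a one-way ANOVA across models) might look with pandas and scipy. The file name `responses.csv` and the columns `model`, `persona_gender`, and `score` are placeholders for illustration, not the actual SurveyResponder output schema.

```python
# Minimal sketch of the statistical comparisons above, using placeholder
# file and column names (responses.csv, model, persona_gender, score) --
# adapt to the actual CSVs provided with the challenge.
import pandas as pd
from scipy import stats

responses = pd.read_csv("responses.csv")  # placeholder file name

# Response variability: standard deviation of item scores per model
print(responses.groupby("model")["score"].std())

# Demographic bias check: Welch t-test comparing two persona groups
group_a = responses.loc[responses["persona_gender"] == "female", "score"]
group_b = responses.loc[responses["persona_gender"] == "male", "score"]
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch t-test: t={t_stat:.3f}, p={p_value:.3f}")

# Model comparison: one-way ANOVA across all models
groups = [g["score"].to_numpy() for _, g in responses.groupby("model")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA across models: F={f_stat:.3f}, p={p_value:.3f}")
```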
Pre-generated datasets from the SurveyResponder tool are provided in CSV format, with each entry including:
- Model and persona identifiers
- Associated JSON files detailing randomized persona characteristics (see the loading sketch below)
Access the data here:
Google Drive Folder
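The sketch below illustrates one way the provided files might be loaded and joined for analysis. The file names, folder layout, and column names (`responses.csv`, a `personas/` folder of JSON files, `persona_id`, `model`, `item_score`) are assumptions for illustration; check the Drive folder for the actual structure.

```python
# Minimal sketch of loading the provided data; file and column names are
# placeholders -- check the Google Drive folder for the actual layout.
import json
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load a pre-generated response file
responses = pd.read_csv("responses.csv")

# Load the persona characteristics described in the accompanying JSON files
personas = []
for path in Path("personas").glob("*.json"):
    with open(path) as f:
        record = json.load(f)
    record["persona_id"] = path.stem  # assumes the file name identifies the persona
    personas.append(record)
personas = pd.DataFrame(personas)

# Join personas onto responses for bias and variability analysis
merged = responses.merge(personas, on="persona_id", how="left")

# Quick visual comparison of score distributions by model
sns.boxplot(data=merged, x="model", y="item_score")
plt.title("Response distributions by LLM")
plt.tight_layout()
plt.show()
```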
Optionally, participants may also use the tool itself to generate new outputs (see Generating Additional Data below).
Generating Additional Data
Participants may generate their own datasets using the SurveyResponder repository. The tool supports creating new responses using different LLMs, enabling customized testing (a generic generation sketch follows the list below).
- Repository: github.com/adamrossnelson/SurveyResponder
- Setup Instructions: ReadMe.md
- Required: Python + Git/GitHub basics
- Recommended: Familiarity with LLMs (e.g., Ollama or AnywhereLLM); basic understanding of psychometric principles or rudimentary survey research practices; data analysis and visualization skills (e.g., using pandas, seaborn, or matplotlib)
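For orientation only, the snippet below sketches persona-conditioned response generation with the `ollama` Python client. It is not the SurveyResponder API (see the repository's ReadMe.md for the tool's actual workflow); the model name, persona fields, and prompt wording are placeholders.

```python
# Generic illustration of persona-conditioned generation with the ollama
# Python client (pip install ollama; requires a running local Ollama server).
# This is NOT the SurveyResponder API -- see the repository ReadMe.md for
# the tool's actual workflow. Model name, persona, and prompt are placeholders.
import ollama

persona = {"age": 34, "gender": "female", "occupation": "teacher"}
item = "I feel confident using new technology at work."

prompt = (
    f"You are answering a survey as this persona: {persona}. "
    "On a 1-5 Likert scale (1 = strongly disagree, 5 = strongly agree), "
    f"respond to the statement: '{item}'. Reply with a single number."
)

response = ollama.chat(
    model="llama3",  # placeholder; any locally pulled model should work
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])
```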
Coming soon!
This challenge will launch on Sept. 11, 2025 (MLM25 kickoff), and will be hosted outside of Kaggle (MLM25 only).
If you have any questions about participating in this challenge, please contact Adam Ross Nelson (arnelson3@wisc.edu).