PlayCoder is a novel multi-agent framework that addresses the critical challenge of repository-aware GUI application code generation. Unlike traditional approaches that focus solely on compilation or unit test success, PlayCoder ensures both syntactic correctness and behavioral alignment through dynamic testing and iterative refinement.
GUI applications pose unique challenges for code generation: they require event-driven control flow, persistent application state, and complex user-interaction patterns. Traditional evaluation methods miss critical behavioral failures: code may compile and run yet exhibit silent logic flaws (e.g., collision-detection errors in games, or broken event handling).
PlayCoder addresses these challenges through two specialized agents:
- PlayDeveloper: Repository-aware code generation agent
- PlayRefiner: Automated program repair agent for iterative code refinement
- Key Features
- Evaluation Metrics
- Multi-Agent Architecture
- Dataset and Benchmark
- Quick Start
- Automated GUI Repository Processing
- Function Information Extraction
- AI-Driven Function Generation
- PlayTester: GUI Behavioral Testing
- Evaluation and Metrics
- Dependencies and Environment Setup
- Citation
- Beyond Compilation: Traditional metrics only check if code compiles and runs, missing critical behavioral failures
- Interactive Testing: PlayTester validates GUI applications through actual user interaction simulation
- Silent Failure Detection: Identifies logic flaws that don't cause crashes but break application functionality
- PlayDeveloper: Generates repository-aware code using retrieved patterns and module structures
- PlayRefiner: Analyzes execution traces, synthesizes patches, and applies fixes iteratively
- Exec@k: Measures successful execution without runtime errors
- Pass@k: Evaluates correctness against unit tests
- Play@k: Assesses semantic correctness through interactive GUI testing
- 43 GUI Applications across 6 categories: Game Emulation, Classic Games, MMORPG Games, Game Engine, Standalone Applications, and Desktop Widgets
- Multi-Language: Python, TypeScript, and JavaScript
- Framework Agnostic: Supports PyQt6, Pygame, React, Next.js, Svelte, and more
- Cross-Platform: Windows, macOS, and X11-based Linux distributions
PlayCoder introduces a hierarchical evaluation methodology that progressively assesses code quality: Exec@k (Execution Success), Pass@k (Unit Test Success), Play@k (Behavioral Correctness)
Example: In a Flappy Bird game, code might achieve 100% Exec@k and Pass@k but 0% Play@k if the bird can pass through obstacles without collision detection.
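All three metrics follow pass@k-style estimation. Assuming the standard unbiased estimator (1 − C(n−c, k)/C(n, k)) from the code-generation literature — the paper's exact estimator may differ — a metric over n generations with c successes can be sketched as:

```python
from math import comb

def metric_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    samples (drawn from n generations, c of which succeed) passes.

    The same estimator serves Exec@k, Pass@k, and Play@k; only the
    definition of "success" (executes / passes unit tests / behaves
    correctly under PlayTester) changes per metric.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Flappy Bird example from the text: all 10 generations execute and pass
# unit tests, but none has working collision detection.
exec_at_1 = metric_at_k(10, 10, 1)  # 1.0
play_at_1 = metric_at_k(10, 0, 1)   # 0.0
```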
- Context-Aware Generation: PlayDeveloper generates repository-aware code using retrieved patterns
- Behavioral Testing: PlayTester launches applications and executes interaction sequences
- Diagnosis & Repair: PlayRefiner analyzes feedback and synthesizes targeted patches
- Iterative Feedback: Updated applications undergo re-testing until behavioral criteria are met
- Repository-Aware: Retrieves relevant code examples and import patterns from repository context
- Tool Integration: Uses ContextSearchTool, FileReadTool, BashTool, and ConversationTool
- Multi-LLM Support: Compatible with OpenAI, Anthropic, and other LLM providers
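Repository-aware retrieval can be approximated with a simple keyword scan over source files. The sketch below is a toy stand-in for PlayDeveloper's ContextSearchTool (the helper name and behavior are illustrative, not the actual implementation — real retrieval would rank by relevance and return whole functions):

```python
from pathlib import Path

def retrieve_context(repo_root, keywords, max_snippets=3):
    """Return source lines from the repository that mention any keyword.

    Toy illustration of repository-aware context retrieval: scan every
    .py file under repo_root and collect up to max_snippets matches.
    """
    hits = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        for lineno, line in enumerate(
                path.read_text(errors="ignore").splitlines(), 1):
            if any(kw in line for kw in keywords):
                hits.append(f"{path.name}:{lineno}: {line.strip()}")
                if len(hits) >= max_snippets:
                    return hits
    return hits
```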
- Visual Observer: Captures application state via screenshots and window detection
- Action Executor: Translates test strategies into GUI operations (click, type, scroll, etc.)
- Test Manager: Uses vision-language models to analyze screenshots and plan interaction sequences
- Exception-Aware: Proactively checks for invalid moves, UI freezes, and termination conditions
- APR-Driven: Performs automated program repair based on behavioral feedback
- Three-Phase Process: Diagnosis → Patch Generation → Validation
- Repository Context: Uses ContextSearcher for repository-aware fixes
- Iterative Refinement: Continues until behavioral criteria are satisfied
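The diagnose → patch → validate cycle can be sketched as a loop over pluggable callables (illustrative only — in PlayCoder both steps are LLM-driven and consume execution traces and PlayTester feedback):

```python
from typing import Callable, Optional, Tuple

def refine(code: str,
           diagnose: Callable[[str], Optional[str]],
           patch: Callable[[str, str], str],
           max_rounds: int = 5) -> Tuple[str, bool]:
    """Iteratively repair `code` until `diagnose` reports no failure.

    diagnose: returns a failure description (from behavioral feedback)
              or None once the behavioral criteria are satisfied.
    patch:    synthesizes a repaired version of the code for a failure.
    """
    for _ in range(max_rounds):
        failure = diagnose(code)
        if failure is None:
            return code, True        # validation succeeded
        code = patch(code, failure)  # targeted repair
    return code, False               # refinement budget exhausted

# Toy example: "repair" a marker string until it validates.
fixed, ok = refine(
    "buggy",
    diagnose=lambda c: None if c == "fixed" else "still buggy",
    patch=lambda c, msg: "fixed",
)
```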
PlayEval comprises 43 diverse GUI applications across three programming languages (Python, TypeScript, JavaScript) and six categories. Complete metadata is available in benchmark_metadata.json.
| # | Project | Language | Category | GitHub Stars | Archived | Framework | Inclusion Rationale |
|---|---|---|---|---|---|---|---|
| 1 | PyBoy | Python | Game Emulation | ~9.8k | No | Pygame | Only complete Python Game Boy emulator; high hardware-simulation complexity |
| 2 | 2048-python | Python | Classic Games | 354 | Yes | curses | Canonical puzzle-game; feature-complete and community-validated before archival |
| 3 | 2048 (Pygame) | Python | Classic Games | ~95 | No | Pygame | Classic tile-merging puzzle with non-trivial game loop |
| 4 | Snake (Pygame) | Python | Classic Games | ~95 | No | Pygame | Real-time movement and collision-detection logic |
| 5 | Flappy Bird (Pygame) | Python | Classic Games | ~95 | No | Pygame | Physics-based side-scroller with procedural obstacle generation |
| 6 | Sudoku (Pygame) | Python | Classic Games | ~95 | No | Pygame | Constraint-solving grid puzzle with interactive cell selection |
| 7 | Chrome Dragon (Pygame) | Python | Classic Games | ~95 | No | Pygame | Endless runner with procedural terrain and jump mechanics |
| 8 | Jupylet | Python | Game Engine | ~250 | No | OpenGL/Moderngl | Educational game engine; 12k LOC, complex rendering and event handling |
| 9 | python-chess | Python | Classic Games | 4 | No | Pygame | Complex strategy-game GUI with full ruleset logic and state management |
| 10 | shtosh-calculator | Python | Standalone Applications | 34 | No | PyQt6 | Representative small-scale PyQt6 app; excellent deployability |
| 11 | Browser | Python | Standalone Applications | ~4.5k | No | PyQt6 | Full-featured web browser built with PyQt6 WebEngine |
| 12 | Browser Tabbed | Python | Standalone Applications | ~4.5k | No | PyQt6 | Multi-tab browser variant; tests tab-management UI workflows |
| 13 | Calculator | Python | Standalone Applications | ~4.5k | No | PyQt6 | Scientific calculator; compact but non-trivial expression handling |
| 14 | Camera | Python | Standalone Applications | ~4.5k | No | PyQt6 | Live camera capture with frame display; tests real-time GUI updates |
| 15 | Crypto Wallet | Python | Standalone Applications | ~4.5k | No | PyQt6 | Cryptocurrency dashboard; rich data-binding and multi-panel layout |
| 16 | Currency Converter | Python | Standalone Applications | ~4.5k | No | PyQt6 | Live-data currency converter app; tests network-integrated GUI |
| 17 | Media Player | Python | Standalone Applications | ~4.5k | No | PyQt6 | Audio/video player with playback controls and progress display |
| 18 | Minesweeper | Python | Classic Games | ~4.5k | No | PyQt6 | Classic mine-clearing logic game with complete win/lose conditions |
| 19 | Notepad | Python | Standalone Applications | ~4.5k | No | PyQt6 | Plain-text editor with file I/O and find/replace |
| 20 | Notes App | Python | Standalone Applications | ~4.5k | No | PyQt6 | Sticky-notes manager; tests persistent storage and dynamic widget creation |
| 21 | Paint | Python | Standalone Applications | ~4.5k | No | PyQt6 | Raster drawing app; canvas event handling and tool state management |
| 22 | Solitaire | Python | Classic Games | ~4.5k | No | PyQt6 | Full Klondike solitaire with drag-and-drop card mechanics |
| 23 | Translator | Python | Standalone Applications | ~4.5k | No | PyQt6 | Language translation app; network API integration and async GUI |
| 24 | Unzip Utility | Python | Standalone Applications | ~4.5k | No | PyQt6 | Archive extraction with progress reporting and file browsing |
| 25 | Weather App | Python | Standalone Applications | ~4.5k | No | PyQt6 | Weather forecast with icon display and location search |
| 26 | Word Processor | Python | Standalone Applications | ~4.5k | No | PyQt6 | Rich-text editor with formatting and document management |
| 27 | Color Button | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Color-picking push button widget |
| 28 | Equalizer Bar | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Animated audio equalizer bar widget |
| 29 | Gradient Widget | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Two-stop color gradient selector widget |
| 30 | Paint Widget | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Embeddable drawing canvas widget |
| 31 | Color Palette | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Click-to-select color palette picker |
| 32 | Password Edit | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Password input with show/hide toggle |
| 33 | Power Bar | Python | Desktop Widgets | ~4.5k | No | PyQt6 | LED-style power-level indicator widget |
| 34 | Range Slider | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Dual-handle range slider widget; fixed-size component |
| 35 | Toggle Switch | Python | Desktop Widgets | ~4.5k | No | PyQt6 | Animated on/off toggle button widget |
| 36 | react-tetris | JavaScript | Classic Games | ~8.7k | No | React/Redux | High-star React Tetris (~8.7k); validates JS game GUI generation |
| 37 | spotify-react-web-client | JavaScript | Standalone Applications | 283 | No | React | Large JS web-app (14k LOC); extends benchmark to complex real-world web GUIs |
| 38 | win11React | JavaScript | Standalone Applications | ~9.7k | No | React | Windows 11 desktop simulator (~9.7k LOC); browser-based OS-level GUI challenge |
| 39 | 2048-in-react | TypeScript | Classic Games | 234 | No | React/Next.js | TypeScript counterpart of 2048-python; enables cross-language comparison |
| 40 | CyberCodeOnline | TypeScript | MMORPG Games | ~1.3k | No | React | Full MMORPG with game loop and economy; evaluates complex TS game generation |
| 41 | biomes-game | TypeScript | MMORPG Games | ~2.6k | No | Next.js/Three.js | Open-source 3D MMORPG; tests 3D interactive GUI environment generation |
| 42 | macos-web | TypeScript | Standalone Applications | ~2.6k | No | Svelte | macOS desktop simulator (Svelte); adds TS+Svelte framework coverage |
| 43 | space-invaders | TypeScript | Classic Games | 56 | No | React/Canvas | Canvas-based Space Invaders; tests real-time animation logic across languages |
- Historically Active Development — commits within the past 12 months at time of selection, or ≥ 6 months of sustained development history with feature completeness before archival
- Community Validation — most projects have ≥ 100 GitHub stars (exceptions accepted when deployability and category representativeness are exemplary)
- Functional Completeness — applications demonstrate complete GUI workflows
- Framework Diversity — covers PyQt6, Pygame, Tkinter, React, Next.js, Svelte, Three.js
- Exemplary Value — non-trivial functions with ≥ 28 lines (Python) or ≥ 5 lines (JS/TS) after filtering, focusing on game-loops, event-handlers, and core application logic
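The per-language length filter above can be expressed directly. The sketch assumes non-blank lines are what gets counted (the pipeline's exact counting rule is not specified here):

```python
# Minimum body length, per the PlayEval selection criteria.
MIN_LINES = {"python": 28, "javascript": 5, "typescript": 5}

def is_exemplary(func_body: str, language: str) -> bool:
    """Apply the per-language minimum-length filter:
    >= 28 lines for Python, >= 5 for JS/TS (non-blank lines assumed)."""
    threshold = MIN_LINES.get(language.lower())
    if threshold is None:
        return False
    nonblank = [ln for ln in func_body.splitlines() if ln.strip()]
    return len(nonblank) >= threshold
```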
- SECURITY.md — never commit tokens; covers historical PAT-leak remediation and git filter-repo instructions.
- THIRD_PARTY_NOTICES.md — vendored upstreams and baseline locations; each subtree may carry its own license.
The script clone_repos.sh (repository root) clones every unique owner/repo root referenced in benchmark_metadata.json for PlayEval (15 upstream Git roots covering all 43 benchmark entries; monorepos such as pythonguis/pythonguis-examples or NemoHoHaloAi/Game are cloned once).
Requires git and python3. Existing clones are updated with git pull --ff-only. For discovery-oriented crawling outside this fixed benchmark, use §6 cloneGIT.py, not this script.
chmod +x clone_repos.sh # once
./clone_repos.sh
SHALLOW=1 ./clone_repos.sh # or: ./clone_repos.sh --shallow
./clone_repos.sh --dest /path/to/output   # absolute path, or relative to repo root

The script clone_baselines.sh clones or updates agent baseline upstreams into fixed paths under this repository (same layout as in THIRD_PARTY_NOTICES.md). Existing Git checkouts are updated with git pull --ff-only. If a target path already exists but is not a Git working tree (no .git), the script exits with an error so local trees are never deleted automatically.
| Local path | Upstream |
|---|---|
| Game_Tars/OmniParser | microsoft/OmniParser |
| baselines/DeepCode | HKUDS/DeepCode |
| baselines/MetaGPT | FoundationAgents/MetaGPT |
| baselines/OpenManus | FoundationAgents/OpenManus |
Requires git. Shallow clones and a no-network plan preview are supported.
chmod +x clone_baselines.sh # once
./clone_baselines.sh
SHALLOW=1 ./clone_baselines.sh # or: ./clone_baselines.sh --shallow
./clone_baselines.sh --dry-run          # or: DRY_RUN=1 ./clone_baselines.sh

Spec-style tests for this script live under tests/clone_baselines/ (bats-core: e.g. bats tests/clone_baselines/clone_baselines.bats).
# Minimal dependencies (recommended for security audit / function generation workflows)
pip install -r function_gen_requirements.txt
# Full benchmark + GUI/vision stack (only when needed)
pip install -r requirements.txt
pip install tiktoken

Or (suggested):
mkdir -p ~/conda_envs/playcoder
tar -xzf conda_env.tar.gz -C ~/conda_envs/playcoder
conda activate ~/conda_envs/playcoder

Before you start GUI testing, you should:
- Enable accessibility permissions for GUI automation (in macOS privacy settings); otherwise, the automation will fail.
- Install Xcode command line tools:
xcode-select --install
# Apply patches and evaluate with Play@k (a simple demo), Provided by PlayEval
python apply_patches.py --patches Jsons/patches_origin_gpt-4o-mini_2048_test.json --GUI_test True --base-dir repos_GAME_python_demo --execution-mode
# Playback for generated repo (human scoring), use for quick start!
python replicate_GUI_test.py --log-file GUI_snap/gui_test_log_20251217_202553.json --log-dir GUI_snap
# Run PlayCoder multi-agent framework
python function_gen_cli.py --provider openai --model gpt-4o-mini run --input-file Jsons/extracted_functions_with_comments_all_sampled10.json --output-file Jsons/patches_agent.json

If anything goes wrong, manually copy repos_RELAY (the backup folder) over repos_GAME_python_demo to restore the repository to its initial state.
Configure API keys in openai_config.json:
{
"api_key": "your-openai-api-key",
"base_url": "https://api.openai.com/v1",
"model": "gpt-4o"
}

Script: cloneGIT.py
- Function: Automatically crawls GitHub GUI application projects (including games, desktop apps, widgets) within specified criteria, cloning them to a local directory for analysis.
- Selection Criteria: Active development (commits within 6 months), community validation (high GitHub stars), functional completeness, framework diversity (PyQt, PySide, Tkinter, Pygame).
- Dependency: PyGithub. Set GITHUB_TOKEN in the environment, or use a local-only single-line file at dataset/token.txt (never commit secrets; see SECURITY.md).
Usage:
python cloneGIT.py

- The default path and time window can be modified in the script.
- Custom query expressions (e.g., language, creation date, GUI framework) are supported.
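Token discovery and query construction for such a crawler might look like the sketch below. The helper names, the pushed-date cutoff, and the framework topics are illustrative assumptions; see cloneGIT.py itself for the script's actual defaults:

```python
import os
from pathlib import Path
from typing import Optional

def load_github_token(token_file: str = "dataset/token.txt") -> Optional[str]:
    """Prefer the GITHUB_TOKEN environment variable; fall back to a
    local-only single-line token file. Never commit that file."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token
    path = Path(token_file)
    if path.is_file():
        return path.read_text().strip() or None
    return None

def build_search_query(language: str = "python",
                       min_stars: int = 100,
                       frameworks=("pyqt", "pygame")) -> str:
    """Compose a GitHub repository-search query for active GUI projects
    (qualifiers follow GitHub's search syntax)."""
    topics = " OR ".join(frameworks)
    return f"({topics}) language:{language} stars:>={min_stars} pushed:>2024-01-01"

# With PyGithub (pip install PyGithub), the query would be issued roughly as:
#   from github import Github
#   for repo in Github(load_github_token()).search_repositories(build_search_query()):
#       ...  # clone repo.clone_url into the local dataset directory
```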
Script: extract_function_info.py
- Function: Automatically extracts all Python function signatures, bodies, docstrings, complexity, call relations, and other structured information from the crawled repositories, outputting to JSON.
- Dependency: tqdm, Python standard library.
Key Arguments:
- --base-dir: Root directory of repositories to analyze (default: repos)
- --max-files: Maximum number of files to process
- --output-file: Output JSON file (default: Jsons/extracted_functions.json)
- --summary-only: Print summary statistics only; do not save full data
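The core of such extraction can be sketched with Python's standard ast module (a simplified illustration, not the script's actual implementation):

```python
import ast

def extract_function_info(source: str) -> list:
    """Collect the name, argument names, docstring, and body line count
    for every function defined in a Python source string."""
    infos = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            infos.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node),
                "lines": node.end_lineno - node.lineno + 1,
            })
    return infos

sample = '''
def move(dx, dy):
    """Move the player."""
    return dx + dy
'''
infos = extract_function_info(sample)
```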
Usage:
python extract_function_info.py --base-dir <repo_dir> --max-files 100 --output-file Jsons/extracted_functions.json

Script: generate_function_descriptions.py
- Function: Uses AI (e.g., OpenAI GPT) to automatically generate high-quality docstrings for each function, supporting batch and analysis modes.
- Dependency: openai. Requires configuration in openai_config.json.
Key Arguments:
- --input-file: Input function info JSON (default: Jsons/extracted_functions.json)
- --output-file: Output enhanced JSON (default: Jsons/extracted_functions_with_comments.json)
- --config-file: OpenAI config file
- --max-functions: Maximum number of functions to process
- --batch-size: Number of functions per batch
- --analyze: Analyze generated comments
- --test-config: Test API configuration
Usage:
python generate_function_descriptions.py --input-file Jsons/extracted_functions.json --output-file Jsons/extracted_functions_with_comments.json --config-file openai_config.json

Script: generate_functions_from_descriptions.py
- Function: Generates repository-level function code from comments, supporting three modes:
- Pure LLM (original prompt)
- Structured Chain of Thought (SCoT) (recommended)
- HCP-Coder (context-enhanced)
- Dependency: openai, tree-sitter (optional)
Key Arguments:
- --input-file: Input comments JSON (default: Jsons/extracted_functions_with_comments.json)
- --output-file: Output patches JSON (default: Jsons/patches.json)
- --config-file: OpenAI config
- --repo-path: Path to repository for context enhancement
- --max-functions: Maximum number of functions to process
- --batch-size: Number of functions per batch
- --use-scot / --no-scot: Enable/disable SCoT mode
- --analyze: Analyze generated patches
- --sample: Show sample functions
- --test-config: Test API configuration
- --demo: Show the prompt only; do not call the API
Usage:
python generate_functions_from_descriptions.py --input-file Jsons/extracted_functions_with_comments.json --output-file Jsons/patches.json --use-scot

- For pure LLM mode: add --no-scot
- For HCP-Coder mode: add --repo-path <repo_dir>
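SCoT prompting asks the model to draft a structured plan (sequence, branch, loop) before emitting code. A minimal prompt builder might look like the sketch below; the wording is illustrative, and the script's real prompts will differ:

```python
def build_scot_prompt(signature: str, docstring: str) -> str:
    """Compose a Structured Chain-of-Thought prompt: the model first
    writes a plan in terms of control structures, then the code."""
    return (
        "You are completing a function inside an existing repository.\n"
        f"Signature: {signature}\n"
        f"Docstring: {docstring}\n\n"
        "Step 1 - Plan: describe the solution using only sequence, "
        "branch (if/else), and loop structures.\n"
        "Step 2 - Code: implement the plan as the function body.\n"
    )

prompt = build_scot_prompt("def merge_tiles(row):",
                           "Merge equal adjacent tiles, 2048-style.")
```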
Script: function_gen_cli.py (entry point, calls function_gen_agent/cli.py)
- Function: Uses an agent-based method to generate repository-level function code from comments, supports APR (Automated Program Repair) switch, and multiple LLM backends.
- Dependency: openai, anthropic (optional). Requires function_gen_config.json.
Main Subcommands:
- run: Batch function generation
- interactive: Interactive agent mode
- show-config: Show current configuration
- test-provider: Test LLM connection
- create-config: Generate config template
Key Arguments (for run):
- --input-file: Input comments JSON
- --output-file: Output patches JSON
- --provider: LLM backend (openai/anthropic)
- --model: Model name
- --batch-size: Batch size
- --max-retries: Max retries
- --trajectory-file: Save agent trajectory
- --no-trajectory: Disable trajectory recording
Usage:
# Basic usage
python function_gen_cli.py run --input-file Jsons/extracted_functions_with_comments.json --output-file Jsons/patches_agent.json
# Specify model and backend
python function_gen_cli.py run --provider openai --model gpt-4o
# Interactive mode
python function_gen_cli.py interactive

- APR-related arguments are supported; see function_gen_agent/cli.py for details.
Script: generate_test_cases.py
- Function: Automatically generates high-quality unit/integration/functional/edge test cases for each repository, supporting multiple game types.
- Dependency: openai. Requires openai_config.json and game_config.json.
Key Arguments:
- --base-dir: Root directory of repositories to analyze
- --config-file: Game config file
- --api-config: OpenAI config
- --max-files: Maximum number of files to process
- --results-file: Output test cases JSON
- --no-save: Do not save results
Usage:
python generate_test_cases.py --base-dir <repo_dir> --results-file game_test_cases.json

Directory: Game_Tars/
PlayTester is a specialized GUI testing agent that validates behavioral correctness through automated user interaction simulation. It implements multi-modal testing capabilities through three core components:
- Screenshot Capture: Uses pyautogui and PIL for application state capture
- Window Detection: Platform-specific APIs (AppleScript on macOS, Win32 on Windows)
- State Analysis: Extracts structured information from visual elements (e.g., game grids, UI components)
- Change Detection: Compares frames to identify state transitions
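Change detection between consecutive frames can be as simple as a pixel-difference ratio. The toy sketch below works on plain nested lists so it stays library-free; the real observer operates on PIL screenshots, and the 1% threshold is an illustrative default, not a documented value:

```python
def changed_fraction(frame_a, frame_b) -> float:
    """Fraction of pixels that differ between two equal-sized frames,
    each given as a 2D list of pixel values."""
    total = diff = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            diff += pa != pb
    return diff / total if total else 0.0

def state_transitioned(frame_a, frame_b, threshold: float = 0.01) -> bool:
    """Treat the GUI as having changed state when more than `threshold`
    of the pixels differ between frames."""
    return changed_fraction(frame_a, frame_b) > threshold
```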
- GUI Operations: click(x, y), type(text), hotkey(keys), press(key), scroll(), wait()
- Safety Mechanisms: Coordinate boundary checks and failsafe cursors
- Action Parsing: Structured LLM output parsing for precise control
- Execution History: Maintains logs for debugging and analysis
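Structured action parsing can be sketched as a small regex dispatcher that turns the LLM's textual action into a (name, args) pair before it is handed to the GUI backend. The grammar below is hypothetical; the real parser is richer:

```python
import re

# Whitelisted action verbs, matching the GUI operations listed above.
ACTION_RE = re.compile(r"^(click|type|hotkey|press|scroll|wait)\((.*)\)$")

def parse_action(text: str):
    """Parse e.g. 'click(120, 45)' or 'type(hello)' into (name, args).
    Raises ValueError for anything outside the allowed action set, so a
    malformed LLM output can never trigger an arbitrary operation."""
    m = ACTION_RE.match(text.strip())
    if not m:
        raise ValueError(f"unrecognized action: {text!r}")
    name, raw_args = m.groups()
    args = [a.strip() for a in raw_args.split(",")] if raw_args.strip() else []
    # Numeric coordinates / amounts become ints; everything else stays str.
    args = [int(a) if a.lstrip("-").isdigit() else a for a in args]
    return name, args
```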
- Vision-Language Integration: Uses VLM to analyze screenshots and plan test strategies
- Behavioral Validation: Checks collision detection, event handling, state transitions
- Exception Detection: Proactively identifies UI freezes, invalid moves, termination errors
- Strategic Testing: Balances functionality validation with gameplay progression
- Silent Failure Detection: Identifies behavioral bugs that don't cause crashes
- Interactive Verification: Tests applications through actual user interaction patterns
- Cross-Platform Support: Works on Windows, macOS, and Linux
- Framework Agnostic: Supports PyQt, Tkinter, Pygame, and other GUI frameworks
- Automated Reporting: Generates comprehensive behavioral analysis reports
# Launch PlayTester for a 2048 game
playtester = PlayTester(app_path="2048.py")
results = playtester.run_behavioral_test(
max_interactions=100,
strategy="coverage_maximization"
)
print(f"Play@k Success: {results.behavioral_correctness}")

See Game_Tars/README.md for detailed API documentation and advanced usage patterns.
PlayCoder's evaluation framework provides comprehensive assessment across three progressive criteria, demonstrating significant improvements over baseline approaches.
Script: apply_patches.py
The evaluation proceeds through three stages:
- Compilation and Execution: Measures Exec@k - successful execution without runtime errors
- Unit Testing: Evaluates Pass@k - correctness against comprehensive test suites
- Behavioral GUI Testing: Assesses Play@k - interactive behavioral correctness via PlayTester
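One plausible way to aggregate per-sample results through the three stages is gated accounting: a sample only counts toward a later metric if it cleared the earlier ones. This is a sketch of that idea, not the paper's exact bookkeeping:

```python
def staged_rates(samples):
    """samples: list of dicts with boolean 'exec', 'pass', 'play' fields.
    Returns the fraction of samples clearing each successive gate."""
    n = len(samples)
    execs = [s for s in samples if s["exec"]]          # stage 1 gate
    passes = [s for s in execs if s["pass"]]           # stage 2 gate
    plays = [s for s in passes if s["play"]]           # stage 3 gate
    return {"Exec": len(execs) / n,
            "Pass": len(passes) / n,
            "Play": len(plays) / n}

results = staged_rates([
    {"exec": True, "pass": True, "play": True},
    {"exec": True, "pass": True, "play": False},   # silent behavioral bug
    {"exec": False, "pass": False, "play": False},
])
```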
Key Arguments:
- --patches: Patches JSON file from PlayCoder agents
- --test-cases: Comprehensive test cases JSON
- --behavioral-testing: Enable PlayTester evaluation (Play@k)
- --backup-dir: Backup directory for rollback safety
- --output-report: Detailed evaluation report with all metrics
- --execution-mode: Fast Exec@k-only evaluation
- --play-mode: Full behavioral evaluation with PlayTester
Usage:
# Complete PlayCoder evaluation (Exec@k + Pass@k + Play@k)
python apply_patches.py --patches Jsons/patches_PlayCoder.json --test-cases test_cases.json --behavioral-testing
# Compare with baseline methods
python apply_patches.py --patches Jsons/patches_baseline.json --test-cases test_cases.json --behavioral-testing --output-report comparison_results.json
# Quick execution check only
python apply_patches.py --patches Jsons/patches.json --execution-mode

- Behavioral Gap: Traditional baselines show significant degradation from Exec@k to Play@k (e.g., GPT-5 drops from 17.3% to 6.7%)
- PlayCoder Consistency: Multi-agent framework maintains higher performance across all metrics
- Silent Failure Detection: PlayTester identifies critical behavioral bugs missed by unit tests
- Model Agnostic: Improvements consistent across different LLM architectures
- Python: 3.8+ (recommended: 3.10+)
- Operating System: macOS, Windows, or Linux
- Memory: 8GB+ RAM (16GB+ recommended for large GUI applications)
- Display: GUI display required for PlayTester behavioral validation
# Install minimal dependencies first (recommended)
pip install -r function_gen_requirements.txt
# Install full stack only if you need PlayTester / GUI / OCR / YOLO workflows
pip install -r requirements.txt
# Essential packages
pip install openai anthropic tqdm pillow pyautogui opencv-python psutil requests
# GUI automation dependencies
pip install pyautogui pillow opencv-python
# Code analysis dependencies
pip install tree-sitter  # For AST parsing and context extraction

OpenAI Configuration (openai_config.json):
{
"api_key": "your-openai-api-key",
"base_url": "https://api.openai.com/v1",
"model": "gpt-4o",
"temperature": 0.2,
"max_tokens": 4096
}

Anthropic Configuration (anthropic_config.json):
{
"api_key": "your-anthropic-api-key",
"model": "claude-3-sonnet-20240229",
"temperature": 0.2,
"max_tokens": 4096
}

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"
export GITHUB_TOKEN="your-github-token"  # For repository crawling

macOS:
- Enable accessibility permissions for GUI automation
- Install Xcode command line tools:
xcode-select --install
Windows:
- Install Visual C++ Build Tools for native dependencies
- Ensure proper display scaling for screenshot accuracy
Linux:
- Install display server dependencies: sudo apt-get install xvfb (for headless testing)
- GUI framework dependencies: sudo apt-get install python3-tk python3-pyqt5
If you use PlayCoder in your research, please cite our paper:
@inproceedings{PlayCoder2026,
title={PlayCoder: Making LLM-Generated GUI Code Playable},
  author={Peng, Zhiyuan and Tao, Wei and Yin, Xin and Ying, Chenhao and Luo, Yuan and Guo, Yiwen},
booktitle={Proceedings of the 34th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering},
year={2026},
organization={ACM}
}

This research addresses fundamental challenges in GUI application code generation through novel multi-agent collaboration and behavioral validation methodologies. The work demonstrates that coupling end-to-end GUI testing with repository-aware automated program repair represents an effective path toward reliable interactive application development.
We welcome academic collaboration and discussion. For questions about the research methodology, experimental setup, or implementation details, please submit an issue or contact the authors.
