ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Abstract
A modular multi-agent framework named ScreenCoder decomposes UI design-to-code translation into grounding, planning, and generation stages, achieving superior layout accuracy and code correctness through specialized agents and fine-tuned multimodal models.
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, yielding substantial gains in its UI generation capabilities. Extensive experiments show that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
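To make the three-stage decomposition concrete, here is a minimal sketch of how grounding, planning, and generation could be chained. It is not the ScreenCoder implementation: the names `Region`, `LayoutNode`, `ground`, `plan`, `generate`, and `screenshot_to_html` are hypothetical, the grounding stage is a stub that returns fixed regions instead of querying a multimodal model, and the region labels and coordinates are made up for illustration.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Region:
    """A labeled UI region in pixel coordinates (hypothetical schema)."""
    label: str
    x: int
    y: int
    w: int
    h: int


@dataclass
class LayoutNode:
    """A node in the layout tree produced by the planning stage."""
    tag: str
    label: str = ""
    children: list["LayoutNode"] = field(default_factory=list)


def ground(screenshot_path: str) -> list[Region]:
    """Grounding stage (stub): a real agent would detect regions with an MLLM.

    The path is ignored and fixed regions are returned so the sketch stays runnable.
    """
    return [
        Region("main", 200, 80, 1080, 640),
        Region("header", 0, 0, 1280, 80),
        Region("sidebar", 0, 80, 200, 640),
    ]


def plan(regions: list[Region]) -> LayoutNode:
    """Planning stage: order regions top-to-bottom, then left-to-right, under one root."""
    ordered = sorted(regions, key=lambda r: (r.y, r.x))
    root = LayoutNode(tag="div", label="root")
    root.children = [LayoutNode(tag="section", label=r.label) for r in ordered]
    return root


def generate(node: LayoutNode, indent: int = 0) -> str:
    """Generation stage: serialize the layout tree into indented HTML."""
    pad = "  " * indent
    open_tag = f'{pad}<{node.tag} class="{escape(node.label, quote=True)}">'
    if not node.children:
        return f"{open_tag}</{node.tag}>"
    inner = "\n".join(generate(child, indent + 1) for child in node.children)
    return f"{open_tag}\n{inner}\n{pad}</{node.tag}>"


def screenshot_to_html(screenshot_path: str) -> str:
    """Run the three stages in sequence; each stage can be inspected on its own."""
    return generate(plan(ground(screenshot_path)))
```

Calling `screenshot_to_html("mock.png")` returns a root `div` containing one `section` per region in reading order. Because each stage takes and returns plain data, an intermediate result such as the region list or the layout tree can be checked or corrected before the next stage runs, which is the kind of interpretability a modular design allows.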
Community
ScreenCoder is a modular multi-agent framework that advances UI-to-code generation by integrating visual grounding, hierarchical planning, and adaptive code synthesis.
Try it at: https://huggingface.co/spaces/Jimmyzheng-10/ScreenCoder
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction (2025)
- PresentAgent: Multimodal Agent for Presentation Video Generation (2025)
- MLLM-Based UI2Code Automation Guided by UI Layout Information (2025)
- SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents (2025)
- GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (2025)
- LOCOFY Large Design Models -- Design to code conversion solution (2025)
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation (2025)
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/screencoder-advancing-visual-to-code-generation-for-front-end-automation-via-modular-multimodal-agents
Get this paper in your agent:
hf papers read 2507.22827
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash