We propose a framework, ECHO, that distills collective discussion about a new generative model into a structured benchmark. As a case study, we apply ECHO to GPT-4o Image Gen on Twitter/X. Here, we display a few diverse and novel tasks surfaced by ECHO.
Recent advances in image generation, often driven by proprietary systems like GPT-4o Image Gen, regularly introduce new capabilities that reshape how users interact with these models. Existing benchmarks often lag behind and fail to capture these emerging use cases, leaving a gap between community perceptions of progress and formal evaluation. To address this, we present ECHO, a framework for constructing benchmarks directly from real-world evidence of model use: social media posts that showcase novel prompts and qualitative user judgments. Applying this framework to GPT-4o Image Gen, we construct a dataset of over 35,000 prompts curated from such posts. Our analysis shows that ECHO (1) discovers creative and complex tasks absent from existing benchmarks, such as re-rendering product labels across languages or generating receipts with specified totals, (2) more clearly distinguishes state-of-the-art models from alternatives, and (3) surfaces community feedback that we use to inform the design of metrics for model quality (e.g., measuring observed shifts in color, identity, and structure).
To collect these examples, we develop a framework called ECHO: Extracting Community Hatched Observations. We design this framework to address a number of challenges inherent to social media. Click through the tabs below to learn more about each step.
Large-scale collection is bottlenecked by a volume-relevance tradeoff. Querying with broader keywords lowers the average post relevance, while narrower keywords quickly exhaust the available post pool. We therefore implement a two-stage pipeline: we first query for a large volume of posts, then use an LLM to filter out irrelevant ones.
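As a rough illustration, here is a minimal sketch of the collect-then-filter pattern. The `search_posts` callable stands in for whatever social media search API is available, and the relevance check uses an OpenAI-style chat client with an illustrative prompt and model name; none of these details are our exact implementation.

```python
# Sketch of the two-stage collect-then-filter pattern. `search_posts` is a
# stand-in for a social media search API; the relevance check below uses an
# OpenAI-style chat client with a placeholder model and prompt.
from openai import OpenAI

client = OpenAI()

BROAD_QUERIES = ["4o image gen", "gpt-4o image", "image gen prompt"]  # broad, high-volume keywords

def is_relevant(post_text: str) -> bool:
    """Ask an LLM whether a post actually documents a prompt or output for the target model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": "Does this post share a prompt or generated image for "
                       "GPT-4o Image Gen? Answer YES or NO.\n\n" + post_text,
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def collect(search_posts, max_posts=10_000):
    # Stage 1: cast a wide net with broad keywords.
    candidates = [p for q in BROAD_QUERIES for p in search_posts(q, limit=max_posts)]
    # Stage 2: discard irrelevant posts with the LLM filter.
    return [p for p in candidates if is_relevant(p["text"])]
```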
Posts can be context dependent. For example, a user may write "prompt below" in the first post, then include the actual prompt text in a reply. To extract self-contained prompts, our framework attempts to collect as much of the reply tree as possible, then uses this full context when processing posts into samples.
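As a toy sketch, flattening a reply tree into one context string might look like the following; the nested-dict post structure is hypothetical, since real APIs represent threads differently.

```python
# Sketch of turning a reply tree into self-contained context. The nested-dict
# post structure here is hypothetical; real APIs return threads differently.
def flatten_thread(post: dict, depth: int = 0) -> str:
    """Depth-first walk over a post and its replies, concatenating the text so
    that e.g. a "prompt below" post and the reply holding the actual prompt
    end up in the same context string."""
    lines = [("  " * depth) + f'@{post["author"]}: {post["text"]}']
    for reply in post.get("replies", []):
        lines.append(flatten_thread(reply, depth + 1))
    return "\n".join(lines)

thread = {
    "author": "artist", "text": "prompt below",
    "replies": [{"author": "artist", "text": "A watercolor fox reading a newspaper", "replies": []}],
}
print(flatten_thread(thread))
```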
Useful data exists in non-standard formats. The output image could be the first or the last in a series of images, the prompt may be written in an incomplete fill-in-the-blank format, or data may be embedded in a screenshot. We process these cases with a VLM, which is responsible for classifying input vs. output images, filling in blanks, or parsing screenshots.
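A hedged sketch of this normalization step is below: given a post's text and attached images, a vision-language model labels each image as an input or an output and rewrites the prompt as a self-contained instruction. The JSON schema, prompt wording, and model name are illustrative, not our exact setup.

```python
# Sketch of the VLM normalization step: label attached images as INPUT/OUTPUT,
# fill in blanks, and transcribe prompts embedded in screenshots. The schema
# and model name are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def normalize_post(text: str, image_urls: list[str]) -> dict:
    content = [{
        "type": "text",
        "text": (
            "For the post below, label each attached image as INPUT or OUTPUT, "
            "and rewrite the prompt as a complete, self-contained instruction "
            "(fill in any blanks, transcribe any prompt shown in a screenshot). "
            'Reply as JSON: {"image_roles": [...], "prompt": "..."}.\n\n' + text
        ),
    }]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder VLM
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": content}],
    )
    return json.loads(resp.choices[0].message.content)
```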
Data quality varies widely. A user may only offer general commentary, or they may document their input prompt exactly. We separate prompts into two groups: moderate-quality ones suited for analysis, and high-quality ones appropriate for benchmarking.
Using the ECHO framework, we collect prompts that are highly diverse and closer to natural user language. We show a t-SNE visualization comparing the prompts from ECHO and prior text-to-image or image-to-image datasets. Prior datasets look quite different: they often contain keyword lists tailored to Stable Diffusion (Pick-a-Pic, ImageReward), templated text with fixed structures (GenEval), or limited task types characterized by a small set of first bigrams (GEdit, MagicBrush, InstructPix2Pix). In contrast, prompts from ECHO are longer and specify a broader range of tasks. Hover over each point to view the prompts, and click on each legend item to toggle each dataset.
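For readers who want to build a similar plot, here is a minimal sketch; the embedding model, placeholder prompt lists, and t-SNE settings are all assumptions rather than our exact setup.

```python
# Minimal sketch of a prompt-embedding t-SNE; replace the placeholder lists
# with the full prompt sets from each dataset.
import plotly.express as px
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

datasets = {
    "ECHO": [  # placeholder examples in the style of community prompts
        "Re-render this product label in Japanese, keeping the layout intact.",
        "Generate a receipt from a coffee shop with a total of $14.50.",
        "Turn this photo of my dog into a 1990s anime still.",
    ],
    "GenEval": [  # placeholder examples of templated prompts
        "a photo of a red cube and a blue sphere",
        "a photo of two cats and one dog",
        "a photo of a green bench",
    ],
}
texts = [p for prompts in datasets.values() for p in prompts]
labels = [name for name, prompts in datasets.items() for _ in prompts]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
coords = TSNE(n_components=2, perplexity=min(30, len(texts) - 1)).fit_transform(embeddings)

fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels, hover_name=texts)
fig.show()
```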
ECHO not only extracts prompts but also community feedback, i.e., qualitative comments on generated outputs. We use an LLM to annotate whether each comment discusses a success or a failure, and to extract keywords. We visualize the keywords of all failures in the following word cloud. Select from the dropdown to explore some of the suggested keywords, which highlight the attributes that users are sensitive to (e.g., "identity", "aspect ratio", "proportions", "text accuracy", "color balance", "coherency"). You can also click the word cloud to explore examples.
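Given such annotations, rendering a failure word cloud is straightforward. The sketch below assumes each comment has already been tagged with a success/failure label and keywords by the LLM step; the records and the use of the `wordcloud` package are illustrative.

```python
# Sketch of the failure word cloud: count keywords over failure comments and
# render them with the wordcloud package. The records below are placeholders
# standing in for LLM-annotated community feedback.
from collections import Counter
from wordcloud import WordCloud

annotated = [
    {"label": "failure", "keywords": ["identity", "face"]},
    {"label": "failure", "keywords": ["aspect ratio"]},
    {"label": "success", "keywords": ["text accuracy"]},
]

counts = Counter(kw for c in annotated if c["label"] == "failure" for kw in c["keywords"])
WordCloud(width=800, height=400).generate_from_frequencies(counts).to_file("failures.png")
```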
We then compare the performance of a range of models on ECHO samples, for both image-to-image and text-to-image tasks. We evaluate both open-source (Anole, Bagel) and closed-source (4o Image Gen, Nano Banana, Gemini 2.0 Flash) unified models. We also evaluate the most naive implementation of a "unified model": GPT-4o chained to DALL-E 3 (LLM+Diffusion). Finally, we compare against a state-of-the-art image editing model (Flux Kontext). We report the win rate computed via an ensemble of VLM judges, following the "single answer grading" setup from MT-Bench.
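To make the scoring concrete, here is a small sketch of how single-answer grades can be turned into pairwise win rates. The aggregation used here (averaging grades across judges and counting ties as half a win) is an assumption for illustration, not necessarily our exact recipe.

```python
# Sketch of turning single-answer judge grades into pairwise win rates.
# scores[judge][model] is a list of 1-10 grades, one per benchmark sample.
import numpy as np

def win_rate(scores: dict, model_a: str, model_b: str) -> float:
    judges = list(scores.keys())
    a = np.mean([scores[j][model_a] for j in judges], axis=0)  # ensemble-averaged grades
    b = np.mean([scores[j][model_b] for j in judges], axis=0)
    wins = (a > b).sum() + 0.5 * (a == b).sum()                # count ties as half a win
    return wins / len(a)

scores = {  # placeholder grades for two judges, two models, three samples
    "judge_1": {"model_x": [8, 6, 9], "model_y": [7, 6, 5]},
    "judge_2": {"model_x": [7, 5, 9], "model_y": [6, 5, 6]},
}
print(win_rate(scores, "model_x", "model_y"))  # 0.833...
```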
We also design several specialized automated metrics, inspired by the failure categories discovered in the community feedback: color shift magnitude, face identity similarity, structure distance, and text rendering accuracy. For each metric, we use an LLM to identify the samples where it applies, then compute the metric over those samples. While 4o Image Gen may be proficient at overall instruction following, users also notice that it exhibits large shifts in color and face identity, which we validate and quantify. Click the dropdown to take a look at each metric, and toggle the slider to see example model outputs given the same input!
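As one example, a color shift metric could be implemented as the mean per-pixel difference in CIELAB between the input and output images. Treating it as a dense ΔE average is an assumption for this sketch, not necessarily the exact definition used for the figure.

```python
# Illustrative color shift metric: mean per-pixel CIELAB distance between the
# input image and the (resized) output image.
import numpy as np
from skimage import color, io, transform

def color_shift(input_path: str, output_path: str) -> float:
    src = io.imread(input_path)[..., :3] / 255.0
    out = io.imread(output_path)[..., :3] / 255.0
    out = transform.resize(out, src.shape[:2])           # align resolutions
    delta = color.rgb2lab(src) - color.rgb2lab(out)      # per-pixel Lab difference
    return float(np.linalg.norm(delta, axis=-1).mean())  # mean ΔE over the image
```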
Since the ECHO framework operates at a large scale, it can give insight into activity surrounding a model of interest. We plot the timestamps of posts, after relevance and quality filtering, bucketed by day. Spikes in activity (highlighted in yellow) often align with real-world events, including: the day after the 4o Image Gen model release (Mar 26), the day of the o3 model release (Apr 16), and the day after the 4o Image Gen API release (Apr 24).
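The bucketing itself is simple; a pandas sketch is below, where the timestamps are placeholders and the two-standard-deviation spike rule is an illustrative choice rather than the rule behind the highlighted figure.

```python
# Sketch of the daily activity analysis: bucket post timestamps by day and
# flag days that exceed a simple spike threshold (illustrative rule).
import pandas as pd

timestamps = pd.to_datetime(["2025-03-26 10:02", "2025-03-26 14:30", "2025-04-16 09:12"])  # placeholders
daily = pd.Series(1, index=timestamps).resample("D").sum()

spikes = daily[daily > daily.mean() + 2 * daily.std()]
print(daily)
print("spike days:", list(spikes.index.date))
```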
We thank Stephanie Fu, Michelle Li, and Alexander Pan for their helpful feedback. We also thank the folks at Stochastic Labs for previewing early prototypes of this work. Finally, we extend a special thank you to Lisa Dunlap for entertaining many extensive discussions on evaluations.
Grace really enjoys web design, so here's a short epilogue discussing all the bells and whistles. It turns out, it's possible to make the Nerfies template feel more "bloggy" simply by aligning text to the left instead of center, then moving paragraphs before figures instead of after. We then used Plotly for all the interactive visualizations. Final touches include: a loading screen for the visualizations, a "back to top" button anchored on the bottom right, and a lot of breakpoint magic for mobile viewing. Enjoy!
@article{ge2025echo,
title={Constantly Improving Image Models Need Constantly Improving Benchmarks},
author={Jiaxin Ge and Grace Luo and Heekyung Lee and Nishant Malpani and Long Lian and XuDong Wang and Aleksander Holynski and Trevor Darrell and Sewon Min and David M. Chan},
journal={arXiv preprint arXiv:2510.15021},
year={2025}
}