Quantifying Stochasticity: A Multi-Run Comparative Analysis of LLM Consistency in PHP Filament Implementations
In the field of Large Language Model (LLM) evaluation, a fundamental challenge remains: the non-deterministic nature of generative outputs. When a single prompt is executed multiple times against the same model, the variance in the resulting codebases can range from negligible syntax shifts to catastrophic logic failures. This study investigates this phenomenon by subjecting six distinct LLMs to repeated trials of a complex, domain-specific task: implementing PHP Enums within the Filament Admin Panel framework.
The Benchmark: Testing Filament Standard Adherence
The objective of the benchmark was to evaluate whether the models could correctly implement Filament-specific standards, in particular the integration of PHP Enums within forms and tables. A key feature of Filament is that PHP Enums can drive UI elements such as badges and colors when they implement specific interfaces.
The prompt required the model to implement an Enum-based solution where the UI automatically reflects the Enum's state (e.g., by implementing Filament's HasColor and HasLabel interfaces, which expose getColor() and getLabel()). The test was designed to see whether the models could infer the need for these interfaces without explicit instructions to use the badge() component, probing how well they know the Filament ecosystem.
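For readers unfamiliar with the pattern, here is a minimal sketch of the kind of Enum the prompt was looking for. The enum name, cases, labels, and colors are hypothetical, and the two interfaces are declared locally as stand-ins so the snippet runs without Filament installed. Their signatures follow Filament v3's contracts as I understand them, so check them against your installed version.

```php
<?php

declare(strict_types=1);

// Local stand-ins (assumption): in a real project, import
// Filament\Support\Contracts\HasLabel and Filament\Support\Contracts\HasColor
// instead of declaring these two interfaces yourself.
interface HasLabel
{
    public function getLabel(): ?string;
}

interface HasColor
{
    public function getColor(): string|array|null;
}

// Hypothetical enum: the cases, labels, and colors are illustrative only.
enum OrderStatus: string implements HasLabel, HasColor
{
    case Pending = 'pending';
    case Shipped = 'shipped';
    case Cancelled = 'cancelled';

    // Human-readable text shown instead of the raw backing value.
    public function getLabel(): ?string
    {
        return match ($this) {
            self::Pending => 'Awaiting shipment',
            self::Shipped => 'Shipped',
            self::Cancelled => 'Cancelled',
        };
    }

    // Color name the UI can use when rendering the value as a badge.
    public function getColor(): string|array|null
    {
        return match ($this) {
            self::Pending => 'warning',
            self::Shipped => 'success',
            self::Cancelled => 'danger',
        };
    }
}
```

With both methods in place, the label and color live on the Enum itself rather than being repeated in every form and table definition.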
To ensure objective measurement, I implemented automated tests to verify if the Enums were being utilized correctly within the generated code.
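The exact tests used in the benchmark are not reproduced here. As a rough sketch of what such a check could look like, the snippet below inspects a generated Enum and reports which required contracts it does not implement. The function name missingContracts and both sample enums are hypothetical, and the interfaces are local stand-ins for Filament's contracts.

```php
<?php

declare(strict_types=1);

// Local stand-ins (assumption) for Filament\Support\Contracts\HasLabel / HasColor.
interface HasLabel
{
    public function getLabel(): ?string;
}

interface HasColor
{
    public function getColor(): string|array|null;
}

// Hypothetical "generated" enum that satisfies both contracts.
enum CompleteStatus: string implements HasLabel, HasColor
{
    case Active = 'active';

    public function getLabel(): ?string
    {
        return 'Active';
    }

    public function getColor(): string|array|null
    {
        return 'success';
    }
}

// Hypothetical "generated" enum that implements only the color contract.
enum PartialStatus: string implements HasColor
{
    case Active = 'active';

    public function getColor(): string|array|null
    {
        return 'success';
    }
}

/**
 * Returns the required interfaces that $enumClass does not implement.
 * Anything that is not an enum is reported as missing every contract.
 */
function missingContracts(string $enumClass, array $required): array
{
    if (!enum_exists($enumClass)) {
        return $required;
    }

    // class_implements() returns the implemented interface names.
    $implemented = class_implements($enumClass);

    return array_values(array_diff($required, $implemented));
}
```

Calling missingContracts(PartialStatus::class, [HasLabel::class, HasColor::class]) flags the missing label contract, which is the kind of partial implementation a pass/fail test needs to catch.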
Cross-Model Performance: Frontier vs. Emerging Models
The evaluation involved six models: Claude 3 Opus, GPT-5.5, Kimi K2.6, Gemini 3.1 Pro, GLM, and Minimax. The results revealed a clear hierarchy in reasoning capabilities and framework-specific knowledge.
The High-Performers: Frontier Models
The "Frontier" models—Claude 3 Opus and GPT-5.5—demonstrated near-perfect reliability. Across all three trials, both models successfully implemented the Enum logic without structural errors. Kimi K2.6 also performed at a high level, matching the accuracy of the top-tier models.
The Intermediate Tier: Gemini 3.1 Pro
Gemini 3.1 Pro showed significant competence but lacked total consistency. In two out of three trials, the model succeeded. However, the third trial resulted in a failure. Interestingly, the failure was not due to a misunderstanding of the Enum logic, but rather a regression in the broader codebase: the model introduced an incorrect namespace, leading to a 500 error that prevented the page from loading.
The Low-Performers: GLM and Minimax
The models from the GLM and Minimax families struggled significantly with the task:
- GLM: Demonstrated sporadic success. In one instance, it successfully implemented HasColor but failed to implement HasLabel, resulting in a broken UI where the label defaulted to the raw Enum value. In subsequent runs, the model failed to utilize Enums entirely.
- Minimax: Failed all trials. The generated code lacked any awareness of the Filament framework, rendering the feature non-functional.
Intra-Model Variance: The "Micro-Decision" Problem
A secondary, more nuanced hypothesis was tested: even when a model succeeds in the primary task, how much does the code vary between runs? To analyze this, I utilized Codex to perform a file-by-file, line-by-line comparison of the codebases, essentially running a git diff between the different attempts.
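The comparison itself was delegated to Codex, so the snippet below is not the method used in this study. It is a simplified sketch of the same idea: given two runs as maps of file name to source text, list the lines that appear in one run but not the other. It is order-insensitive and ignores duplicate lines, so it is much cruder than a real git diff. The names compareRuns and normalizeLines, and the sample file contents, are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Splits source text into trimmed, non-empty lines so that
 * whitespace-only differences are ignored.
 */
function normalizeLines(string $source): array
{
    $lines = array_map('trim', preg_split('/\R/', $source));

    return array_values(array_filter($lines, fn (string $line): bool => $line !== ''));
}

/**
 * Compares two runs (file name => source text) and returns, per file,
 * the lines found only in run A ("removed") and only in run B ("added").
 * Files with no differences are left out of the report.
 */
function compareRuns(array $runA, array $runB): array
{
    $files = array_unique(array_merge(array_keys($runA), array_keys($runB)));
    sort($files);

    $report = [];
    foreach ($files as $file) {
        $a = isset($runA[$file]) ? normalizeLines($runA[$file]) : [];
        $b = isset($runB[$file]) ? normalizeLines($runB[$file]) : [];

        $removed = array_values(array_diff($a, $b));
        $added = array_values(array_diff($b, $a));

        if ($removed !== [] || $added !== []) {
            $report[$file] = ['removed' => $removed, 'added' => $added];
        }
    }

    return $report;
}

// Illustrative input only: two runs that differ in a single textarea setting.
$runA = [
    'Form.php' => "TextInput::make('name')\nTextarea::make('notes')->rows(8)",
    'Status.php' => "case Draft = 'draft';",
];
$runB = [
    'Form.php' => "TextInput::make('name')\nTextarea::make('notes')->rows(5)",
    'Status.php' => "case Draft = 'draft';",
];

$report = compareRuns($runA, $runB);
```

Here the report contains only Form.php, with the rows(8) line listed as removed and the rows(5) line as added, while the identical Status.php is omitted.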
GPT-5.5: High Consistency, Low Variance
The analysis of GPT-5.5 revealed a high degree of consistency. The Enum implementation remained identical across all three attempts. The differences were confined to minor UI/UX "micro-decisions," such as:
- Adjusting textarea rows (e.g., rows: 8).
- Implementing batchSortable properties.
- Slight variations in phrasing within the UI.
Claude 3 Opus: Higher Stochasticity
In contrast, Claude 3 Opus exhibited much higher variance. While the core Enum contract remained intact, the model made significant structural changes between runs, including:
- Return Type Annotations: Some attempts included explicit return types, while others provided simple strings.
- Laravel 13 Integration: One attempt utilized the fillable property as a PHP attribute (consistent with newer Laravel standards), while others used the traditional array format.
- UX Logic: One attempt proactively added a filter() to the table component, a feature not explicitly requested but functionally useful.
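Drift in return types is one of the easier micro-decisions to detect mechanically. The sketch below uses PHP's reflection API to read the declared return type of the same method in two runs. The enums StatusRunA and StatusRunB and the helper declaredReturnType are hypothetical, and the specific types shown (?string versus string) are an assumption made for illustration, not the actual difference observed in the trials.

```php
<?php

declare(strict_types=1);

// Hypothetical output of one run: nullable return type.
enum StatusRunA: string
{
    case Draft = 'draft';

    public function getLabel(): ?string
    {
        return 'Draft';
    }
}

// Hypothetical output of another run: same behavior, stricter return type.
enum StatusRunB: string
{
    case Draft = 'draft';

    public function getLabel(): string
    {
        return 'Draft';
    }
}

/**
 * Returns the declared return type of a method as text,
 * or "(none)" when the method declares no return type.
 */
function declaredReturnType(string $class, string $method): string
{
    $type = (new ReflectionMethod($class, $method))->getReturnType();

    return $type === null ? '(none)' : (string) $type;
}

// True when the two runs declare different return types for getLabel().
$drifted = declaredReturnType(StatusRunA::class, 'getLabel')
    !== declaredReturnType(StatusRunB::class, 'getLabel');
```

A check like this passes or fails independently of whether the feature works, which is the point: both enums behave identically at runtime, yet their signatures differ.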
Resource Utilization and Efficiency Metrics
Beyond accuracy, the evaluation looked at the operational cost and "usage" impact on the models' 5-hour rate limits.
| Metric | GPT-5.5 | Claude 3 Opus |
|---|---|---|
| 5-Hour Limit Consumption | 15% | 28% |
| Inference Speed | Slower | Faster |
| Token Efficiency | High (Fewer tokens used) | Low (Higher token consumption) |
An interesting inverse relationship was observed: GPT-5.5 was slower in terms of time-to-first-token but was more efficient with its token usage, consuming only 15% of the limit. Claude 3 Opus was significantly faster but much more "expensive" in terms of the 5-hour usage quota, consuming 28%.
Conclusion: The Necessity of Human-in-the-Loop
The data suggests that even the most advanced LLMs are subject to unpredictable micro-decisions. While a model might pass an automated test for logic, it may simultaneously change return types, default values, or UI properties (like rows or sortable) without warning.
For developers, this means that "passing" code is not enough. We cannot rely on the LLM to maintain a consistent architectural style or UX standard, even across repeated runs of the same prompt. Continuous verification of the "small details" is mandatory to prevent technical debt and inconsistent user interfaces.