Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models.
OpenAI's new GPT-5 flagship failed half of my programming tests. Previous OpenAI releases have had just about perfect results. Now that OpenAI has enabled fallbacks to other LLMs, there are options.
Local beats the cloud ...