# NEW OpenAI GPT-4.1 Models Tested: Comprehensive Performance Comparison
In this video, I break down OpenAI's latest release - the GPT-4.1 family of models (4.1, 4.1 mini, and 4.1 nano) - and put them through rigorous real-world testing against their predecessors.
✅ What's covered:
- Full pricing comparison between all models
- Performance on harmful question detection tests
- Named entity recognition capabilities
- SQL code generation accuracy
- Retrieval-augmented generation (RAG) performance
The results? While GPT-4.1 shows significant improvements in code generation, there are some surprising regressions in other areas. GPT-4.1 mini offers excellent value, exceeding expectations in SQL generation, while the ultra-affordable 4.1 nano shows promise but comes with limitations.
Whether you're considering using these models for production applications or just curious about the latest AI advancements, this detailed breakdown will help you understand the practical differences between OpenAI's offerings.
Leave a comment with your experiences using the new 4.1 models!
**00:00 - Introduction & GPT-4.1 Announcement**
* 00:01 - Greeting and video topic introduction.
* 00:04 - Announcing OpenAI's release of GPT-4.1.
* 00:08 - Mentioning the new model family: 4.1, 4.1-mini, 4.1-nano.
* 00:16 - Highlighting key features: a 1-million-token context window and improvements in math and coding.
**00:23 - Pricing Breakdown**
* 00:25 - Starting the pricing comparison.
* 00:26 - GPT-4.1 pricing ($2 input / $8 output per 1M tokens).
* 00:31 - GPT-4.1-mini pricing ($0.40 input / $1.60 output per 1M tokens).
* 00:38 - GPT-4.1-nano pricing ($0.10 input / $0.40 output per 1M tokens).
* 00:44 - Noting pricing oddities when comparing 4.1 and 4o with their mini versions (a quick cost-estimation sketch follows this chapter).
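For anyone who wants to sanity-check the prices above, here is a minimal back-of-envelope cost sketch in Python. The per-1M-token rates are the ones quoted in the video; the request sizes in the example are made up purely for illustration.

```python
# Back-of-envelope cost estimate using the per-1M-token rates quoted above.
# The token counts in the example are hypothetical.
PRICES_PER_1M = {                      # (input USD, output USD) per 1M tokens
    "gpt-4.1":      (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request."""
    in_rate, out_rate = PRICES_PER_1M[model]
    return input_tokens / 1_000_000 * in_rate + output_tokens / 1_000_000 * out_rate

# Example: a 2,000-token prompt that produces a 500-token answer.
for model in PRICES_PER_1M:
    print(f"{model}: ${estimate_cost(model, 2_000, 500):.5f}")
```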
**01:00 - Testing Methodology**
* 01:01 - Introduction to the testing rubric.
* 01:03 - Listing the tests: Harmful Question Detection, Named Entity Recognition (NER), SQL Code Generation, Retrieval Augmented Generation (RAG).
* 01:21 - Explaining the purpose and design of the tests (real-world use cases, difficulty, reliability).
**01:43 - Sponsor Break (Prompt Judy)**
* 01:44 - Start of the Prompt Judy promotion.
* 01:47 - Request for channel subscription.
* 01:58 - Overview of Prompt Judy platform features (evaluations, testing).
**02:30 - Test 1: Harmful Question Detection**
* 02:31 - Introducing the Harmful Question Detection test.
* 02:40 - Showing and explaining the prompt for this test (an illustrative sketch follows this chapter).
* 03:10 - Presenting the results (GPT-4.1 & 4o: 100%, 4o-mini: 95%, 4.1-mini: 90%, 4.1-nano: 60%).
* 03:25 - Analyzing errors made by 4.1-mini and 4o-mini.
* 03:45 - Analyzing errors made by 4.1-nano and its poor performance.
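The video's exact harmful-question prompt isn't reproduced in this description, but the shape of the test is easy to picture. Below is a minimal sketch using the official OpenAI Python SDK; the system prompt, the HARMFUL/SAFE label scheme, and the example question are assumptions, not the video's rubric.

```python
from openai import OpenAI  # official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical classifier prompt; the video's actual prompt is not shown here.
SYSTEM = (
    "You are a content-safety classifier. "
    "Answer with exactly one word: HARMFUL or SAFE."
)

def classify(question: str, model: str = "gpt-4.1-nano") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(classify("How do I pick a strong password?"))  # expected: SAFE
```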
**04:13 - Test 2: Named Entity Recognition (NER)**
* 04:14 - Introducing the NER test (structured JSON extraction).
* 04:16 - Showing and explaining the prompt for this test (entities, requirements, ISO codes); an illustrative sketch follows this chapter.
* 05:27 - Presenting the results (GPT-4o: 95.24%, 4.1: 80.95%, 4.1-mini: 66.67%, 4o-mini: 61.90%, 4.1-nano: 42.86%).
* 05:37 - Analyzing GPT-4o's specific error (Shanghai province code).
* 06:03 - Noting the correlation between model cost/size and performance on this test.
* 06:12 - Showing examples of errors made by the other models (province codes, translations).
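As a rough idea of what a structured-extraction test like this looks like in code, here is a hedged sketch: the entity list, JSON shape, and ISO-code requirement mirror what the video describes, but the exact prompt wording and example text are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical extraction prompt; the JSON shape and ISO-code requirement
# follow the video's description, but the wording is assumed.
PROMPT = """Extract all people, organizations, and locations from the text.
Return JSON only, shaped as:
{"people": [...], "organizations": [...],
 "locations": [{"name": ..., "country_iso": ..., "province_code": ...}]}"""

def extract_entities(text: str, model: str = "gpt-4.1-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},  # request valid JSON output
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return json.loads(resp.choices[0].message.content)

print(extract_entities("Alice flew from Shanghai to meet the Acme Corp team."))
```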
**06:41 - Test 3: SQL Code Generation**
* 06:42 - Introducing the SQL Code Generation test.
* 06:46 - Showing and explaining the prompt for this test (rules, schema); see the sketch after this chapter.
* 07:25 - Presenting the results (Surprising winner: 4.1-mini at 100%).
* 07:51 - Ranking the other models (4.1: 95%, 4o: 85%, 4o-mini & 4.1-nano: 80%).
* 08:10 - Discussing how results support OpenAI's claim of better code generation in 4.1 models.
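For context on what an SQL-generation test involves, here is a minimal sketch; the schema, rules, and question are stand-ins rather than the ones used in the video.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical schema and rules; the video's actual schema is not reproduced.
SCHEMA = """CREATE TABLE customers (id INT, name TEXT, country TEXT);
CREATE TABLE orders (id INT, customer_id INT, total NUMERIC, created_at DATE);"""

SYSTEM = (
    "You write SQL for the schema below. "
    "Return a single SQL statement and nothing else.\n\n" + SCHEMA
)

def to_sql(question: str, model: str = "gpt-4.1-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

print(to_sql("Total revenue per country in 2024, highest first."))
```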
**08:34 - Test 4: Retrieval Augmented Generation (RAG)**
* 08:35 - Introducing the RAG test.
* 08:36 - Showing and explaining the prompt for this test (rules, citations, formatting); see the sketch after this chapter.
* 09:21 - Presenting the results (4o-mini & 4o: 100%, 4.1: 95%, 4.1-nano: 93.25%, 4.1-mini: 80%).
* 09:49 - Analyzing the error made by GPT-4.1 (hallucination about o1 vs GPT-4o).
* 10:17 - Analyzing the error made by GPT-4.1-mini (wrong language response).
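To show roughly how a citation-style RAG test is wired up, here is a hedged sketch: the retrieved passages are hard-coded stand-ins (a real pipeline would pull them from a retriever or vector store), and the citation rules are assumptions modeled on the video's description.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical retrieved passages; a real pipeline would fetch these from a
# retriever. The citation format is assumed, modeled on the video's rules.
DOCS = {
    "doc1": "GPT-4.1 supports a context window of up to one million tokens.",
    "doc2": "GPT-4.1-nano is the lowest-priced model in the 4.1 family.",
}

def answer(question: str, model: str = "gpt-4.1") -> str:
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in DOCS.items())
    system = (
        "Answer using ONLY the passages below and cite them like [doc1]. "
        "If the answer is not in the passages, say you don't know.\n\n" + context
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(answer("How large is GPT-4.1's context window?"))
```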
**10:25 - Overall Summary & Conclusion**
* 10:26 - Summarizing performance: 4.1 generally better at coding, but regressions in RAG/NER compared to 4o.
* 10:40 - Final thoughts and call for viewer comments.
* 10:46 - Sign-off.
#OpenAI #GPT41 #AITesting #MachineLearning #LLM #AIComparison