Introduction
When you ask a top-tier AI model whether it's safe to pass on a double-yellow line, it answers correctly 95% of the time. When you ask it how many questions are on the Vermont DMV permit test, it gets it wrong every time. The leading commercial AIs can pass most U.S. state permit tests; where they fail, they fail in a specific and revealing way.
In May 2026, dmvpermit.com ran 30-question samples from every U.S. state's official DMV permit test bank (1,530 questions across 51 jurisdictions) through three AI models: two leading commercial systems (OpenAI's GPT-4.1 and Google's Gemini 2.5 Flash) and a 30-billion-parameter open-weight reasoning model (NVIDIA's nemotron-3-nano:30b) running locally on a DGX Spark. Every model received an identical prompt at temperature 0.0. Pass threshold: 80%, the typical state-DMV passing score.
We ran the experiment to answer two questions a journalist, a permit-prep app owner, and an AI-skeptic parent might all care about: can a chatbot pass a state permit test, and if it mostly can, where does it still fall short? The answer to both turns out to be more interesting than a simple pass/fail.
The topline
- OpenAI GPT-4.1 averaged 90.1% across 51 states and would pass the test in 49 of them — a 96% pass rate.
- Google Gemini 2.5 Flash averaged 80.0%, exactly at the threshold, and passed in 32 of 51 states (a 63% pass rate).
- nemotron-3-nano:30b, the open-weight 30-billion-parameter reasoning model, averaged 42.8% and passed in zero of the 46 states we evaluated.
- Of the two states GPT-4.1 failed (Iowa at 73%, Vermont at 77%), every single missed question was about state-specific administrative procedure — not about driving rules.
AI vs DMV — Leaderboard
148 state-model pairs across 51 states × 3 models. Pass threshold: 80%.
Overall by model
| Model | Avg score | States passed | Pass rate | Min — Max | N |
|---|---|---|---|---|---|
| openai/gpt-4.1 | 90.1% | 49/51 | 96% | 73% — 100% | 51 |
| gemini/gemini-2.5-flash | 80.0% | 32/51 | 63% | 10% — 100% | 51 |
| nemotron-3-nano:30b | 42.8% | 0/46 | 0% | 30% — 67% | 46 |
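The summary columns in the table above can be recomputed from per-state correct counts. The following is a minimal sketch, not the published harness: the `StateResult` class and `summarize` function are hypothetical names introduced here, and the three example rows are openai/gpt-4.1 results copied from the per-state table rather than the full 51-state set.

```python
from dataclasses import dataclass

PASS_PERCENT = 80  # pass mark used throughout the article


@dataclass
class StateResult:
    """One state-model pair (hypothetical structure for this sketch)."""
    state: str
    correct: int
    total: int = 30

    @property
    def score(self) -> float:
        return self.correct / self.total

    @property
    def passed(self) -> bool:
        # Integer comparison avoids floating-point edge cases at exactly 80%.
        return 100 * self.correct >= PASS_PERCENT * self.total


def summarize(results):
    """Return (average score, states passed, pass rate, min score, max score)."""
    scores = [r.score for r in results]
    passed = sum(1 for r in results if r.passed)
    return (
        sum(scores) / len(scores),
        passed,
        passed / len(results),
        min(scores),
        max(scores),
    )


# Three openai/gpt-4.1 rows from the per-state table: California, Iowa, Vermont.
sample = [StateResult("CA", 30), StateResult("IA", 22), StateResult("VT", 23)]
avg, passed, rate, lo, hi = summarize(sample)
print(f"avg={avg:.1%} passed={passed}/{len(sample)} min={lo:.1%} max={hi:.1%}")
```

The average here is the mean of per-state scores, which equals the pooled accuracy only because every state uses the same 30-question sample size.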
Per-state, per-model
| State | Model | Score | Result |
|---|---|---|---|
| AK (Alaska) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| AK (Alaska) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| AK (Alaska) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| AL (Alabama) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| AL (Alabama) | nemotron-3-nano:30b | 40.0% (12/30) | FAIL |
| AL (Alabama) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| AR (Arkansas) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| AR (Arkansas) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| AR (Arkansas) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| AZ (Arizona) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| AZ (Arizona) | nemotron-3-nano:30b | 40.0% (12/30) | FAIL |
| AZ (Arizona) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| CA (California) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| CA (California) | nemotron-3-nano:30b | 56.7% (17/30) | FAIL |
| CA (California) | openai/gpt-4.1 | 100.0% (30/30) | PASS |
| CO (Colorado) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| CO (Colorado) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| CO (Colorado) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| CT (Connecticut) | gemini/gemini-2.5-flash | 76.7% (23/30) | FAIL |
| CT (Connecticut) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| CT (Connecticut) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| DC (District of Columbia) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| DC (District of Columbia) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| DC (District of Columbia) | openai/gpt-4.1 | 100.0% (30/30) | PASS |
| DE (Delaware) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| DE (Delaware) | nemotron-3-nano:30b | 43.3% (13/30) | FAIL |
| DE (Delaware) | openai/gpt-4.1 | 83.3% (25/30) | PASS |
| FL (Florida) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| FL (Florida) | nemotron-3-nano:30b | 33.3% (10/30) | FAIL |
| FL (Florida) | openai/gpt-4.1 | 80.0% (24/30) | PASS |
| GA (Georgia) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| GA (Georgia) | nemotron-3-nano:30b | 43.3% (13/30) | FAIL |
| GA (Georgia) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| HI (Hawaii) | gemini/gemini-2.5-flash | 76.7% (23/30) | FAIL |
| HI (Hawaii) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| HI (Hawaii) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| IA (Iowa) | gemini/gemini-2.5-flash | 80.0% (24/30) | PASS |
| IA (Iowa) | nemotron-3-nano:30b | 33.3% (10/30) | FAIL |
| IA (Iowa) | openai/gpt-4.1 | 73.3% (22/30) | FAIL |
| ID (Idaho) | gemini/gemini-2.5-flash | 66.7% (20/30) | FAIL |
| ID (Idaho) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| ID (Idaho) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| IL (Illinois) | gemini/gemini-2.5-flash | 66.7% (20/30) | FAIL |
| IL (Illinois) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| IL (Illinois) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| IN (Indiana) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| IN (Indiana) | nemotron-3-nano:30b | 40.0% (12/30) | FAIL |
| IN (Indiana) | openai/gpt-4.1 | 100.0% (30/30) | PASS |
| KS (Kansas) | gemini/gemini-2.5-flash | 66.7% (20/30) | FAIL |
| KS (Kansas) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| KS (Kansas) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| KY (Kentucky) | gemini/gemini-2.5-flash | 56.7% (17/30) | FAIL |
| KY (Kentucky) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| KY (Kentucky) | openai/gpt-4.1 | 80.0% (24/30) | PASS |
| LA (Louisiana) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| LA (Louisiana) | nemotron-3-nano:30b | 66.7% (20/30) | FAIL |
| LA (Louisiana) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| MA (Massachusetts) | gemini/gemini-2.5-flash | 70.0% (21/30) | FAIL |
| MA (Massachusetts) | nemotron-3-nano:30b | 46.7% (14/30) | FAIL |
| MA (Massachusetts) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| MD (Maryland) | gemini/gemini-2.5-flash | 80.0% (24/30) | PASS |
| MD (Maryland) | nemotron-3-nano:30b | 43.3% (13/30) | FAIL |
| MD (Maryland) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| ME (Maine) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| ME (Maine) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| ME (Maine) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| MI (Michigan) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| MI (Michigan) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| MI (Michigan) | openai/gpt-4.1 | 83.3% (25/30) | PASS |
| MN (Minnesota) | gemini/gemini-2.5-flash | 66.7% (20/30) | FAIL |
| MN (Minnesota) | nemotron-3-nano:30b | 33.3% (10/30) | FAIL |
| MN (Minnesota) | openai/gpt-4.1 | 83.3% (25/30) | PASS |
| MO (Missouri) | gemini/gemini-2.5-flash | 70.0% (21/30) | FAIL |
| MO (Missouri) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| MO (Missouri) | openai/gpt-4.1 | 80.0% (24/30) | PASS |
| MS (Mississippi) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| MS (Mississippi) | nemotron-3-nano:30b | 33.3% (10/30) | FAIL |
| MS (Mississippi) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| MT (Montana) | gemini/gemini-2.5-flash | 93.3% (28/30) | PASS |
| MT (Montana) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| MT (Montana) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| NC (North Carolina) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| NC (North Carolina) | nemotron-3-nano:30b | 53.3% (16/30) | FAIL |
| NC (North Carolina) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| ND (North Dakota) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| ND (North Dakota) | nemotron-3-nano:30b | 60.0% (18/30) | FAIL |
| ND (North Dakota) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| NE (Nebraska) | gemini/gemini-2.5-flash | 73.3% (22/30) | FAIL |
| NE (Nebraska) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| NE (Nebraska) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| NH (New Hampshire) | gemini/gemini-2.5-flash | 83.3% (25/30) | PASS |
| NH (New Hampshire) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| NH (New Hampshire) | openai/gpt-4.1 | 80.0% (24/30) | PASS |
| NJ (New Jersey) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| NJ (New Jersey) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| NJ (New Jersey) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| NM (New Mexico) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| NM (New Mexico) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| NM (New Mexico) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| NV (Nevada) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| NV (Nevada) | nemotron-3-nano:30b | 40.0% (12/30) | FAIL |
| NV (Nevada) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| NY (New York) | gemini/gemini-2.5-flash | 93.3% (28/30) | PASS |
| NY (New York) | nemotron-3-nano:30b | 36.7% (11/30) | FAIL |
| NY (New York) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| OH (Ohio) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| OH (Ohio) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| OH (Ohio) | openai/gpt-4.1 | 83.3% (25/30) | PASS |
| OK (Oklahoma) | gemini/gemini-2.5-flash | 96.7% (29/30) | PASS |
| OK (Oklahoma) | nemotron-3-nano:30b | 43.3% (13/30) | FAIL |
| OK (Oklahoma) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| OR (Oregon) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| OR (Oregon) | nemotron-3-nano:30b | 46.7% (14/30) | FAIL |
| OR (Oregon) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| PA (Pennsylvania) | gemini/gemini-2.5-flash | 100.0% (30/30) | PASS |
| PA (Pennsylvania) | nemotron-3-nano:30b | 43.3% (13/30) | FAIL |
| PA (Pennsylvania) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| RI (Rhode Island) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| RI (Rhode Island) | nemotron-3-nano:30b | 63.3% (19/30) | FAIL |
| RI (Rhode Island) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| SC (South Carolina) | gemini/gemini-2.5-flash | 96.7% (29/30) | PASS |
| SC (South Carolina) | nemotron-3-nano:30b | 53.3% (16/30) | FAIL |
| SC (South Carolina) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| SD (South Dakota) | gemini/gemini-2.5-flash | 93.3% (28/30) | PASS |
| SD (South Dakota) | nemotron-3-nano:30b | 46.7% (14/30) | FAIL |
| SD (South Dakota) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| TN (Tennessee) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| TN (Tennessee) | nemotron-3-nano:30b | 30.0% (9/30) | FAIL |
| TN (Tennessee) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
| TX (Texas) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| TX (Texas) | nemotron-3-nano:30b | 46.7% (14/30) | FAIL |
| TX (Texas) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| UT (Utah) | gemini/gemini-2.5-flash | 93.3% (28/30) | PASS |
| UT (Utah) | nemotron-3-nano:30b | 50.0% (15/30) | FAIL |
| UT (Utah) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| VA (Virginia) | gemini/gemini-2.5-flash | 100.0% (30/30) | PASS |
| VA (Virginia) | nemotron-3-nano:30b | 60.0% (18/30) | FAIL |
| VA (Virginia) | openai/gpt-4.1 | 93.3% (28/30) | PASS |
| VT (Vermont) | gemini/gemini-2.5-flash | 86.7% (26/30) | PASS |
| VT (Vermont) | openai/gpt-4.1 | 76.7% (23/30) | FAIL |
| WA (Washington) | gemini/gemini-2.5-flash | 10.0% (3/30) | FAIL |
| WA (Washington) | openai/gpt-4.1 | 90.0% (27/30) | PASS |
| WI (Wisconsin) | gemini/gemini-2.5-flash | 36.7% (11/30) | FAIL |
| WI (Wisconsin) | openai/gpt-4.1 | 83.3% (25/30) | PASS |
| WV (West Virginia) | gemini/gemini-2.5-flash | 36.7% (11/30) | FAIL |
| WV (West Virginia) | openai/gpt-4.1 | 96.7% (29/30) | PASS |
| WY (Wyoming) | gemini/gemini-2.5-flash | 90.0% (27/30) | PASS |
| WY (Wyoming) | openai/gpt-4.1 | 86.7% (26/30) | PASS |
Where AI fails: state-specific minutiae
The most interesting finding from the study is not that GPT-4.1 passes most states. It's where the model breaks down on the two states it fails.
On Iowa, GPT-4.1 missed 8 of 30 questions. The questions it got wrong include:
- “What is the minimum age to apply for a learner's permit in Iowa?”
- “What is the minimum age to apply for a school permit in Iowa?”
- “You are a 16-year-old with a provisional license. You pick up a friend who is 17 years old. What must you do?”
- “A 16-year-old has an intermediate license. They want to drive a friend home at 11:30 PM. What should they do?”
- “Who must wear a safety belt in an Iowa vehicle?”
- “How many questions are on the Iowa knowledge test, and how many must you answer correctly to pass?”
On Vermont, GPT-4.1 missed 7 of 30 questions. All seven were the same kind of administrative question rephrased: how many questions are on the Vermont permit test, what passing score is required, and where exactly you go to take it.
The pattern is consistent. When GPT-4.1 misses a permit-test question, it's overwhelmingly because the question asks about a number, an age, a curfew time, a passenger-restriction rule, or a procedural detail that varies state-by-state. The model cleanly handles “what does a flashing red light mean,” “who has the right of way at a four-way stop,” and “at what speed should you slow down for ice.” It does not reliably know that Iowa offers a school permit at age 14 and two months, that Vermont's permit test has 20 questions requiring 16 correct, or that an intermediate license in Iowa carries an 11 PM driving curfew with sibling-passenger exceptions.
That is, in retrospect, exactly what we should expect. AI models trained on the public internet have absorbed the universal driving rulebook — the principles every state shares because traffic physics doesn't change at state lines. They have not absorbed, and probably never will memorize cleanly, the 51-jurisdiction patchwork of administrative law that surrounds licensing. State-specific knowledge is exactly what permit-prep apps add over what a chatbot provides for free.
The open-weight result: not yet
Commercial AI is one story. The bigger question for anyone running on-prem or self-hosted models is: can an open-weight model that fits on a single workstation do the same job? In May 2026, the answer is no.
NVIDIA's nemotron-3-nano:30b is a 30-billion-parameter reasoning model — large enough to be a serious open-weight candidate, small enough to run on a single DGX Spark workstation without a datacenter. We ran it through the same 30-question samples on 46 of 51 states (the run errored on the last five alphabetically due to a GPU-memory eviction). It never passed.
- Average score: 42.8%, well below the 80% pass threshold.
- States passed: 0 of 46.
- Best state: Louisiana, 67%.
- Worst: Colorado, Connecticut, Idaho, Nebraska, Ohio, and Tennessee, all at 30%, with Florida, Iowa, Minnesota, and Mississippi just above at 33%.
The gap between commercial GPT-4.1 (90% average) and a strong open-weight 30B model (43% average) is the headline. Open-weight reasoning models are improving quickly, and a 70B-class model on heavier hardware would likely close some of the gap. But as of this writing, dropping a small open-weight model into a permit-prep product as a free substitute for a commercial API call would cut accuracy roughly in half.
We attempted three additional Spark-local models: Meta's llama4:scout (78 GB), Alibaba's qwen3:32b, and Meta's llama3.3:70b. All three hit infrastructure problems before producing usable data: model-load timeouts on the largest, response-field bugs in the local Ollama daemon for one, and GPU-memory eviction during the 51-state batch for another. Their results are excluded from the leaderboard. Separately, anyone with a Grok or Anthropic API key can rerun the harness against those providers' models; the code is published with the dataset.
How we ran the experiment
The setup is deliberately simple, so anyone can reproduce it. For each of the 51 U.S. jurisdictions (the 50 states plus the District of Columbia), we drew a 30-question random sample from the state's full official permit-test bank. Every model saw the same 30 questions in the same order, with the same prompt, at temperature 0.0 (deterministic decoding). A model passes a state if it scores 80% or higher — the typical state-DMV passing score.
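The sampling step described above, where every model sees the same 30 questions in the same order, can be sketched as follows. This is an illustration under stated assumptions: the `draw_sample` function, the per-state seeding scheme, and the placeholder question bank are hypothetical and are not taken from the published harness.

```python
import random

SAMPLE_SIZE = 30  # questions drawn per state, as in the article


def draw_sample(bank, state, seed=2026):
    """Draw a reproducible random sample from one state's question bank.

    Seeding the generator with the state code plus a fixed seed is an
    assumption of this sketch; it guarantees that repeated calls return
    the same questions in the same order, so every model can be given
    an identical list.
    """
    rng = random.Random(f"{state}-{seed}")
    return rng.sample(bank, SAMPLE_SIZE)


# Placeholder bank: stand-ins for a state's real question records.
bank = [f"question {i}" for i in range(200)]

first = draw_sample(bank, "IA")
second = draw_sample(bank, "IA")
print(first == second)
```

Drawing the sample once and storing it alongside the results would work equally well; the seeded generator simply makes the draw repeatable without a stored file.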
The question banks are sourced from published state DMV handbooks plus regulator-approved third-party study guides. They are representative of state knowledge-test content but are not the verbatim test the state administers. That distinction matters for anyone interpreting absolute numbers, but the relative comparison across models holds because every model saw the identical questions.
The prompt below is what each model received before every question. It is intentionally minimal — no chain-of-thought instructions, no retrieval, no system messages tuned to driver education. We wanted to measure each model's baseline knowledge as cleanly as possible:
You are taking the {state} state DMV learner-permit knowledge test.
Read the question and four choices. Reply with EXACTLY one capital letter
(A, B, C, or D) and nothing else.
Question: {question}
Choices:
{choices}
Your answer (one letter only):
For the technical reader: the prompt's SHA-256 short hash 34e86449ee017d0f is recorded in every result row, so a reader can verify identical prompting across every cell of the experiment. For Gemini 2.5 Flash and the nemotron reasoning model, max-output-tokens was raised above one to allow hidden reasoning tokens before the final letter; the answer regex still extracts a single capital letter. The 102 commercial-API evaluations cost approximately $7 in API spend across OpenAI and Gemini. The 46-state nemotron run was free, executed on a DGX Spark workstation behind a local Ollama endpoint.
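The prompt template, its short hash, and the single-letter answer extraction described above can be sketched as follows. Several details are assumptions: the exact whitespace of the template, the choice of the first 16 hex characters as the "short hash", and the regular expression are illustrative, so this sketch is not expected to reproduce the published hash value, and the placeholder choices are not from the real question bank.

```python
import hashlib
import re

# Template text as printed in the article; exact whitespace is an assumption.
PROMPT_TEMPLATE = (
    "You are taking the {state} state DMV learner-permit knowledge test.\n"
    "Read the question and four choices. Reply with EXACTLY one capital letter\n"
    "(A, B, C, or D) and nothing else.\n"
    "Question: {question}\n"
    "Choices:\n"
    "{choices}\n"
    "Your answer (one letter only):"
)

# Hypothetical pattern: a standalone capital letter A through D.
ANSWER_RE = re.compile(r"\b([A-D])\b")


def prompt_short_hash(template: str) -> str:
    """First 16 hex characters of the template's SHA-256 digest."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


def extract_answer(raw: str):
    """Return the last standalone A-D letter in the reply, or None.

    Taking the last match is a design choice of this sketch: if a model
    emits visible reasoning before its answer, the final letter wins.
    """
    matches = ANSWER_RE.findall(raw)
    return matches[-1] if matches else None


prompt = PROMPT_TEMPLATE.format(
    state="Iowa",
    question="Who must wear a safety belt in an Iowa vehicle?",
    choices="A. (choice text)\nB. (choice text)\nC. (choice text)\nD. (choice text)",
)
print(prompt_short_hash(PROMPT_TEMPLATE))
print(extract_answer("The answer is C."))
```

Hashing the unformatted template rather than each filled-in prompt gives one value per experiment, which is what makes it usable as a consistency check across rows.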
The commercial models in detail
GPT-4.1 — the obvious winner
OpenAI's GPT-4.1 cleared 80% in 49 of 51 states with an average of 90.1%. Best performance: 100% in California, Indiana, and the District of Columbia. Worst performance: 73% on Iowa and 77% on Vermont, both attributable, as discussed, to state-specific procedural questions rather than general driving knowledge. If you treat the permit test as a knowledge benchmark, GPT-4.1 is at the level of an above-average 16-year-old who's done a few practice tests but hasn't memorized their state's administrative quirks.
Gemini 2.5 Flash — the borderline case
Google's Gemini 2.5 Flash averaged exactly 80.0%, right at the pass threshold. It passed in 32 of 51 states (63% pass rate), with a wide spread: it cleanly passes most western and large-population states and struggles in a number of smaller ones. The average masks variance: when it's good, it's 90% or better; when it fails, most scores fall between 67% and 77%, with outliers at 57% (Kentucky), 37% (Wisconsin and West Virginia), and 10% (Washington). The hidden-reasoning model architecture means each call costs more in latency than GPT-4.1, even though the output is just a letter.
Discussion: what these numbers mean
Three implications, in descending order of confidence.
For permit-prep apps and study tools. The AI-fails-on-state-specifics finding is the actual moat. Anyone can plug a question into ChatGPT and get a reasonable answer for general driving knowledge. Nobody can reliably ask a chatbot “what's the curfew for an Iowa intermediate license” and get the right answer. State-by-state permit prep (handbook text actually pulled from the state, practice tests written against the real bank, current rules and fees) is what a real product adds on top of free general AI. That is precisely the gap dmvpermit.com and similar tools fill.
For autonomous-driving and AI assistants. The 90% general-driving-knowledge score from GPT-4.1 is, frankly, higher than we expected before running the experiment. AI models have absorbed driving principles well enough to be useful as conversational reference material: a parent helping a teenager study, a new driver reviewing right-of-way rules. The 43% average from a strong open-weight 30B model is a reminder of where the floor sits: smaller models still don't hold the knowledge reliably enough to depend on without a citation layer or a retrieval system bolted on.
For benchmarking AI generally. State permit tests turn out to be a useful, cheap, replicable knowledge benchmark. Each state's test is publicly published, the answer keys are known, the questions are unambiguous, and the 30-question sample per state gives reasonable variance bounds. We're publishing the dataset (CC BY 4.0) so anyone building a knowledge benchmark can include this category.
Limitations
- 30-question sample. Real state knowledge tests range from 20 to 50 questions. Larger samples would tighten variance bounds; cost would scale linearly. We chose 30 as the modal real-world test length.
- Question bank source. Questions are drawn from the ModulesFactory state-DMV packages, sourced from state handbooks plus regulator-approved third-party guides. Representative but not the verbatim state-administered test.
- Two commercial models, not all of them. Grok-4 and Anthropic's Claude were not evaluated due to API key availability in this run. The harness supports them; rerun with credentials.
- Open-weight evaluation incomplete. nemotron covers 46 of 51 states; the missing five (the last alphabetically: Vermont, Washington, Wisconsin, West Virginia, and Wyoming) errored due to a Spark GPU-memory eviction during the run. The pattern from the 46 sampled states is consistent enough to draw the conclusion. Three other Spark-local models hit similar infrastructure problems; results excluded.
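To put the 30-question sample limitation in concrete terms, the uncertainty on a single state score can be estimated with a standard binomial confidence interval. The sketch below uses the Wilson score interval. This is our illustration, not a method the study reports using, and it treats each question as an independent trial, which is an approximation.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# A score exactly at the pass threshold: 24 of 30 correct.
low, high = wilson_interval(24, 30)
print(f"24/30: {low:.1%} to {high:.1%}")
```

For a score right at the threshold (24 of 30), the interval spans roughly 63% to 90%, so a single-state pass or fail near 80% should be read with caution; the averages across dozens of states are much tighter.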
Conclusion
The headline answer is yes: the leading commercial AI models can pass U.S. state DMV permit tests, and GPT-4.1 does so at a comfortable margin. GPT-4.1 passes 49 of 51 states; Gemini 2.5 Flash passes 32. The interesting finding is that GPT-4.1's two failures come down to questions about state-specific administrative rules: minimum ages, curfew hours, passenger restrictions, the precise number of questions on the test itself. The universal driving rulebook is something the public internet has taught the models well; the state-by-state administrative patchwork is not.
The open-weight story is different. A 30-billion-parameter open-weight reasoning model running on a single workstation averaged 43% and passed zero states. Smaller open-weight models are getting dramatically better, but they are not yet a drop-in replacement for a commercial API call when accuracy matters. Anyone deploying on-prem will need either heavier hardware, a retrieval layer, or a task-specific fine-tune to close the gap.
Practically: if you are studying for a permit test, ChatGPT can help you understand the universal rules of the road. It cannot reliably tell you the rules of your state. That is the line where free AI ends and a real permit-prep product begins.
Data availability
The full per-state, per-model dataset — every prompt, every raw response, every score — is published as ai-vs-dmv-permit-test-2026.json under CC BY 4.0. Cite as: dmvpermit.com, AI vs DMV — Can the Leading AI Models Pass U.S. State Permit Tests?, May 2026. Press inquiries: ronan@dmvpermit.com.