Introduction

When you ask a top-tier AI model whether it's safe to pass on a double-yellow line, it answers correctly 95% of the time. When you ask it how many questions are on the Vermont DMV permit test, it gets it wrong every time. The leading commercial AIs can pass most U.S. state permit tests; where they fail, they fail in a specific and revealing way.

In May 2026, dmvpermit.com ran 30-question samples from every U.S. state's official DMV permit test bank (1,530 questions across 51 jurisdictions) through three AI models: two leading commercial systems (OpenAI's GPT-4.1 and Google's Gemini 2.5 Flash) and a 30-billion-parameter open-weight reasoning model (NVIDIA's nemotron-3-nano:30b) running locally on a DGX Spark. Every model received an identical prompt at temperature 0.0. Pass threshold: 80%, the typical state-DMV passing score.

We ran the experiment to answer two questions a journalist, a permit-prep app owner, and an AI-skeptic parent might all care about: can a chatbot pass a state permit test, and if it mostly can, where does it still fall short? The answers to both turn out to be more interesting than a simple pass/fail.

The topline

OpenAI GPT-4.1 averaged 90.1% across 51 states and would pass the test in 49 of them — a 96% pass rate.
Google Gemini 2.5 Flash averaged 80.0%, exactly at the threshold, and passed in 32 of 51 states (63% pass rate).
nemotron-3-nano:30b, the open-weight 30-billion-parameter reasoning model, averaged 42.8% and passed in zero of the 46 states we evaluated.
In the two states GPT-4.1 failed (Iowa at 73%, Vermont at 77%), every single missed question was about state-specific administrative procedure, not about driving rules.

AI vs DMV — Leaderboard

148 state-model pairs across 51 states × 3 models. Pass threshold: 80%.

Overall by model

Model	Avg score	States passed	Pass rate	Min — Max	N
openai/gpt-4.1	90.1%	49/51	96%	73% — 100%	51
gemini/gemini-2.5-flash	80.0%	32/51	63%	10% — 100%	51
nemotron-3-nano:30b	42.8%	0/46	0%	30% — 67%	46

Per-state, per-model

State	Model	Score	Result
AK (Alaska)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
AK (Alaska)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
AK (Alaska)	openai/gpt-4.1	96.7% (29/30)	PASS
AL (Alabama)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
AL (Alabama)	nemotron-3-nano:30b	40.0% (12/30)	FAIL
AL (Alabama)	openai/gpt-4.1	96.7% (29/30)	PASS
AR (Arkansas)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
AR (Arkansas)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
AR (Arkansas)	openai/gpt-4.1	96.7% (29/30)	PASS
AZ (Arizona)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
AZ (Arizona)	nemotron-3-nano:30b	40.0% (12/30)	FAIL
AZ (Arizona)	openai/gpt-4.1	86.7% (26/30)	PASS
CA (California)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
CA (California)	nemotron-3-nano:30b	56.7% (17/30)	FAIL
CA (California)	openai/gpt-4.1	100.0% (30/30)	PASS
CO (Colorado)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
CO (Colorado)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
CO (Colorado)	openai/gpt-4.1	90.0% (27/30)	PASS
CT (Connecticut)	gemini/gemini-2.5-flash	76.7% (23/30)	FAIL
CT (Connecticut)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
CT (Connecticut)	openai/gpt-4.1	86.7% (26/30)	PASS
DC (District of Columbia)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
DC (District of Columbia)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
DC (District of Columbia)	openai/gpt-4.1	100.0% (30/30)	PASS
DE (Delaware)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
DE (Delaware)	nemotron-3-nano:30b	43.3% (13/30)	FAIL
DE (Delaware)	openai/gpt-4.1	83.3% (25/30)	PASS
FL (Florida)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
FL (Florida)	nemotron-3-nano:30b	33.3% (10/30)	FAIL
FL (Florida)	openai/gpt-4.1	80.0% (24/30)	PASS
GA (Georgia)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
GA (Georgia)	nemotron-3-nano:30b	43.3% (13/30)	FAIL
GA (Georgia)	openai/gpt-4.1	86.7% (26/30)	PASS
HI (Hawaii)	gemini/gemini-2.5-flash	76.7% (23/30)	FAIL
HI (Hawaii)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
HI (Hawaii)	openai/gpt-4.1	96.7% (29/30)	PASS
IA (Iowa)	gemini/gemini-2.5-flash	80.0% (24/30)	PASS
IA (Iowa)	nemotron-3-nano:30b	33.3% (10/30)	FAIL
IA (Iowa)	openai/gpt-4.1	73.3% (22/30)	FAIL
ID (Idaho)	gemini/gemini-2.5-flash	66.7% (20/30)	FAIL
ID (Idaho)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
ID (Idaho)	openai/gpt-4.1	93.3% (28/30)	PASS
IL (Illinois)	gemini/gemini-2.5-flash	66.7% (20/30)	FAIL
IL (Illinois)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
IL (Illinois)	openai/gpt-4.1	90.0% (27/30)	PASS
IN (Indiana)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
IN (Indiana)	nemotron-3-nano:30b	40.0% (12/30)	FAIL
IN (Indiana)	openai/gpt-4.1	100.0% (30/30)	PASS
KS (Kansas)	gemini/gemini-2.5-flash	66.7% (20/30)	FAIL
KS (Kansas)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
KS (Kansas)	openai/gpt-4.1	93.3% (28/30)	PASS
KY (Kentucky)	gemini/gemini-2.5-flash	56.7% (17/30)	FAIL
KY (Kentucky)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
KY (Kentucky)	openai/gpt-4.1	80.0% (24/30)	PASS
LA (Louisiana)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
LA (Louisiana)	nemotron-3-nano:30b	66.7% (20/30)	FAIL
LA (Louisiana)	openai/gpt-4.1	93.3% (28/30)	PASS
MA (Massachusetts)	gemini/gemini-2.5-flash	70.0% (21/30)	FAIL
MA (Massachusetts)	nemotron-3-nano:30b	46.7% (14/30)	FAIL
MA (Massachusetts)	openai/gpt-4.1	96.7% (29/30)	PASS
MD (Maryland)	gemini/gemini-2.5-flash	80.0% (24/30)	PASS
MD (Maryland)	nemotron-3-nano:30b	43.3% (13/30)	FAIL
MD (Maryland)	openai/gpt-4.1	96.7% (29/30)	PASS
ME (Maine)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
ME (Maine)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
ME (Maine)	openai/gpt-4.1	96.7% (29/30)	PASS
MI (Michigan)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
MI (Michigan)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
MI (Michigan)	openai/gpt-4.1	83.3% (25/30)	PASS
MN (Minnesota)	gemini/gemini-2.5-flash	66.7% (20/30)	FAIL
MN (Minnesota)	nemotron-3-nano:30b	33.3% (10/30)	FAIL
MN (Minnesota)	openai/gpt-4.1	83.3% (25/30)	PASS
MO (Missouri)	gemini/gemini-2.5-flash	70.0% (21/30)	FAIL
MO (Missouri)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
MO (Missouri)	openai/gpt-4.1	80.0% (24/30)	PASS
MS (Mississippi)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
MS (Mississippi)	nemotron-3-nano:30b	33.3% (10/30)	FAIL
MS (Mississippi)	openai/gpt-4.1	93.3% (28/30)	PASS
MT (Montana)	gemini/gemini-2.5-flash	93.3% (28/30)	PASS
MT (Montana)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
MT (Montana)	openai/gpt-4.1	90.0% (27/30)	PASS
NC (North Carolina)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
NC (North Carolina)	nemotron-3-nano:30b	53.3% (16/30)	FAIL
NC (North Carolina)	openai/gpt-4.1	93.3% (28/30)	PASS
ND (North Dakota)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
ND (North Dakota)	nemotron-3-nano:30b	60.0% (18/30)	FAIL
ND (North Dakota)	openai/gpt-4.1	93.3% (28/30)	PASS
NE (Nebraska)	gemini/gemini-2.5-flash	73.3% (22/30)	FAIL
NE (Nebraska)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
NE (Nebraska)	openai/gpt-4.1	86.7% (26/30)	PASS
NH (New Hampshire)	gemini/gemini-2.5-flash	83.3% (25/30)	PASS
NH (New Hampshire)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
NH (New Hampshire)	openai/gpt-4.1	80.0% (24/30)	PASS
NJ (New Jersey)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
NJ (New Jersey)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
NJ (New Jersey)	openai/gpt-4.1	90.0% (27/30)	PASS
NM (New Mexico)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
NM (New Mexico)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
NM (New Mexico)	openai/gpt-4.1	90.0% (27/30)	PASS
NV (Nevada)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
NV (Nevada)	nemotron-3-nano:30b	40.0% (12/30)	FAIL
NV (Nevada)	openai/gpt-4.1	86.7% (26/30)	PASS
NY (New York)	gemini/gemini-2.5-flash	93.3% (28/30)	PASS
NY (New York)	nemotron-3-nano:30b	36.7% (11/30)	FAIL
NY (New York)	openai/gpt-4.1	86.7% (26/30)	PASS
OH (Ohio)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
OH (Ohio)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
OH (Ohio)	openai/gpt-4.1	83.3% (25/30)	PASS
OK (Oklahoma)	gemini/gemini-2.5-flash	96.7% (29/30)	PASS
OK (Oklahoma)	nemotron-3-nano:30b	43.3% (13/30)	FAIL
OK (Oklahoma)	openai/gpt-4.1	93.3% (28/30)	PASS
OR (Oregon)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
OR (Oregon)	nemotron-3-nano:30b	46.7% (14/30)	FAIL
OR (Oregon)	openai/gpt-4.1	93.3% (28/30)	PASS
PA (Pennsylvania)	gemini/gemini-2.5-flash	100.0% (30/30)	PASS
PA (Pennsylvania)	nemotron-3-nano:30b	43.3% (13/30)	FAIL
PA (Pennsylvania)	openai/gpt-4.1	96.7% (29/30)	PASS
RI (Rhode Island)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
RI (Rhode Island)	nemotron-3-nano:30b	63.3% (19/30)	FAIL
RI (Rhode Island)	openai/gpt-4.1	93.3% (28/30)	PASS
SC (South Carolina)	gemini/gemini-2.5-flash	96.7% (29/30)	PASS
SC (South Carolina)	nemotron-3-nano:30b	53.3% (16/30)	FAIL
SC (South Carolina)	openai/gpt-4.1	93.3% (28/30)	PASS
SD (South Dakota)	gemini/gemini-2.5-flash	93.3% (28/30)	PASS
SD (South Dakota)	nemotron-3-nano:30b	46.7% (14/30)	FAIL
SD (South Dakota)	openai/gpt-4.1	96.7% (29/30)	PASS
TN (Tennessee)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
TN (Tennessee)	nemotron-3-nano:30b	30.0% (9/30)	FAIL
TN (Tennessee)	openai/gpt-4.1	86.7% (26/30)	PASS
TX (Texas)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
TX (Texas)	nemotron-3-nano:30b	46.7% (14/30)	FAIL
TX (Texas)	openai/gpt-4.1	93.3% (28/30)	PASS
UT (Utah)	gemini/gemini-2.5-flash	93.3% (28/30)	PASS
UT (Utah)	nemotron-3-nano:30b	50.0% (15/30)	FAIL
UT (Utah)	openai/gpt-4.1	90.0% (27/30)	PASS
VA (Virginia)	gemini/gemini-2.5-flash	100.0% (30/30)	PASS
VA (Virginia)	nemotron-3-nano:30b	60.0% (18/30)	FAIL
VA (Virginia)	openai/gpt-4.1	93.3% (28/30)	PASS
VT (Vermont)	gemini/gemini-2.5-flash	86.7% (26/30)	PASS
VT (Vermont)	openai/gpt-4.1	76.7% (23/30)	FAIL
WA (Washington)	gemini/gemini-2.5-flash	10.0% (3/30)	FAIL
WA (Washington)	openai/gpt-4.1	90.0% (27/30)	PASS
WI (Wisconsin)	gemini/gemini-2.5-flash	36.7% (11/30)	FAIL
WI (Wisconsin)	openai/gpt-4.1	83.3% (25/30)	PASS
WV (West Virginia)	gemini/gemini-2.5-flash	36.7% (11/30)	FAIL
WV (West Virginia)	openai/gpt-4.1	96.7% (29/30)	PASS
WY (Wyoming)	gemini/gemini-2.5-flash	90.0% (27/30)	PASS
WY (Wyoming)	openai/gpt-4.1	86.7% (26/30)	PASS
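
The "Overall by model" figures can be recomputed from the per-state rows. The sketch below is illustrative only: it uses a handful of rows copied from the table above rather than all 148, and the names ROWS and summarize are hypothetical, not taken from the published harness.

```python
from collections import defaultdict

PASS_PERCENT = 80          # pass threshold used in the article
QUESTIONS_PER_STATE = 30   # sample size per state

# A few (state, model, correct answers) rows copied from the table above.
ROWS = [
    ("CA", "openai/gpt-4.1", 30),
    ("IA", "openai/gpt-4.1", 22),
    ("VT", "openai/gpt-4.1", 23),
    ("CA", "gemini/gemini-2.5-flash", 27),
    ("IA", "gemini/gemini-2.5-flash", 24),
    ("VT", "gemini/gemini-2.5-flash", 26),
    ("CA", "nemotron-3-nano:30b", 17),
    ("IA", "nemotron-3-nano:30b", 10),
]


def summarize(rows):
    """Group rows by model and compute average, pass count, and range."""
    by_model = defaultdict(list)
    for _state, model, correct in rows:
        by_model[model].append(correct)

    summary = {}
    for model, counts in by_model.items():
        n = len(counts)
        # Integer comparison avoids float rounding at exactly 80%.
        passed = sum(
            1 for c in counts
            if c * 100 >= PASS_PERCENT * QUESTIONS_PER_STATE
        )
        summary[model] = {
            "avg": sum(counts) / (n * QUESTIONS_PER_STATE),
            "passed": passed,
            "n": n,
            "min": min(counts) / QUESTIONS_PER_STATE,
            "max": max(counts) / QUESTIONS_PER_STATE,
        }
    return summary


if __name__ == "__main__":
    for model, stats in summarize(ROWS).items():
        print(f"{model}: avg {stats['avg']:.1%}, "
              f"passed {stats['passed']}/{stats['n']}")
```

Feeding in all 148 rows instead of this subset should reproduce the averages and pass counts in the summary table.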


Where AI fails: state-specific minutiae

The most interesting finding from the study is not that GPT-4.1 passes most states. It's where the model breaks down on the two states it fails.

On Iowa, GPT-4.1 missed 8 of 30 questions. The questions it got wrong include:

“What is the minimum age to apply for a learner's permit in Iowa?”
“What is the minimum age to apply for a school permit in Iowa?”
“You are a 16-year-old with a provisional license. You pick up a friend who is 17 years old. What must you do?”
“A 16-year-old has an intermediate license. They want to drive a friend home at 11:30 PM. What should they do?”
“Who must wear a safety belt in an Iowa vehicle?”
“How many questions are on the Iowa knowledge test, and how many must you answer correctly to pass?”

On Vermont, GPT-4.1 missed 7 of 30 questions. All seven were the same kind of administrative question rephrased: how many questions are on the Vermont permit test, what passing score is required, and where exactly you go to take it.

The pattern is consistent. When GPT-4.1 misses a permit-test question, it's overwhelmingly because the question asks about a number, an age, a curfew time, a passenger-restriction rule, or a procedural detail that varies state-by-state. The model cleanly handles “what does a flashing red light mean,” “who has the right of way at a four-way stop,” and “at what speed should you slow down for ice.” It does not reliably know that Iowa offers a school permit at age 14 and two months, that Vermont's permit test has 20 questions requiring 16 correct, or that an intermediate license in Iowa carries an 11 PM driving curfew with sibling-passenger exceptions.

That is, in retrospect, exactly what we should expect. AI models trained on the public internet have absorbed the universal driving rulebook — the principles every state shares because traffic physics doesn't change at state lines. They have not absorbed, and probably never will memorize cleanly, the 51-jurisdiction patchwork of administrative law that surrounds licensing. State-specific knowledge is exactly what permit-prep apps add over what a chatbot provides for free.

The open-weight result: not yet

Commercial AI is one story. The bigger question for anyone running on-prem or self-hosted models is: can an open-weight model that fits on a single workstation do the same job? In May 2026, the answer is no.

NVIDIA's nemotron-3-nano:30b is a 30-billion-parameter reasoning model — large enough to be a serious open-weight candidate, small enough to run on a single DGX Spark workstation without a datacenter. We ran it through the same 30-question samples on 46 of 51 states (the run errored on the last five alphabetically due to a GPU-memory eviction). It never passed.

Average score: 42.8%, well below the 80% pass threshold.
States passed: 0 of 46.
Best state: Louisiana, 67%.
Worst: Colorado, Connecticut, Idaho, Nebraska, Ohio, and Tennessee at 30%, with Florida, Iowa, Minnesota, and Mississippi just above at 33%.

The gap between commercial GPT-4.1 (90% average) and a strong open-weight 30B model (43% average) is the headline. Open-weight reasoning models are improving quickly, and a 70B-class model on heavier hardware would likely close some of the gap. But as of this writing, dropping a small open-weight model into a permit-prep product as a free substitute for a commercial API call would cut accuracy roughly in half.

We attempted three additional Spark-local models — Meta's llama4:scout (78GB), Alibaba's qwen3:32b, and Meta's llama3.3:70b. All three hit infrastructure problems before producing usable data: model-load timeouts on the largest, response-field bugs in the local Ollama daemon for one, and GPU-memory eviction during the 51-state batch for another. Their results are excluded from the leaderboard. Anyone with a Grok or Anthropic API key can rerun the harness against those models — the code is published with the dataset.

How we ran the experiment

The setup is deliberately simple, so anyone can reproduce it. For each of the 51 U.S. jurisdictions (the 50 states plus the District of Columbia), we drew a 30-question random sample from the state's full official permit-test bank. Every model saw the same 30 questions in the same order, with the same prompt, at temperature 0.0 (deterministic decoding). A model passes a state if it scores 80% or higher — the typical state-DMV passing score.
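
One way to guarantee that every model sees the same 30 questions in the same order is to draw each state's sample once, from a seeded random generator, and reuse that list for every model. The sketch below assumes this approach; the function name draw_sample, the seed value, and the per-state seeding scheme are hypothetical and may differ from the published harness.

```python
import random

SAMPLE_SIZE = 30  # questions per state, as in the article


def draw_sample(bank, state, seed=2026):
    """Return a reproducible sample of questions for one state.

    Seeding on the state name (an assumption of this sketch) means the
    same bank always yields the same questions in the same order.
    """
    rng = random.Random(f"{state}:{seed}")
    return rng.sample(bank, min(SAMPLE_SIZE, len(bank)))


if __name__ == "__main__":
    # Placeholder bank; a real one would hold question records.
    bank = [f"question-{i}" for i in range(200)]
    sample = draw_sample(bank, "IA")
    for model in ("openai/gpt-4.1", "gemini/gemini-2.5-flash"):
        # Each model iterates over the identical list.
        print(model, sample[:3])
```

Drawing the sample before the model loop, rather than inside it, is what keeps the comparison across models like-for-like.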

The question banks are sourced from published state DMV handbooks plus regulator-approved third-party study guides. They are representative of state knowledge-test content but are not the verbatim test the state administers. That distinction matters for anyone interpreting absolute numbers, but the relative comparison across models holds because every model saw the identical questions.

The prompt below is what each model received before every question. It is intentionally minimal — no chain-of-thought instructions, no retrieval, no system messages tuned to driver education. We wanted to measure each model's baseline knowledge as cleanly as possible:

You are taking the {state} state DMV learner-permit knowledge test.

Read the question and four choices. Reply with EXACTLY one capital letter
(A, B, C, or D) and nothing else.

Question: {question}

Choices:
{choices}

Your answer (one letter only):

For the technical reader: the prompt's SHA-256 short hash 34e86449ee017d0f is recorded in every result row, so a reader can verify identical prompting across every cell of the experiment. For Gemini 2.5 Flash and the nemotron reasoning model, max-output-tokens was raised above one to allow hidden reasoning tokens before the final letter; the answer regex still extracts a single capital letter. The 102 commercial-API evaluations cost approximately $7 in API spend across OpenAI and Gemini. The 46-state nemotron run was free, executed on a DGX Spark workstation behind a local Ollama endpoint.
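
For readers who want to see the mechanics, here is a minimal sketch of how a prompt fingerprint and single-letter extraction could work. It is an assumption-laden illustration, not the published harness: the template string is retyped from the prompt above, so its hash will only match the recorded 34e86449ee017d0f if the whitespace and the hashed text are byte-identical to what the harness hashed. The names prompt_short_hash and extract_answer, the 16-character truncation, and the choice to take the last standalone letter are all hypothetical.

```python
import hashlib
import re
from typing import Optional

# Retyped from the prompt shown above; exact whitespace is an assumption.
PROMPT_TEMPLATE = (
    "You are taking the {state} state DMV learner-permit knowledge test.\n"
    "\n"
    "Read the question and four choices. Reply with EXACTLY one capital letter\n"
    "(A, B, C, or D) and nothing else.\n"
    "\n"
    "Question: {question}\n"
    "\n"
    "Choices:\n"
    "{choices}\n"
    "\n"
    "Your answer (one letter only):"
)

# Matches a standalone capital A, B, C, or D.
ANSWER_RE = re.compile(r"\b([ABCD])\b")


def prompt_short_hash(template: str) -> str:
    """Return the first 16 hex characters of the template's SHA-256."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


def extract_answer(raw: str) -> Optional[str]:
    """Return the last standalone A-D letter in a response, or None."""
    matches = ANSWER_RE.findall(raw)
    return matches[-1] if matches else None


if __name__ == "__main__":
    prompt = PROMPT_TEMPLATE.format(
        state="Iowa",
        question="What does a flashing red light mean?",
        choices="A. Stop\nB. Yield\nC. Slow down\nD. Proceed",
    )
    print(prompt_short_hash(PROMPT_TEMPLATE))
    print(extract_answer("The answer is A"))
```

Taking the last match is a design choice for models that emit reasoning before the final letter; a stricter harness might instead reject any response that is not exactly one character.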

The commercial models in detail

GPT-4.1 — the obvious winner

OpenAI's GPT-4.1 cleared 80% in 49 of 51 states with an average of 90.1%. Best performance: 100% in California, Indiana, and the District of Columbia. Worst performance: 73% on Iowa and 77% on Vermont, both attributable, as discussed, to state-specific procedural questions rather than general driving knowledge. If you treat the permit test as a knowledge benchmark, GPT-4.1 is at the level of an above-average 16-year-old who's done a few practice tests but hasn't memorized their state's administrative quirks.

Gemini 2.5 Flash — the borderline case

Google's Gemini 2.5 Flash averaged exactly 80.0%, right at the pass threshold. It passed in 32 of 51 states (63% pass rate), with a wide spread: it cleanly passes most western and large-population states and struggles in smaller states with stricter test banks. The average masks that variance: when it's good, it's 90%+; when it's bad, it is usually in the 60s or 70s, with three outliers far lower (Washington at 10%, Wisconsin and West Virginia at 37%). The hidden-reasoning model architecture means each call costs more in latency than GPT-4.1, even though the output is just a letter.

Discussion: what these numbers mean

Three implications, in descending order of confidence.

For permit-prep apps and study tools. The AI-fails-on-state-specifics finding is the actual moat. Anyone can plug a question into ChatGPT and get a reasonable answer for general driving knowledge. Nobody can reliably ask a chatbot “what's the curfew for an Iowa intermediate license” and get the right answer. State-by-state permit prep (handbook text actually pulled from the state, practice tests written against the real bank, current rules and fees) is what a real product adds on top of free general AI. That is precisely the gap dmvpermit.com and similar tools fill.

For autonomous-driving and AI assistants. The 90% general-driving-knowledge score from GPT-4.1 is, frankly, higher than we expected before running the experiment. AI models have absorbed driving principles well enough to be useful as conversational reference material, for example a parent helping a teenager study or a new driver reviewing right-of-way rules. The 43% average from a strong open-weight 30B model is the floor reminder: smaller models still don't retain the knowledge reliably enough to depend on without a citation layer or a retrieval system bolted on.

For benchmarking AI generally. State permit tests turn out to be a useful, cheap, replicable knowledge benchmark. Each state's source material is publicly published, the answer keys are known, the questions are unambiguous, and the 30-question sample per state gives reasonable variance bounds. We're publishing the dataset (CC BY 4.0) so anyone building a knowledge benchmark can include this category.

Limitations

30-question sample. Real state knowledge tests range from 20 to 50 questions. Larger samples would tighten variance bounds; cost would scale linearly. We chose 30 as the modal real-world test length.
Question bank source. Questions are drawn from the ModulesFactory state-DMV packages, sourced from state handbooks plus regulator-approved third-party guides. Representative but not the verbatim state-administered test.
Two commercial models, not all of them. Grok-4 and Anthropic's Claude were not evaluated because API keys were not available for this run. The harness supports them; rerun with credentials.
Open-weight evaluation incomplete. nemotron covers 46 of 51 states; the missing five (Vermont, Washington, Wisconsin, West Virginia, and Wyoming, the last five alphabetically by postal code) errored due to a Spark GPU-memory eviction during the run. The pattern from the 46 sampled states is consistent enough to draw the conclusion. Three other Spark-local models hit similar infrastructure problems; results excluded.
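
To put a rough number on the 30-question sample limitation, the sketch below computes a 95% Wilson score interval for a single state-level score. This is a standard statistical tool chosen for illustration, not something the study reports; the function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (low, high) Wilson score interval for correct/total.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(
        p * (1 - p) / total + z**2 / (4 * total**2)
    ) / denom
    return center - half, center + half


if __name__ == "__main__":
    # A score exactly at the pass threshold: 24 of 30.
    low, high = wilson_interval(24, 30)
    print(f"24/30 -> {low:.1%} to {high:.1%}")
```

For 24 of 30 the interval runs from roughly 63% to 90%, so a single state result near the 80% threshold could plausibly flip with a different sample. The per-model averages, pooled over 46 or 51 states, are much less sensitive to this.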

Conclusion

The headline answer is yes: the leading commercial AI models can pass most U.S. state DMV permit tests, GPT-4.1 at a comfortable margin and Gemini 2.5 Flash right at the threshold. GPT-4.1 passes 49 of 51 states; Gemini 2.5 Flash passes 32. The interesting finding is that GPT-4.1's failures concentrate on questions about state-specific administrative rules: minimum ages, curfew hours, passenger restrictions, the precise number of questions on the test itself. The universal driving rulebook is something the public internet has taught the models well; the state-by-state administrative patchwork is not.

The open-weight story is different. A 30-billion-parameter open-weight reasoning model running on a single workstation averaged 43% and passed zero states. Smaller open-weight models are getting dramatically better, but they are not yet a drop-in replacement for a commercial API call when accuracy matters. Anyone deploying on-prem will need either heavier hardware, a retrieval layer, or a task-specific fine-tune to close the gap.

Practically: if you are studying for a permit test, ChatGPT can help you understand the universal rules of the road. It cannot reliably tell you the rules of your state. That is the line where free AI ends and a real permit-prep product begins.

Data availability

The full per-state, per-model dataset — every prompt, every raw response, every score — is published as ai-vs-dmv-permit-test-2026.json under CC BY 4.0. Cite as: dmvpermit.com, AI vs DMV — Can the Leading AI Models Pass U.S. State Permit Tests?, May 2026. Press inquiries: ronan@dmvpermit.com.