How LLM Structured Decoding works
Last week I happened to be in a discussion about getting an LLM to generate JSON reliably. A major frustration was that no matter how much they tried, the model would often fail to follow instructions during generation. I pointed out that most major vendors support some variant of "Structured Output", which allows the user to provide an output schema. That happened to be a good solution to the problem, but I wanted to take a moment to write up some notes about how and why it works so well.
Every language model has a vocabulary, which is essentially a map from token to token ID. Before the model makes a prediction, the input string is broken up into tokens and converted to these numbers so the model has something to work with. A snippet of the Phi-4-mini-instruct vocabulary looks like this:
{
"\u0120NSError": 85268,
"\u0120filtro": 85269,
"\u0120vyt": 85270,
"\u0120Prefeitura": 85271,
"*sizeof": 85272,
"\u0120Continental": 85273,
"\u0120Enfin": 85274,
"???\u010a\u010a": 85275,
"-best": 85276,
"\u0120tolle": 85277,
"\u00e8\u012d\u00b9\u00e6\u0140\u013e\u00e7\u012b\u012a": 85278,
"\u0120\u00d8\u00a7\u00d9\u0126\u00d8\u00b5\u00d9\u012a\u00d8\u00b1": 85279,
"\u0120\u00c3\u00a9nerg": 85280,
"icester": 85281,
"\u0120abbiamo": 85282,
...
}
We can tokenize a string using an instance of the tokenizer, which gives us a sequence of token IDs.
from transformers import AutoTokenizer

prompt = "Write a json object with the following keys: name, age, city must be an object that starts with { and ends with }"
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4-mini-instruct")
inputs = tokenizer(prompt, return_tensors="pt")
print(inputs)
Output:
{'input_ids': tensor([[10930, 261, 5701, 2817, 483, 290, 3992, 12994, 25, 1308,
11, 5744, 11, 5030, 2804, 413, 448, 2817, 484, 13217,
483, 354, 326, 17095, 483, 388]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1]])}
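The mapping works in both directions. As a quick sketch reusing the tokenizer and inputs from above (get_vocab, convert_tokens_to_ids, convert_ids_to_tokens, and decode are standard tokenizer methods), we can look up individual tokens and turn the IDs back into text:
# The full token -> token id map, the same data as the vocabulary snippet earlier
vocab = tokenizer.get_vocab()
print(len(vocab))
# Look a single token up in either direction
print(tokenizer.convert_tokens_to_ids("icester"))  # 85281, per the snippet above
print(tokenizer.convert_ids_to_tokens(85281))      # 'icester'
# Decoding the input ids reproduces the prompt text
print(tokenizer.decode(inputs["input_ids"][0]))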
A prediction from the model outputs a score (a logit) for every token in the vocabulary, which a softmax turns into a probability distribution. This means that for each possible token, you get a score for how likely it is to appear next in the sequence. We can make a prediction to visualize this.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-4-mini-instruct")
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
Now let's grab the prediction for the next token and see which tokens have the highest probability.
# Logits at the last position, i.e. the prediction for the next token
next_token_logits = logits[0, -1]
torch.topk(next_token_logits.softmax(dim=-1), 10)
Output:
torch.return_types.topk(
values=tensor([0.1901, 0.1205, 0.0866, 0.0475, 0.0461, 0.0414, 0.0316, 0.0183, 0.0180,
0.0171]),
indices=tensor([ 326, 483, 13, 887, 1366, 2804, 558, 2238, 350, 290]))
The highest probability token here is 326, which happens to be "and". Next is 483, which is "with". So it's clear the model is really just trying to "complete" the prompt.
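We can decode all of the top candidates to see this directly. A small sketch, reusing next_token_logits and the tokenizer from above:
# Decode the ten most likely next tokens along with their probabilities
probs, indices = torch.topk(next_token_logits.softmax(dim=-1), 10)
for p, token_id in zip(probs, indices):
    print(f"{p.item():.4f} -> {tokenizer.decode([token_id.item()])!r}")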
Since we have access to the predictions, Structured Decoding in this context means making more intelligent decisions about which predictions to accept from the model, based on the rules or criteria we wish to apply to our output.
For example, JSON follows a strict grammar in order for a string to be valid JSON[1]:
- A valid JSON object must start with a {
- A valid JSON array must start with a [
- A valid JSON primitive can be a string, a number, true, false, or null
So for the prediction to begin a valid JSON object, there are only a finite number of acceptable values. This means that when sampling the next token from the model's predictions, we can reject any token that would not be valid and only sample from the pool of valid alternatives[2].
valid_starts = ["{", "["]
valid_ids = [tokenizer.encode(tok, add_special_tokens=False, return_tensors="pt")[0] for tok in valid_starts]
print(valid_ids)
# Mask out everything except our valid ids
mask = torch.full_like(next_token_logits, float("-inf"))
for vid in valid_ids:
    mask[vid] = next_token_logits[vid]
# Take the highest probability token from the pool of valid tokens
next_token_id = torch.argmax(mask).item()
next_token = tokenizer.decode([next_token_id])
# Print for visualization
print("Chosen token:", next_token)
Output:
Chosen token: '{'
In this example, I'm constraining the valid tokens to { and [. We assume every other token is invalid and mask them out. Then we take the highest probability token from the pool of valid tokens.
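Nothing here forces greedy decoding either. After a softmax, the masked-out tokens carry zero probability, so we can sample from the remaining valid tokens instead of always taking the argmax. A quick sketch, reusing the mask from above:
# Softmax over the masked logits: every invalid token ends up with probability 0
probs = torch.softmax(mask, dim=-1)
# Draw one token id from the remaining (valid) probability mass
sampled_id = torch.multinomial(probs, num_samples=1).item()
print("Sampled token:", tokenizer.decode([sampled_id]))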
For more elaborate control over what is valid, we need a way to define a grammar and partially match the completion against it. Most of the major LLM vendors provide some way to define a JSON schema[3][4][5], but more sophisticated APIs allow some sort of BNF-like notation or regex for matching on the outputs. For example, vLLM uses llguidance for its implementation. A manual regex-based implementation might look something like this.
import regex

# Matches a JSON object with a single word key and a string/true/false/null value
pattern = regex.compile(r"^{\s*\"\w+\"\s*:\s*(?:(?:\"\w+\")|true|false|null)\s*}$")

# Pre-compute the string form of every token in the vocabulary
token_ids = list(range(tokenizer.vocab_size))
token_strs = [tokenizer.decode([id]) for id in token_ids]

prompt = "Write a valid json object with a single test key and value"
completion = ""

for i in range(20):
    inputs = tokenizer(f"<|user|>{prompt}<|end|><|assistant|>{completion}", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    next_token_logits = logits[0, -1]
    mask = torch.full_like(next_token_logits, float("-inf"))
    for token_id, token_str in zip(token_ids, token_strs):
        expected_completion = completion + token_str
        # Partial matching checks whether this prefix could still grow into a full match
        if pattern.fullmatch(expected_completion, partial=True):
            mask[token_id] = next_token_logits[token_id]
    if torch.all(mask == float("-inf")):
        # No valid alternative
        break
    next_token_id = torch.argmax(mask).item()
    next_token = tokenizer.decode([next_token_id])
    print(next_token)
    completion += next_token
Output:
{
"
test
":
"
This
"
}
Hopefully this shows why Structured Output is actually guaranteed to conform to the schema, barring any bugs in the sampling code. This is a great option if you have information about the expected structure that might not be immediately clear to the model, or if you intend for the output to be consumed by other tools.
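If you just want this behavior from a hosted model rather than rolling your own sampling loop, the vendor APIs cited above accept a JSON schema directly. As a rough sketch against OpenAI's Chat Completions structured output interface (the model name and schema below are placeholders for illustration; see the linked docs for current details):
from openai import OpenAI

client = OpenAI()

# Placeholder schema for the name/age/city example from earlier
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Give me a person as a JSON object."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema, "strict": True},
    },
)
print(response.choices[0].message.content)  # constrained to match the schema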
This is a simplified example. For full details on the grammar, the standard is available at https://www.json.org/json-en.html. ↩︎
It's not a coincidence that this is extremely similar to a DFA or other parsing techniques. You are effectively streaming lexemes and need to make decisions about what fits and what does not. ↩︎
(no date) Structured output. ai.google.dev. Available at: https://ai.google.dev/gemini-api/docs/structured-output (Accessed: 2025-9-13). ↩︎
(no date) OpenAI platform. platform.openai.com. Available at: https://platform.openai.com/docs/guides/structured-outputs (Accessed: 2025-9-13). ↩︎
(no date) Structured outputs. docs.vllm.ai. Available at: https://docs.vllm.ai/en/v0.9.2/features/structured_outputs.html (Accessed: 2025-9-13). ↩︎