A while back, I wrote a Medium article about Yann LeCun’s vision, and the idea of combining joint embeddings with an energy-based model really stuck in my mind. If Yann LeCun was saying it, it was worth looking into. So I built a tiny fake JEPA+EBM, and it actually behaved like a grown-up model.

Why I Got Obsessed With This
I’ve always liked LeCun’s thesis that “the real deal” for intelligence is not just next-token prediction, but energy-based models operating over rich latent spaces.
The story, in my own words:
- You don’t just want a model that says “likely next token is the”.
- You want a system that says: “Given this context and this hypothesis about the world, how compatible are they?”
- That compatibility is encoded as a single scalar: an energy.
- Low energy = good configuration, high energy = bad configuration.
JEPA (Joint Embedding Predictive Architecture) then layers on top of that idea:
embed different views (context vs future, masked region vs visible region, text vs image, etc.) into a shared latent space, and then reason in that space rather than directly in pixel/token space.
I’m not reproducing LeCun’s full research program. What I wanted was something like:
“Can I glue together today’s practical tools (GloVe, OpenAI embeddings, GPT-5.1’s tool calling) into a tiny, working JEPA-flavored energy model that I can actually poke at?”
The answer turned out to be: yes, but in a very hacky, fun way.
The High-Level Idea
I structured the system around a simple question:
Given a context sentence and a candidate sentence, can I assign a scalar “energy” that tells me how compatible they are?
To do that, I built a small stack:
Representations
- GloVe: old-school static word embeddings, averaged into a sentence vector.
- OpenAI text embeddings: modern, contextual sentence embeddings.
Joint embedding
- Combine GloVe and OpenAI embeddings into a single vector that lives in a shared space.
Energy function
- Turn similarity between joint embeddings into an “energy” score.
Orchestration
- Wrap the whole thing as a tool that GPT-5.1 can call via the Responses API.
- Let GPT-5.1 interpret the energy and explain what it means.
So the architecture is less “I trained a grand model” and more “I wired together pre-trained pieces into something that behaves like a toy EBM”.
How I Represent Context and Candidate
1. GloVe: static lexical geometry
First, I load a big GloVe file from disk (glove.6B.300d.txt in my case). That gives me a dictionary:
- each word → 300-dimensional vector
For a sentence like:
“A child is playing with a red ball in the park.”
I tokenize it into lowercase word-like chunks:
["a", "child", "is", "playing", "with", "a", "red", "ball", "in", "the", "park"]
I then look up each token in the GloVe table (if it exists) and average all those vectors into one sentence-level GloVe vector.
This gives me a bag-of-words flavored representation: it doesn’t understand word order, but it’s very sensitive to lexical overlap and local co-occurrence structure.
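In code, the GloVe side looks roughly like this. It’s a sketch: the helper names are mine, not necessarily the exact code, but the load-then-average logic is what the text describes.

```python
import re
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe vectors into a dict mapping word -> 300-d numpy array."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            table[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return table

def glove_sentence_vector(sentence, table, dim=300):
    """Average the GloVe vectors of all in-vocabulary tokens in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return np.zeros(dim, dtype=np.float32)  # no known words: fall back to zeros
    return np.mean(vecs, axis=0)
```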
2. GPT embeddings: contextual geometry
Next, I call the OpenAI embedding model (text-embedding-3-small) on both sentences:
- I send [context, candidate] in a single call.
I get back two high-dimensional vectors, one per input:
- gpt_ctx — embedding of the context
- gpt_cand — embedding of the candidate
These vectors carry more semantic and contextual structure than GloVe:
- The model knows “kid” and “child” are strongly related.
- It understands that “throwing a ball across the playground” is a specific kind of “playing with a ball”.
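That single call is roughly the following, assuming the official openai Python SDK and an OPENAI_API_KEY in the environment:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt_embeddings(context, candidate, model="text-embedding-3-small"):
    """Embed both sentences in one API call; returns (gpt_ctx, gpt_cand)."""
    resp = client.embeddings.create(model=model, input=[context, candidate])
    gpt_ctx = np.asarray(resp.data[0].embedding, dtype=np.float32)
    gpt_cand = np.asarray(resp.data[1].embedding, dtype=np.float32)
    return gpt_ctx, gpt_cand
```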
3. Building a joint embedding
Now I want a single latent representation for each side that mixes both views.
First, I force both to have the same dimensionality
- I keep GloVe at its native size (300).
- The GPT embedding is larger, so I project it to 300 dimensions in a very simple way (truncate or pad with zeros). This is not fancy, but it gets us matching shapes.
Then, I normalize each vector
- I scale each of the four vectors (glove_ctx, glove_cand, gpt_ctx_proj, gpt_cand_proj) to unit length. That way, I care about direction more than magnitude.
Lastly, I concatenate the two views
For each sentence, I create:
- joint context vector = [normalized GloVe; normalized GPT]
- joint candidate vector = [normalized GloVe; normalized GPT]
So each sentence becomes a 600-dimensional joint embedding that is half “static lexical”, half “contextual semantic”.
This is my cheap, hand-crafted joint latent space.
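Put together, the joint-embedding step is only a few lines. Another sketch: the crude truncate-or-pad projection is deliberate, not a feature.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit length, guarding against zero vectors."""
    return v / (np.linalg.norm(v) + eps)

def project(v, dim=300):
    """Crude dimension matching: truncate if too long, zero-pad if too short."""
    return v[:dim] if len(v) >= dim else np.pad(v, (0, dim - len(v)))

def joint_embedding(glove_vec, gpt_vec, dim=300):
    """Concatenate the normalized GloVe view and the projected, normalized GPT view."""
    return np.concatenate([l2_normalize(glove_vec),
                           l2_normalize(project(gpt_vec, dim))])  # 600-d result
```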
How I Turn Similarity Into “Energy”
Once I have those joint vectors, the rest is conceptually simple:
- I measure how aligned the context and candidate joint embeddings are.
- Closer alignment → higher similarity → lower energy.
Technically, it’s just a similarity score turned upside-down to become an energy. There’s no learned parameter here, no training loop, no gradients. It’s a hand-designed energy function backed by pre-trained embedding spaces.
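Concretely, something like energy = 1 − cosine similarity does the job. That formula is my reading of the setup, but it matches the loose [0, 2] scale and the similarity ≈ 0.78 / energy ≈ 0.22 numbers reported later.

```python
import numpy as np

def energy(joint_ctx, joint_cand):
    """Flip cosine similarity into an energy: 1 - cos_sim, roughly in [0, 2]."""
    cos_sim = np.dot(joint_ctx, joint_cand) / (
        np.linalg.norm(joint_ctx) * np.linalg.norm(joint_cand) + 1e-8
    )
    return 1.0 - cos_sim  # low energy = compatible, high energy = incompatible
```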
The effect is:
- Context and candidate that describe the same scene (e.g., child playing with a red ball vs kid throwing the red ball) produce low energy.
- Completely unrelated or contradictory sentences produce higher energy.
It’s a tiny, frozen energy landscape over sentence pairs.
How GPT-5.1 Fits Into This
All of this lives inside a Python function:
evaluate_joint_embedding_energy(context, candidate, verbose)
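Composed from the sketches above (so not self-contained here), the function is roughly this; the real implementation caches GloVe and may return more diagnostics:

```python
def evaluate_joint_embedding_energy(context, candidate, verbose=False):
    """Glue it all together: embeddings -> joint vectors -> scalar energy."""
    table = load_glove()  # in practice, loaded once and cached
    gpt_ctx, gpt_cand = gpt_embeddings(context, candidate)
    g_ctx = glove_sentence_vector(context, table)
    g_cand = glove_sentence_vector(candidate, table)
    joint_ctx = joint_embedding(g_ctx, gpt_ctx)
    joint_cand = joint_embedding(g_cand, gpt_cand)
    e = float(energy(joint_ctx, joint_cand))
    out = {"energy": e, "joint_similarity": 1.0 - e}
    if verbose:  # per-view diagnostics, as mentioned later in the article
        out["glove_similarity"] = float(np.dot(l2_normalize(g_ctx), l2_normalize(g_cand)))
        out["gpt_similarity"] = float(np.dot(l2_normalize(gpt_ctx), l2_normalize(gpt_cand)))
    return out
```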
I register that function as a tool in the OpenAI Responses API. Tools are described to the model with a JSON schema (sketched after this list):
- name of the function
- parameters (context, candidate, verbose)
- what the function does in natural language
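As a Python dict, the registration looks something like this; the description strings are my paraphrase, not the exact schema:

```python
tools = [{
    "type": "function",
    "name": "evaluate_joint_embedding_energy",
    "description": "Score how compatible a candidate sentence is with a context "
                   "sentence; returns a scalar energy (lower = more compatible).",
    "parameters": {
        "type": "object",
        "properties": {
            "context": {"type": "string", "description": "The context sentence."},
            "candidate": {"type": "string", "description": "The candidate sentence."},
            "verbose": {"type": "boolean", "description": "Include extra diagnostics."},
        },
        "required": ["context", "candidate"],
    },
}]
```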
Then the interaction looks like this:
I send GPT-5.1 a user message:
- “You are simulating a simple joint-embedding + energy-based model…
- Here’s a context and a candidate.
- Use the evaluate_joint_embedding_energy tool to score them, then explain the result.”
GPT-5.1 decides to call the tool. It emits a tool call with arguments:
{
  "name": "evaluate_joint_embedding_energy",
  "arguments": {
    "context": "...",
    "candidate": "...",
    "verbose": true
  }
}
My Python code sees that call and runs the local function:
- loads GloVe (once)
- calls the embeddings API
- builds the joint embeddings
- computes the energy and some diagnostics
I send the tool output back into the Responses API as function_call_output:

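A minimal sketch of that round trip with the openai SDK; here tool_call and response are the function-call item and response object from the previous turn, and the variable names are mine:

```python
import json

# Run the local function with the arguments GPT-5.1 supplied.
args = json.loads(tool_call.arguments)
result = evaluate_joint_embedding_energy(**args)

# Feed the numbers back so the model can explain them.
followup = client.responses.create(
    model="gpt-5.1",
    previous_response_id=response.id,
    input=[{
        "type": "function_call_output",
        "call_id": tool_call.call_id,
        "output": json.dumps(result),
    }],
    tools=tools,
)
print(followup.output_text)
```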
GPT-5.1 now has both:
- the original text pairs
- the numeric energy / similarity scores
and it generates a natural-language explanation like:
- “Energy is low, because these two sentences both describe a child playing with a ball in a park-like setting; the differences are small stylistic variants.”
So GPT-5.1 becomes a controller and explainer, while the energy model function is a local numerical primitive that touches the GloVe file and the embeddings API.
What It Felt Like When I Ran It
When I run the script with:
- Context: “A child is playing with a red ball in the park.”
- Candidate: “The kid happily throws the bright red ball across the playground.”
I see:
- The tool being called with exactly those arguments.
- GloVe vectors loading from disk.
- A final energy around 0.22 (on a loose [0, 2]-ish scale where lower is better).
- Joint similarity around 0.78, GloVe-only similarity a bit higher, GPT-only similarity a bit lower.

Then GPT-5.1 writes a short explanation saying, effectively:
- These sentences describe the same event with minor wording changes.
- The model considers the candidate a highly compatible rephrasing of the context.
- That’s why the energy is low.
It’s not magic, but it’s satisfying to see a simple structure behave in a way that matches intuition.
What’s missing versus a real EBM/JEPA?
- No learned parameters in the energy. The energy is a fixed function of pre-trained embeddings.
- No negative sampling. We never show the system “bad” (context, candidate) pairs to push their energy up.
- No sampling from the model. True EBMs often involve MCMC or other methods to sample low-energy configurations.
- No asymmetry. True conditional EBMs might care about direction (e.g., E(x→y)), while cosine is symmetric.
So this is structurally:
- A scoring module that emulates the shape of an EBM,
- On top of frozen representations,
- Used by GPT-5.1 as a tool to inform its reasoning and explanation.
In short: I’ve glued together
- GloVe (static lexical geometry),
- GPT embeddings (contextual geometry),
- and a cosine-based energy function,
and wrapped the whole thing as a callable tool that GPT-5.1 can orchestrate via the Responses API. Conceptually it behaves like a tiny, frozen EBM sitting under a JEPA-style joint latent, even though no training is actually happening.
Why I Like This Toy
Even with all those missing pieces, I think this kind of hack is useful:
- It makes the abstract “energy-based model” idea concrete: you can print numbers, sort candidates by energy, visualize how compatibility behaves.
- It shows how to wrap nontrivial numerical machinery behind a tool and let a large language model orchestrate it.
It demonstrates a simple pattern:
- put heavy lifting (embeddings, local files, linear algebra) in tools
- let GPT reason about what to call, when, and how to interpret the result
If I ever decide to push this closer to a real EBM, I already have:
- a way to define an energy over joint representations
- a way to generate positives and negatives
- a loop that can call into that scoring function as part of a larger system
For now, I’m happy with this tiny, fake JEPA+EBM that behaves just enough like a grown-up model to be interesting. 🙂
