Preparing Text for RAG
Adding vector search to the graph — creating a vector index (dimensions + similarity function), generating embeddings with genai.vector.encode and storing them as node properties, top-K similarity search with db.index.vector.queryNodes, parameterised queries for security, Neo4j as a co-located vector store, and how to read similarity scores.
Creating a vector index
A vector index tells Neo4j which node
property holds embeddings, how many dimensions they have, and which similarity
function to compare them with. With IF NOT EXISTS, creating it is idempotent:
CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}
| Parameter | Value | Why |
| --- | --- | --- |
| vector.dimensions | 1536 | Output size of OpenAI’s text-embedding-3-small; must match the embedding model exactly |
| vector.similarity_function | cosine | OpenAI’s recommendation; normalises away vector magnitude |
| IF NOT EXISTS | — | idempotent creation, safe to re-run |
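The claim that cosine "normalises away vector magnitude" can be checked by hand. The following is a minimal Python sketch of the formula, not Neo4j's implementation; the function name cosine_similarity and the toy three-dimensional vectors are illustrative assumptions that do not come from the course material.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values, far smaller than a real 1536-dim embedding).
v = [1.0, 2.0, 3.0]
scaled = [10 * x for x in v]      # same direction, ten times the magnitude
orthogonal = [3.0, 0.0, -1.0]     # dot product with v is exactly 0

same_direction = cosine_similarity(v, scaled)
right_angle = cosine_similarity(v, orthogonal)
```

Scaling a vector leaves the result unchanged because the magnitudes cancel in the denominator; only the direction of the two vectors contributes to the score.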
Write the Cypher to create a vector index on a Chunk node's textEmbedding property (1536 dims, cosine), and say what status means it's ready. KGR-3.1
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS FOR (c:Chunk) ON (c.textEmbedding) OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}. The dotted option keys need backticks, as in the example above. It’s ready when SHOW INDEXES reports its state as ONLINE (not POPULATING). The dimension must match the embedding model’s output exactly, or queries fail with a dimension-mismatch error.
Generating and storing embeddings
Neo4j’s genai.vector.encode calls the
embedding API from inside Cypher and writes the result onto the node. The API key
goes in as a query parameter, never the query string:
MATCH (m:Movie) WHERE m.tagline IS NOT NULL
WITH m, genai.vector.encode(m.tagline, "OpenAI", { token: $openAiApiKey }) AS vector
CALL db.create.setNodeVectorProperty(m, "taglineEmbedding", vector);
// verify: tagline text, first values, and the vector size (1536)
MATCH (m:Movie {title: "The Matrix"})
RETURN m.tagline, m.taglineEmbedding[0..5] AS sample, size(m.taglineEmbedding) AS dims
An embedding is a dense vector where
semantically similar texts sit close together; OpenAI’s text-embedding-3-small
outputs 1536 dimensions (its -3-large outputs 3072), which is exactly why the
index above declares 1536.
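Because the index dimension and the model output have to agree, a client that produces embeddings outside Cypher may want a guard before writing them. This is a small sketch under assumptions: check_embedding and INDEX_DIMENSIONS are hypothetical names, and the constant is assumed to mirror the `vector.dimensions` value declared on the index.

```python
import math

INDEX_DIMENSIONS = 1536  # assumed to mirror `vector.dimensions` on the index

def check_embedding(vector, expected_dims=INDEX_DIMENSIONS):
    """Return the vector as a list of floats, or raise if it cannot match the index."""
    values = [float(x) for x in vector]
    if len(values) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(values)}")
    if not all(math.isfinite(x) for x in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values

# A 1536-dim vector passes; a 3072-dim one (the -3-large size) would be rejected.
ok_vector = check_embedding([0.1] * 1536)
```

Failing early in the client makes a model swap (for example from a 1536-dimension model to a 3072-dimension one) visible at the point where it happens, rather than later at query time.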
What does genai.vector.encode do, and what three arguments does it take? KGR-3.2
It calls an external embedding API (e.g. OpenAI) from inside Cypher and returns the embedding as a float array. Its three arguments are the text to encode, the provider name ("OpenAI"), and a config map (including the API token, passed as a query parameter). Paired with db.create.setNodeVectorProperty, it stores the vector directly on the node — co-locating vectors with the graph.
Searching by similarity
With embeddings indexed, encode the question and find nearest neighbours with
db.index.vector.queryNodes (top-K):
WITH genai.vector.encode($question, "OpenAI", { token: $openAiApiKey }) AS qEmbedding
CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, qEmbedding)
YIELD node AS movie, score
RETURN movie.title, movie.tagline, score
For “movies about love” the top hits are Joe vs. the Volcano (0.89) and When Harry Met Sally (0.87); switch the question to “adventure” and entirely different movies surface — the search is semantic, not keyword-based.
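Conceptually, a top-K search scores the query vector against stored vectors and keeps the K best. The sketch below does this by exhaustive scan in plain Python to illustrate the idea; Neo4j's vector index uses an approximate nearest-neighbour structure instead, so this is not how db.index.vector.queryNodes is implemented. The names top_k and cosine, the item labels, and the three-dimensional vectors are all made up for illustration.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, items, k):
    """Return the k (name, score) pairs with the highest cosine score, best first."""
    scored = ((name, cosine(query_vec, vec)) for name, vec in items)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy "embeddings": axis 0 loosely stands for romance, axis 1 for adventure.
items = [
    ("romance_a", [0.9, 0.1, 0.0]),
    ("romance_b", [0.8, 0.3, 0.1]),
    ("adventure_a", [0.1, 0.9, 0.2]),
    ("adventure_b", [0.0, 0.8, 0.5]),
]

love_hits = top_k([1.0, 0.0, 0.0], items, k=2)
adventure_hits = top_k([0.0, 1.0, 0.0], items, k=2)
```

Changing only the query vector changes which items come back, which mirrors the behaviour described above: ranking depends on closeness in embedding space, not on shared keywords.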
Write a similarity-search query that encodes a question, searches a movie_tagline_embeddings index, and returns titles with scores. KGR-3.3
WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS q CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, q) YIELD node AS movie, score RETURN movie.title, score. Encode the question with the same model used for the stored embeddings, call queryNodes with the index name + top-K + query vector, and project title + score. All inputs ($question, $openAiApiKey, $topK) are query parameters.
Parameterised queries: security and speed
# WRONG: key in the query string
kg.query(f'... {{ token: "{api_key}" }} ...')
# RIGHT: key as a parameter
kg.query("... { token: $openAiApiKey } ...", params={"openAiApiKey": api_key})
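To make the difference concrete, the sketch below keeps the query text as a constant and returns the values separately. It is a hedged illustration: build_search and SEARCH_QUERY are hypothetical names, the API keys are placeholders, and nothing here talks to a database.

```python
SEARCH_QUERY = """
WITH genai.vector.encode($question, "OpenAI", { token: $openAiApiKey }) AS qEmbedding
CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, qEmbedding)
YIELD node AS movie, score
RETURN movie.title, movie.tagline, score
"""

def build_search(question, api_key, top_k=5):
    """Return (query_text, params); the text never contains the caller's values."""
    params = {"question": question, "openAiApiKey": api_key, "topK": top_k}
    return SEARCH_QUERY, params

# Two calls with different inputs produce exactly the same query text.
first_text, first_params = build_search("movies about love", "sk-placeholder-1")
second_text, second_params = build_search("adventure", "sk-placeholder-2", top_k=3)
```

The returned pair can be passed along in the same shape as the RIGHT example above, kg.query(text, params=params). Since the text is identical on every call, neither the key nor the user's question can end up in it.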
Parameters buy security and plan caching KGR-3.5
Query parameters keep secrets and user input
out of the query string (no injection, no leakage in logs) and let Neo4j
cache the execution plan: the same parameterised query reused across calls
skips re-planning. When it breaks: DDL statements (CREATE INDEX,
CREATE CONSTRAINT) can’t take labels or property names as parameters, so sanitise those inputs manually.
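One way to sanitise inputs for a statement that cannot be fully parameterised is to validate every identifier against a strict allow-list pattern before building the text. The following is a sketch under assumptions, not a complete defence: build_vector_index_ddl is a hypothetical helper, and the identifier pattern is deliberately narrower than what Cypher actually permits.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_SIMILARITY_FUNCTIONS = {"cosine", "euclidean"}

def build_vector_index_ddl(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement from validated pieces only."""
    for value in (index_name, label, prop):
        if not isinstance(value, str) or not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if isinstance(dimensions, bool) or not isinstance(dimensions, int) or dimensions <= 0:
        raise ValueError(f"dimensions must be a positive integer, got {dimensions!r}")
    if similarity not in _SIMILARITY_FUNCTIONS:
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )

ddl = build_vector_index_ddl("chunk_embeddings", "Chunk", "textEmbedding", 1536)
```

Rejecting anything that is not a plain identifier means a value such as "Chunk) DETACH DELETE (n" raises an error instead of becoming part of the statement.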
Why pass an API key as a query parameter instead of interpolating it into the Cypher string? Name two benefits. KGR-3.5
Security — the key never appears in the query string, so it can’t leak into logs/error messages/the query cache, and there’s no injection surface. Performance — Neo4j caches the execution plan for a parameterised query and reuses it across calls with different values, skipping re-planning. (Caveat: DDL like CREATE INDEX can’t be parameterised.)
Neo4j as a vector store
Most RAG stacks bolt a separate vector database onto their primary store. Neo4j’s vector indexes remove that split: embeddings are node properties in the same database as the graph, so one query can search by similarity, traverse relationships, and return both the matched text and its graph context.
| Capability | Dedicated vector DB | Neo4j |
| --- | --- | --- |
| Vector similarity search | yes | yes |
| Graph traversal | no | yes |
| Full-text search | sometimes | yes |
| Combined vector + graph query | no | yes |
| Systems to operate | two | one |
What is the single biggest advantage of co-locating embeddings in Neo4j versus a separate vector database, and when might the separate DB still win? KGR-3.6
The advantage is combined queries: one Cypher query can do vector similarity search and graph traversal (retrieve a chunk, then follow its relationships) in a single round trip, with one system to operate. A dedicated vector DB can still win when the vector workload is large and standalone and you need advanced ANN algorithms or frequent embedding-model swaps that Neo4j’s vector support doesn’t match — then best-of-breed beats single-engine.
Reading similarity scores
When you evaluate results, read the shape, not just the top value: a clear winner (0.92 vs 0.75 for the rest) is high confidence; a flat band (0.78–0.82) means an ambiguous query or no strong match. A score threshold (say 0.7) filters weak context that would otherwise tempt the LLM to hallucinate — and scores are relative, so 0.85 on movie taglines is not the same quality as 0.85 on SEC filings.
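Reading the shape of a result set can be turned into a small check. This sketch is only one possible heuristic: read_scores is a hypothetical helper, and both the 0.7 threshold and the 0.1 gap that counts as a "clear winner" are assumed values that would need tuning per corpus.

```python
def read_scores(scores, threshold=0.7, clear_gap=0.1):
    """Filter weak hits and label the shape of the score distribution.

    Returns (kept, shape): kept holds the scores at or above the threshold,
    best first; shape is 'clear winner', 'flat band', or 'no strong match'.
    """
    ranked = sorted(scores, reverse=True)
    kept = [s for s in ranked if s >= threshold]
    if not kept:
        shape = "no strong match"
    elif len(ranked) == 1 or ranked[0] - ranked[1] >= clear_gap:
        shape = "clear winner"
    else:
        shape = "flat band"
    return kept, shape

winner = read_scores([0.75, 0.92, 0.74])   # top score well ahead of the rest
band = read_scores([0.80, 0.82, 0.78])     # scores bunched together
weak = read_scores([0.40, 0.41, 0.39])     # everything below the threshold
```

An empty kept list is a signal to pass no retrieved context to the LLM at all, rather than padding the prompt with weak matches.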
A similarity search returns uniformly low scores (all ~0.4). Name two possible causes. KGR-3.7
Two of: the query is outside the corpus’s domain, so nothing stored is genuinely similar and even the nearest neighbours are far away; a model mismatch (the query was encoded with a different model than the stored embeddings, so the vectors are not comparable); or the embeddings were not fully populated or the index isn’t ONLINE, so the best matches are missing from the search. Low absolute scores can also just mean this corpus’s scores run low, so read the distribution, not one number.
Summary
A vector index declares the embedding property, its dimensionality, and the
similarity function. genai.vector.encode generates embeddings from inside Cypher
and stores them as node properties — co-located with the graph, so no separate
vector DB. db.index.vector.queryNodes does top-K nearest-neighbour search; query
parameters keep API keys safe and enable plan caching; and cosine scores are
relative signals to be read as a distribution, not absolute truth. Right now Neo4j
is “just” a vector store — the transformative combination of vector search with
graph traversal arrives in Module 5.
- Chapter 4 — constructing the graph: SEC 10-K chunks become nodes, indexed for vector search.
- Chapters 5–6 — adding relationships and expanding the SEC graph.
- Chapter 7 — combined retrieval with LLM-generated Cypher.