Part 1 Chapter 3 Last verified 2026-06-19

Preparing Text for RAG

Adding vector search to the graph — creating a vector index (dimensions + similarity function), generating embeddings with genai.vector.encode and storing them as node properties, top-K similarity search with db.index.vector.queryNodes, parameterised queries for security, Neo4j as a co-located vector store, and how to read similarity scores.

On this page

Creating a vector index
Generating and storing embeddings
Searching by similarity
Parameterised queries: security and speed
Neo4j as a vector store
Reading similarity scores
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Performing vector similarity search; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

You created a vector index with vector.dimensions: 768 but your embeddings are 1536-dim. Predict what a similarity search returns.
You need to pass an OpenAI API key into a Cypher query. Predict the right mechanism — f-string interpolation or a query parameter — and one risk of the wrong one.
A cosine similarity search for “movies about love” returns top scores of 0.89 and 0.87. Predict roughly what 0.89 means — strong, moderate, or weak match.
Most RAG stacks run a separate vector database next to their main store. Predict the single biggest win from putting the vectors inside Neo4j instead.

Check your answers

It errors — a query whose vector dimension doesn’t match the index fails with a dimension-mismatch error (not silent wrong results); the index dimension must equal the model’s output (1536).
A query parameter ($openAiApiKey via params). An f-string leaks the key into logs, error messages, and the query cache — and invites injection.
Strong — for normalised (cosine) embeddings, ~0.89 is a strong semantic match; ~0.75 moderate, below 0.5 weak (thresholds are corpus-relative).
Co-location — one query can do vector search and graph traversal, so you retrieve a chunk and follow its relationships without a second system or round trip.

Creating a vector index

A vector index tells Neo4j which node property holds embeddings, how many dimensions they have, and which similarity function to compare them with. It’s idempotent with IF NOT EXISTS: [V] Verified

CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}

| Parameter | Value | Why |
| --- | --- | --- |
| vector.dimensions | 1536 | OpenAI’s default embedding size; must match the model exactly |
| vector.similarity_function | cosine | OpenAI’s recommendation; normalises away vector magnitude |
| IF NOT EXISTS | - | idempotent creation, safe to re-run |
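To make "normalises away vector magnitude" concrete, here is a minimal pure-Python sketch of cosine similarity on toy 3-dimensional vectors. The values are hypothetical, not real embeddings, and the helper name `cosine_similarity` is introduced only for illustration. Scaling a vector changes its length but not its direction, so the cosine is unchanged.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [0.2, 0.5, 0.1]            # toy vector (hypothetical values)
scaled = [10 * x for x in v]   # same direction, ten times the magnitude
other = [0.5, -0.2, 0.0]       # chosen so that its dot product with v is zero

same_direction = cosine_similarity(v, scaled)  # magnitude is ignored
orthogonal = cosine_similarity(v, other)       # unrelated direction
```

The dimension check mirrors the rule in the table: two vectors can only be compared when they have the same number of dimensions.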

Write the Cypher to create a vector index on a Chunk node's textEmbedding property (1536 dims, cosine), and say what status means it's ready. KGR-3.1

CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS FOR (c:Chunk) ON (c.textEmbedding) OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}. The option keys need backticks because they contain a dot. It’s ready when SHOW INDEXES reports its state as ONLINE (not POPULATING). The dimension must match the embedding model’s output exactly, or queries fail with a dimension-mismatch error.

Generating and storing embeddings

Neo4j’s genai.vector.encode calls the embedding API from inside Cypher and writes the result onto the node. The API key goes in as a query parameter, never the query string:

MATCH (m:Movie) WHERE m.tagline IS NOT NULL
WITH m, genai.vector.encode(m.tagline, "OpenAI", { token: $openAiApiKey }) AS vector
CALL db.create.setNodeVectorProperty(m, "taglineEmbedding", vector)

// verify: tagline text, first values, and the vector size (1536)
MATCH (m:Movie {title: "The Matrix"})
RETURN m.tagline, m.taglineEmbedding[0..5] AS sample, size(m.taglineEmbedding) AS dims

An embedding is a dense vector where semantically similar texts sit close together; OpenAI’s text-embedding-3-small outputs 1536 dimensions (its -3-large outputs 3072), which is exactly why the index above declares 1536.
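The two Cypher statements above can be driven from Python along these lines. This is a sketch under assumptions: `kg` is taken to be any object exposing `query(text, params=...)` that returns rows as a list of dicts, as in the `kg.query` calls used later in this chapter, and the function name `embed_and_verify` is hypothetical.

```python
# Write embeddings onto Movie nodes; the API key is a parameter, not text.
ENCODE_QUERY = """
MATCH (m:Movie) WHERE m.tagline IS NOT NULL
WITH m, genai.vector.encode(m.tagline, "OpenAI", { token: $openAiApiKey }) AS vector
CALL db.create.setNodeVectorProperty(m, "taglineEmbedding", vector)
"""

# Read back the stored vector's size for one movie.
VERIFY_QUERY = """
MATCH (m:Movie {title: $title})
RETURN size(m.taglineEmbedding) AS dims
"""


def embed_and_verify(kg, api_key, title="The Matrix", expected_dims=1536):
    """Generate embeddings, then confirm one stored vector has the expected size."""
    kg.query(ENCODE_QUERY, params={"openAiApiKey": api_key})
    rows = kg.query(VERIFY_QUERY, params={"title": title})
    if not rows or rows[0]["dims"] is None:
        raise RuntimeError(f"no embedding stored for {title!r}")
    dims = rows[0]["dims"]
    if dims != expected_dims:
        raise ValueError(f"stored {dims} dimensions, index expects {expected_dims}")
    return dims
```

Checking the size right after writing catches a model/index mismatch before any search is attempted.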

What does genai.vector.encode do, and what three arguments does it take? KGR-3.2

It calls an external embedding API (e.g. OpenAI) from inside Cypher and returns the embedding as a float array. Its three arguments are the text to encode, the provider name ("OpenAI"), and a config map (including the API token, passed as a query parameter). Paired with db.create.setNodeVectorProperty, it stores the vector directly on the node — co-locating vectors with the graph.

Searching by similarity

With embeddings indexed, encode the question and find nearest neighbours with db.index.vector.queryNodes (top-K):

WITH genai.vector.encode($question, "OpenAI", { token: $openAiApiKey }) AS qEmbedding
CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, qEmbedding)
  YIELD node AS movie, score
RETURN movie.title, movie.tagline, score

For “movies about love” the top hits are Joe vs. the Volcano (0.89) and When Harry Met Sally (0.87); switch the question to “adventure” and entirely different movies surface — the search is semantic, not keyword-based.
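A thin Python wrapper around the search query might look like the following. It is a sketch, not a definitive implementation: `kg` is assumed to be an object with a `query(text, params=...)` method returning a list of dict rows (the shape used by the `kg.query` calls in this chapter), and `search_movies` is a hypothetical name.

```python
SEARCH_QUERY = """
WITH genai.vector.encode($question, "OpenAI", { token: $openAiApiKey }) AS qEmbedding
CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, qEmbedding)
  YIELD node AS movie, score
RETURN movie.title AS title, movie.tagline AS tagline, score
"""


def search_movies(kg, question, api_key, top_k=5):
    """Return (title, score) pairs for the top_k nearest taglines."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    rows = kg.query(
        SEARCH_QUERY,
        params={"question": question, "openAiApiKey": api_key, "topK": top_k},
    )
    return [(row["title"], row["score"]) for row in rows]
```

Because the question, the key, and K are all parameters, the same query text serves every search; only the parameter map changes between calls.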

Write a similarity-search query that encodes a question, searches a movie_tagline_embeddings index, and returns titles with scores. KGR-3.3

WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS q CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, q) YIELD node AS movie, score RETURN movie.title, score. Encode the question with the same model used for the stored embeddings, call queryNodes with the index name + top-K + query vector, and project title + score. All inputs ($question, $openAiApiKey, $topK) are query parameters.

Parameterised queries: security and speed

# WRONG: key in the query string
kg.query(f'... {{ token: "{api_key}" }} ...')
# RIGHT: key as a parameter
kg.query("... { token: $openAiApiKey } ...", params={"openAiApiKey": api_key})

Key concept

Parameters buy security and plan caching

KGR-3.5

Query parameters keep secrets and user input out of the query string (no injection, no leakage in logs) and let Neo4j cache the execution plan — the same parameterised query reused across calls skips re-planning. When it breaks: DDL statements (CREATE INDEX, CREATE CONSTRAINT) can’t take parameters, so sanitise those inputs manually. [V] Verified
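One way to "sanitise those inputs manually" is to validate every identifier against a strict pattern before building the DDL string. The sketch below assumes a conservative allow-list (ASCII letters, digits, underscore) that is stricter than what Cypher itself permits; the function name and the pattern are illustrative choices, not part of any Neo4j API.

```python
import re

# Conservative allow-list: ASCII letters, digits, underscore; no leading digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_SIMILARITY_FUNCTIONS = {"cosine", "euclidean"}


def build_vector_index_ddl(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement from validated inputs."""
    for value in (index_name, label, prop):
        if not isinstance(value, str) or not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if isinstance(dimensions, bool) or not isinstance(dimensions, int) or dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    if similarity not in _SIMILARITY_FUNCTIONS:
        raise ValueError(f"unknown similarity function: {similarity!r}")
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


ddl = build_vector_index_ddl("chunk_embeddings", "Chunk", "textEmbedding", 1536)
```

Rejecting anything outside the allow-list is deliberately blunt: a legitimate but unusual name fails loudly, which is preferable to letting user-supplied text reach an unparameterised statement.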

Why pass an API key as a query parameter instead of interpolating it into the Cypher string? Name two benefits. KGR-3.5

Security — the key never appears in the query string, so it can’t leak into logs/error messages/the query cache, and there’s no injection surface. Performance — Neo4j caches the execution plan for a parameterised query and reuses it across calls with different values, skipping re-planning. (Caveat: DDL like CREATE INDEX can’t be parameterised.)

Neo4j as a vector store

Most RAG stacks bolt a separate vector database onto their primary store. Neo4j’s vector indexes remove that split — embeddings are node properties in the same database as the graph, so one query can search by similarity, traverse relationships, and return both the matched text and its graph context. [V] Verified

| Capability | Dedicated vector DB | Neo4j |
| --- | --- | --- |
| Vector similarity search | yes | yes |
| Graph traversal | no | yes |
| Full-text search | sometimes | yes |
| Combined vector + graph query | no | yes |
| Systems to operate | two | one |

What is the single biggest advantage of co-locating embeddings in Neo4j versus a separate vector database, and when might the separate DB still win? KGR-3.6

The advantage is combined queries: one Cypher query can do vector similarity search and graph traversal (retrieve a chunk, then follow its relationships) in a single round trip, with one system to operate. A dedicated vector DB can still win when the vector workload is large and standalone and you need advanced ANN algorithms or frequent embedding-model swaps that Neo4j’s vector support doesn’t match — then best-of-breed beats single-engine.

Reading similarity scores

When you evaluate results, read the shape, not just the top value: a clear winner (0.92 vs 0.75 for the rest) is high confidence; a flat band (0.78–0.82) means an ambiguous query or no strong match. A score threshold (say 0.7) filters weak context that would otherwise tempt the LLM to hallucinate — and scores are relative, so 0.85 on movie taglines is not the same quality as 0.85 on SEC filings.
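Reading the shape of a result set can be sketched in a few lines. The 0.7 threshold comes from the paragraph above; the 0.10 gap used to call something a "clear winner" is a hypothetical cut-off chosen for illustration and would need calibrating per corpus.

```python
def describe_scores(scores, threshold=0.7, gap=0.10):
    """Classify a result set by its shape and return the scores that survive."""
    ranked = sorted(scores, reverse=True)
    kept = [s for s in ranked if s >= threshold]
    if not kept:
        shape = "no match above threshold"
    elif len(kept) == 1 or kept[0] - kept[1] >= gap:
        shape = "clear winner"   # top hit stands well apart from the rest
    else:
        shape = "flat band"      # top hits are bunched together
    return shape, kept


winner_shape, _ = describe_scores([0.92, 0.75, 0.74])
flat_shape, _ = describe_scores([0.82, 0.80, 0.78])
weak_shape, weak_kept = describe_scores([0.40, 0.35])
```

A "flat band" result is a cue to treat the retrieved context with caution (or rephrase the query) rather than pass it straight to the LLM.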

Interpret a result set and set a threshold Worked example

Problem. A search for “renewable energy initiatives” over corporate filings returns: solar manufacturing 0.91, wind investment 0.88, carbon-credit trading 0.84, employee volunteering 0.82, office recycling 0.79. Which are relevant? What threshold? How to improve precision without changing the embedding model?

Reasoning. The top three are squarely about energy/environment; the bottom two are corporate-responsibility neighbours, not energy — semantic near-misses. A threshold around 0.83 keeps the genuine matches (0.84 and up) and drops the rest (tune by need: ~0.78 for high recall, ~0.88+ for high precision). Precision without a new model: add a re-ranker (cross-encoder), chunk more granularly (more specific embeddings), pre-filter by metadata, or go hybrid (vector + keyword).

Answer. Relevant: ranks 1–3; false positives: 4–5. Threshold ≈ 0.83 (just below the 0.84 you want to keep). Improve with re-ranking, smaller chunks, metadata filtering, or hybrid search — because scores are relative, calibrate the threshold against this corpus, not an absolute.
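The effect of the three candidate thresholds on this result set can be checked directly. The sketch below uses only the labels and scores given in the problem; `apply_threshold` is a hypothetical helper name.

```python
# Result set from the worked example: (label, cosine score).
results = [
    ("solar manufacturing", 0.91),
    ("wind investment", 0.88),
    ("carbon-credit trading", 0.84),
    ("employee volunteering", 0.82),
    ("office recycling", 0.79),
]


def apply_threshold(results, threshold):
    """Keep the labels whose score is at or above the threshold."""
    return [label for label, score in results if score >= threshold]


# High recall, the suggested setting, and high precision.
for threshold in (0.78, 0.83, 0.88):
    kept = apply_threshold(results, threshold)
    print(f"threshold {threshold:.2f}: keeps {len(kept)} -> {kept}")
```

At 0.78 all five results pass, at 0.83 only the three energy-related ones, and at 0.88 only the top two, which is the recall/precision trade-off described in the reasoning.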

Configure an index and debug zero results Worked example

Problem. Set up a vector index for 50,000 products whose description embeddings are 768-dimensional. Then: a search returns zero results — three likely causes?

Reasoning.

CREATE VECTOR INDEX product_embeddings IF NOT EXISTS
FOR (p:Product) ON (p.descriptionEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 768,
  `vector.similarity_function`: 'cosine'
}}

(Use 768 to match the model; cosine is the safe default — for normalised embeddings cosine and Euclidean rank identically, so prefer Euclidean only when vector magnitude carries meaning, e.g. raw TF-IDF.)
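The claim that cosine and Euclidean rank identically for normalised embeddings follows from the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors. A small sketch with toy 3-dimensional vectors (hypothetical values, not real embeddings) illustrates it:

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def cosine(a, b):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = normalise([1.0, 0.2, 0.0])
docs = {
    "a": normalise([0.9, 0.3, 0.1]),
    "b": normalise([0.1, 1.0, 0.2]),
    "c": normalise([-0.5, 0.4, 0.8]),
}

# Highest cosine first versus smallest distance first.
by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_euclidean = sorted(docs, key=lambda k: euclidean(query, docs[k]))
```

Both orderings come out the same because distance is a decreasing function of cosine once every vector has length 1; without the `normalise` step that guarantee disappears.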

Answer. Zero/empty results usually means: (1) the index is still POPULATING, not ONLINE; or (2) no embeddings stored (the property is null on every node). A dimension mismatch (index 768 vs embeddings 1536) is a different symptom — the query fails with an error, not silent emptiness. Debug with the checklist: index status → embedding population → dimension match → query-vector format.
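The checklist order can be written down as a small decision function. This is a sketch: the inputs are assumed to have been gathered separately (for example the index state from SHOW INDEXES and a count of nodes whose embedding property is not null), and the function name and messages are illustrative.

```python
def diagnose(index_state, embedded_count, index_dims, query_dims):
    """Walk the checklist in order and report the first problem found."""
    if index_state != "ONLINE":
        return f"index is {index_state}, not ONLINE: wait for it to finish populating"
    if embedded_count == 0:
        return "no embeddings stored: the property is null on every node"
    if index_dims != query_dims:
        return (
            f"dimension mismatch: index {index_dims} vs query {query_dims} "
            "(expect an error, not empty results)"
        )
    return "index, embeddings and dimensions look fine: check the query-vector format"


still_building = diagnose("POPULATING", 50_000, 768, 768)
nothing_stored = diagnose("ONLINE", 0, 768, 768)
wrong_model = diagnose("ONLINE", 50_000, 768, 1536)
all_clear = diagnose("ONLINE", 50_000, 768, 768)
```

Checking in this order means the cheapest and most common causes are ruled out first.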

A similarity search returns uniformly low scores (all ~0.4). Name two possible causes. KGR-3.7

Two of: the query is out of the corpus’s domain, or the corpus genuinely lacks relevant content (nothing stored is similar, so even the nearest neighbours are far); a model mismatch (the query was encoded with a different model than the stored embeddings; if the dimensions also differ, the query errors instead of scoring low); or embeddings are only partly populated, so the best matches are not searchable yet (a property that is null everywhere, or an index that isn’t ONLINE, gives empty results rather than low scores). Low absolute scores can also just mean this corpus’s scores run low, so read the distribution, not one number.

Summary

A vector index declares the embedding property, its dimensionality, and the similarity function. genai.vector.encode generates embeddings from inside Cypher, and db.create.setNodeVectorProperty stores them as node properties, co-located with the graph, so no separate vector DB is needed. db.index.vector.queryNodes does top-K nearest-neighbour search; query parameters keep API keys safe and enable plan caching; and cosine scores are relative signals to be read as a distribution, not absolute truth. Right now Neo4j is “just” a vector store; the transformative combination of vector search with graph traversal arrives in Module 5.

Chapter 4 — constructing the graph: SEC 10-K chunks become nodes, indexed for vector search.
Chapters 5–6 — adding relationships and expanding the SEC graph.
Chapter 7 — combined retrieval with LLM-generated Cypher.