Part 1 Chapter 3 Last verified 2026-06-19

Preparing Text for RAG

Adding vector search to the graph — creating a vector index (dimensions + similarity function), generating embeddings with genai.vector.encode and storing them as node properties, top-K similarity search with db.index.vector.queryNodes, parameterised queries for security, Neo4j as a co-located vector store, and how to read similarity scores.

On this page
  1. Creating a vector index
  2. Generating and storing embeddings
  3. Searching by similarity
  4. Parameterised queries: security and speed
  5. Neo4j as a vector store
  6. Reading similarity scores
  7. Summary

Creating a vector index

A vector index tells Neo4j which node property holds embeddings, how many dimensions they have, and which similarity function to compare them with. It’s idempotent with IF NOT EXISTS: [V] Verified

CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
FOR (m:Movie) ON (m.taglineEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}

| Parameter | Value | Why |
| --- | --- | --- |
| vector.dimensions | 1536 | OpenAI’s default embedding size; must match the model exactly |
| vector.similarity_function | cosine | OpenAI’s recommendation; normalises away vector magnitude |
| IF NOT EXISTS | n/a | idempotent creation, safe to re-run |

Write the Cypher to create a vector index on a Chunk node's textEmbedding property (1536 dims, cosine), and say what status means it's ready. KGR-3.1

CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}

The config keys need backticks because they contain dots. The index is ready when SHOW INDEXES reports its state as ONLINE (not POPULATING). The dimension must match the embedding model’s output exactly, or queries fail with a dimension-mismatch error.

Generating and storing embeddings

Neo4j’s genai.vector.encode calls the embedding API from inside Cypher and writes the result onto the node. The API key goes in as a query parameter, never the query string:

// statement 1: encode every tagline and store the vector on its node
MATCH (m:Movie) WHERE m.tagline IS NOT NULL
WITH m, genai.vector.encode(m.tagline, "OpenAI", { token: $openAiApiKey }) AS vector
CALL db.create.setNodeVectorProperty(m, "taglineEmbedding", vector);

// statement 2 (run separately) to verify: tagline text, first five values, and the vector size (1536)
MATCH (m:Movie {title: "The Matrix"})
RETURN m.tagline, m.taglineEmbedding[0..5] AS sample, size(m.taglineEmbedding) AS dims

An embedding is a dense vector where semantically similar texts sit close together; OpenAI’s text-embedding-3-small outputs 1536 dimensions (its -3-large outputs 3072), which is exactly why the index above declares 1536.
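To make "close together" concrete, here is a minimal pure-Python sketch of cosine similarity, the function the index above is configured with. The three-dimensional vectors and their names are made up for illustration (real embeddings from this model have 1536 components), and the function is a plain textbook implementation, not Neo4j's internal one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, not real embeddings.
love = [0.9, 0.1, 0.0]
romance = [1.8, 0.2, 0.0]  # same direction as `love`, twice the magnitude
heist = [0.0, 0.2, 0.9]    # points in a very different direction

same_direction = cosine_similarity(love, romance)
different_topic = cosine_similarity(love, heist)
```

Because `romance` is just `love` scaled by two, the two score as identical in direction: this is what "normalises away vector magnitude" means. The length check mirrors why the index dimension has to match the model's output.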

What does genai.vector.encode do, and what three arguments does it take? KGR-3.2

It calls an external embedding API (e.g. OpenAI) from inside Cypher and returns the embedding as a float array. Its three arguments are the text to encode, the provider name ("OpenAI"), and a config map (including the API token, passed as a query parameter). Paired with db.create.setNodeVectorProperty, it stores the vector directly on the node — co-locating vectors with the graph.

Searching by similarity

With embeddings indexed, encode the question and find nearest neighbours with db.index.vector.queryNodes (top-K):

WITH genai.vector.encode($question, "OpenAI", { token: $openAiApiKey }) AS qEmbedding
CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, qEmbedding)
  YIELD node AS movie, score
RETURN movie.title, movie.tagline, score

For “movies about love” the top hits are Joe vs. the Volcano (0.89) and When Harry Met Sally (0.87); switch the question to “adventure” and entirely different movies surface — the search is semantic, not keyword-based.
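Conceptually, a top-K search scores every stored vector against the query vector and keeps the K best. The sketch below does that exhaustively in plain Python, with hypothetical titles and toy vectors. It illustrates the idea only: a vector index is there to avoid this full scan, and the scores Neo4j reports are not guaranteed to be on the same scale as the raw cosine values computed here.

```python
import math


def cosine(a, b):
    # Assumes equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, items, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [(title, cosine(query_vec, vec)) for title, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical stored embeddings, keyed by a made-up title.
stored = {
    "Film A": [0.9, 0.1, 0.0],
    "Film B": [0.7, 0.3, 0.1],
    "Film C": [0.0, 0.2, 0.9],
}

query = [1.0, 0.0, 0.0]  # stands in for the encoded question
results = top_k(query, stored, k=2)
```

Changing `query` to point along the third axis would bring "Film C" to the top, which is the same effect as switching the question in the paragraph above.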

Write a similarity-search query that encodes a question, searches a movie_tagline_embeddings index, and returns titles with scores. KGR-3.3

WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS q CALL db.index.vector.queryNodes('movie_tagline_embeddings', $topK, q) YIELD node AS movie, score RETURN movie.title, score. Encode the question with the same model used for the stored embeddings, call queryNodes with the index name + top-K + query vector, and project title + score. All inputs ($question, $openAiApiKey, $topK) are query parameters.

Parameterised queries: security and speed

# WRONG: key in the query string
kg.query(f'... {{ token: "{api_key}" }} ...')
# RIGHT: key as a parameter
kg.query("... { token: $openAiApiKey } ...", params={"openAiApiKey": api_key})

Key concept

Parameters buy security and plan caching

KGR-3.5

Query parameters keep secrets and user input out of the query string (no injection, no leakage in logs) and let Neo4j cache the execution plan — the same parameterised query reused across calls skips re-planning. When it breaks: DDL statements (CREATE INDEX, CREATE CONSTRAINT) can’t take parameters, so sanitise those inputs manually. [V] Verified
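One way to sanitise DDL inputs manually is to accept only a strict identifier pattern before building the statement. The sketch below is an assumption about how that could look, not a Neo4j API: `build_vector_index_ddl`, its allow-list, and the identifier pattern are hypothetical names and choices made for illustration.

```python
import re

# Conservative allow-list: letters, digits, underscore; must not start with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
_SIMILARITY_FUNCTIONS = {"cosine", "euclidean"}


def build_vector_index_ddl(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement from validated inputs."""
    for value in (index_name, label, prop):
        if not _IDENTIFIER.fullmatch(value):
            raise ValueError(f"unsafe identifier: {value!r}")
    if isinstance(dimensions, bool) or not isinstance(dimensions, int) or dimensions <= 0:
        raise ValueError(f"dimensions must be a positive integer, got {dimensions!r}")
    if similarity not in _SIMILARITY_FUNCTIONS:
        raise ValueError(f"unsupported similarity function: {similarity!r}")
    # Only validated values reach the string, so interpolation is safe here.
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS\n"
        f"FOR (n:{label}) ON (n.{prop})\n"
        "OPTIONS { indexConfig: {\n"
        f"  `vector.dimensions`: {dimensions},\n"
        f"  `vector.similarity_function`: '{similarity}'\n"
        "}}"
    )


ddl = build_vector_index_ddl("chunk_embeddings", "Chunk", "textEmbedding", 1536)
```

Rejecting anything outside the pattern is deliberately stricter than what Cypher allows in identifiers; for DDL built from external input, a narrow allow-list is usually easier to reason about than escaping.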

Why pass an API key as a query parameter instead of interpolating it into the Cypher string? Name two benefits. KGR-3.5

Security — the key never appears in the query string, so it can’t leak into logs/error messages/the query cache, and there’s no injection surface. Performance — Neo4j caches the execution plan for a parameterised query and reuses it across calls with different values, skipping re-planning. (Caveat: DDL like CREATE INDEX can’t be parameterised.)

Neo4j as a vector store

Most RAG stacks bolt a separate vector database onto their primary store. Neo4j’s vector indexes remove that split — embeddings are node properties in the same database as the graph, so one query can search by similarity, traverse relationships, and return both the matched text and its graph context. [V] Verified

| Capability | Dedicated vector DB | Neo4j |
| --- | --- | --- |
| Vector similarity search | yes | yes |
| Graph traversal | no | yes |
| Full-text search | sometimes | yes |
| Combined vector + graph query | no | yes |
| Systems to operate | two | one |

What is the single biggest advantage of co-locating embeddings in Neo4j versus a separate vector database, and when might the separate DB still win? KGR-3.6

The advantage is combined queries: one Cypher query can do vector similarity search and graph traversal (retrieve a chunk, then follow its relationships) in a single round trip, with one system to operate. A dedicated vector DB can still win when the vector workload is large and standalone and you need advanced ANN algorithms or frequent embedding-model swaps that Neo4j’s vector support doesn’t match — then best-of-breed beats single-engine.

Reading similarity scores

When you evaluate results, read the shape, not just the top value: a clear winner (0.92 vs 0.75 for the rest) is high confidence; a flat band (0.78–0.82) means an ambiguous query or no strong match. A score threshold (say 0.7) filters weak context that would otherwise tempt the LLM to hallucinate — and scores are relative, so 0.85 on movie taglines is not the same quality as 0.85 on SEC filings.
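The "read the shape" heuristic can be sketched as a small function that applies the threshold and then looks at the gap between the best score and the runner-up. The 0.7 threshold comes from the text above; the function name `read_scores` and the 0.1 gap used to call something a clear winner are assumptions for illustration, and both numbers would need tuning per corpus since scores are relative.

```python
def read_scores(scores, threshold=0.7, clear_gap=0.1):
    """Filter weak matches, then classify the shape of what is left."""
    kept = sorted((s for s in scores if s >= threshold), reverse=True)
    if not kept:
        # Nothing passed the threshold: better to return no context at all.
        return {"kept": [], "shape": "no strong match"}
    if len(kept) == 1 or kept[0] - kept[1] >= clear_gap:
        shape = "clear winner"
    else:
        shape = "flat band"
    return {"kept": kept, "shape": shape}


confident = read_scores([0.92, 0.75, 0.74])
ambiguous = read_scores([0.82, 0.80, 0.78])
weak = read_scores([0.41, 0.40, 0.39])
```

A caller might pass only `kept` to the LLM and treat "flat band" or "no strong match" as a cue to ask a clarifying question instead of answering.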

A similarity search returns uniformly low scores (all ~0.4). Name two possible causes. KGR-3.7

Two of: the query is out of the corpus’s domain (nothing stored is genuinely similar, so even the nearest neighbours are far); a dimension/model mismatch (the query was encoded with a different model than the stored embeddings); embeddings were never populated or the index isn’t ONLINE; or the corpus genuinely lacks relevant content. Low absolute scores can also just mean this corpus’s scores run low — read the distribution, not one number.

Summary

A vector index declares the embedding property, its dimensionality, and the similarity function. genai.vector.encode generates embeddings from inside Cypher and stores them as node properties — co-located with the graph, so no separate vector DB. db.index.vector.queryNodes does top-K nearest-neighbour search; query parameters keep API keys safe and enable plan caching; and cosine scores are relative signals to be read as a distribution, not absolute truth. Right now Neo4j is “just” a vector store — the transformative combination of vector search with graph traversal arrives in Module 5.

  • Chapter 4: constructing the graph, where SEC 10-K chunks become nodes, indexed for vector search.
  • Chapters 5–6: adding relationships and expanding the SEC graph.
  • Chapter 7: combined retrieval with LLM-generated Cypher.