Answers & rationales

108 questions, grouped by chapter, answers revealed. Test yourself first in the practice questions.

Chapter 1

kgr-1-applications kg-fundamentals

Give two real-world applications of knowledge graphs outside RAG, and state what the graph encodes in each.

Two of: Web search — knowledge cards (a celebrity's birth date, films, related people) are graph look-ups that aggregate a node's properties and its relationships into a summary panel; the graph encodes entities (people, companies, places) linked by typed relationships. E-commerce / recommendations — "customers who bought X also bought Y" traverses purchase relationships, and multi-hop preferences fall out of the graph; it encodes users, products, and purchase/view edges. Financial analysis — investments, corporate structures, and filings form a natural graph of companies, managers, and ownership relationships (the SEC dataset this guide uses). In each, the graph encodes *entities and the typed relationships between them*. Common wrong answer: "knowledge graphs are only useful for RAG" — they predate RAG and power search, recommendations, and fraud detection.

kgr-1-cypher-domain kg-fundamentals

Write the Cypher text notation for: Professor “Dr. Smith” teaches “Machine Learning 101”, which belongs to the “CS” department. Name the labels and relationship types you used.

Write nodes in parentheses (label after a colon, properties in braces), relationships in square brackets with a typed arrow. For "Dr. Smith teaches Machine Learning 101, which belongs to the CS department": `(smith:Professor {name: "Dr. Smith"})-[:TEACHES]->(ml101:Course {title: "Machine Learning 101"})-[:BELONGS_TO]->(cs:Department {name: "CS"})`. Node labels: `Professor`, `Course`, `Department`; relationship types: `TEACHES`, `BELONGS_TO` (both directed). Properties like `name`/`title` sit inside the node braces; a relationship could carry its own property too (e.g. `[:TEACHES {term: "Fall"}]`). Common wrong answer: putting the relationship type in parentheses or omitting the arrow direction — Cypher needs `-[:TYPE]->` for a directed, typed relationship.

kgr-1-cypher-pattern kg-fundamentals

Which Cypher pattern correctly expresses “a Person ACTED_IN a Movie”?

Why

Cypher puts nodes in parentheses and relationships in square brackets with a typed arrow: (p:Person)-[:ACTED_IN]->(m:Movie). The variant that swaps the brackets (nodes in [], relationship in ()) inverts the notation; the =>{...}=> form isn’t Cypher syntax; and wrapping the relationship type in plain parentheses omits the square brackets that mark a relationship. The bracket convention is what makes the pattern read like the graph it matches.

  1. (p:Person)-[:ACTED_IN]->(m:Movie)
  2. [p:Person]-(:ACTED_IN)->[m:Movie]
  3. (p:Person)=>{ACTED_IN}=>(m:Movie)
  4. (p:Person)-(ACTED_IN)->(m:Movie)

Correct: (p:Person)-[:ACTED_IN]->(m:Movie)

kgr-1-define-components kg-fundamentals

Define the four core components of a knowledge graph (node, relationship, property, label) and give one example of each from a movie dataset.

Node — a data record representing an entity, e.g. a `Movie` node (graph-theory: vertex). Relationship — a directed, typed connection between two nodes, e.g. `ACTED_IN` from a Person to a Movie (graph-theory: edge, but richer). Property — a key-value pair on a node or relationship, e.g. a Movie's `title` or `released` year, or a relationship's `since`. Label — a tag grouping similar nodes, e.g. `Person` or `Movie`. Together: labelled nodes carry properties and are joined by typed, directed relationships that also carry properties. Common wrong answer: "a relationship is just an edge with no extra information" — relationships carry a type, direction, and their own properties, which is the whole reason the term is preferred.

kgr-1-design-rec-graph kg-fundamentals

Model a small knowledge graph for an e-commerce “customers who bought X also bought Y” recommender: name the node labels, the relationship type(s), and the traversal that produces a recommendation.

Nodes (labels + properties): `Customer` (name, id), `Product` (title, category, price). One relationship type: `(:Customer)-[:PURCHASED {date}]->(:Product)`. The recommendation "customers who bought X also bought Y" is a multi-hop traversal: from product X, go *back* along `PURCHASED` to the customers who bought it, then *forward* along their other `PURCHASED` relationships to the products Y they also bought, ranked by frequency — i.e. `(x:Product)<-[:PURCHASED]-(c:Customer)-[:PURCHASED]->(y:Product)`. No separate "co-purchase" table is needed; the pattern is the recommendation. You could enrich it with a `VIEWED` relationship or `Category` nodes for content-based fallbacks. Common wrong answer: precomputing a co-purchase table — that's the relational workaround the graph traversal replaces, and it goes stale as purchases arrive.
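The two-hop traversal can be mimicked in plain Python to see why no co-purchase table is needed. This is only an in-memory sketch: the `PURCHASED` edge list and the customer and product names are made-up sample data, and a real graph would run the Cypher pattern instead.

```python
from collections import Counter

# Hypothetical PURCHASED edges as (customer, product) pairs; sample data only.
PURCHASED = [
    ("alice", "laptop"), ("alice", "mouse"), ("alice", "dock"),
    ("bob", "laptop"), ("bob", "mouse"),
    ("carol", "laptop"), ("carol", "bag"),
    ("dave", "phone"), ("dave", "mouse"),
]


def also_bought(product, edges):
    """Rank the other products bought by customers who bought `product`."""
    # Hop 1: walk PURCHASED backwards, from the product to its buyers.
    buyers = {customer for customer, item in edges if item == product}
    # Hop 2: walk PURCHASED forwards, from those buyers to their other products.
    counts = Counter(
        item for customer, item in edges
        if customer in buyers and item != product
    )
    # Most frequently co-purchased products first.
    return counts.most_common()


recommendations = also_bought("laptop", PURCHASED)
print(recommendations)
```

Because the ranking is computed from the edges at query time, a new purchase shows up in the next recommendation without any table to refresh.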

kgr-1-edge-vs-relationship kg-fundamentals

What distinguishes a knowledge-graph “relationship” from a graph-theory “edge”?

Why

A relationship carries a type, a direction, and its own properties: it’s a data record, not a bare connection, which is exactly why the KG vocabulary prefers the term. It does not require both endpoints to share a label (relationships routinely connect different labels, such as a Person to a Movie). Relationships are directed (not undirected), and the difference is substantive, not just a rebrand: the richness is what enables traversal-based retrieval.

  1. A relationship is a rich record with a type, direction, and its own properties
  2. A relationship can only connect nodes that share the same label
  3. An edge is directed, whereas a relationship is undirected
  4. They are identical; "relationship" is only a rebranding of "edge"

Correct: A relationship is a rich record with a type, direction, and its own properties

kgr-1-identity-explain kg-fundamentals

Explain the “identity through relationships” pattern and give the case where it breaks down.

"Identity through relationships" means a node's role is encoded by the typed relationships it participates in, not by extra labels or a type field. A `Person` is an actor because it has an `ACTED_IN` relationship, a director because it has `DIRECTED` — the same node, two roles, no `Actor`/`Director` labels. This keeps the schema clean and avoids "label explosion" (a new label for every role). Where it breaks: when a role needs role-specific properties — an actor's per-film salary vs a director's budget — those must then live on the relationship, which can get unwieldy at scale, so sometimes a role does warrant its own modelling. Common wrong answer: "add an Actor label and a Director label" — that duplicates information the relationships already encode and multiplies labels.

kgr-1-identity-relationships kg-fundamentals

A Person is both an actor and a director. Following “identity through relationships,” how do you model this?

Why

The role comes from the relationships: one Person node that has both an ACTED_IN and a DIRECTED relationship is, by that fact, both an actor and a director — no role labels needed. Adding Actor/Director labels invites “label explosion”; splitting into two nodes duplicates the same person; and a roles property throws away the connections (you’d lose which movie they acted in or directed) that the relationships encode.

  1. Add `Actor` and `Director` labels to the Person node
  2. Create two separate Person nodes, one for each role
  3. Keep one Person node; its `ACTED_IN` and `DIRECTED` relationships define both roles
  4. Store a `roles` list property on the node and drop the relationships

Correct: Keep one Person node; its `ACTED_IN` and `DIRECTED` relationships define both roles
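A small in-memory sketch of the same idea: roles are derived from the relationship types a person participates in, rather than stored as labels. The triples, the placeholder names, and the `ROLE_BY_TYPE` mapping are hypothetical illustrations, not part of any dataset.

```python
# Hypothetical relationship triples: (person, relationship type, movie).
RELATIONSHIPS = [
    ("Person A", "ACTED_IN", "Movie X"),
    ("Person A", "DIRECTED", "Movie Y"),
    ("Person B", "ACTED_IN", "Movie X"),
]

# Assumed mapping from relationship type to the role it implies.
ROLE_BY_TYPE = {"ACTED_IN": "actor", "DIRECTED": "director"}


def roles_of(person, relationships):
    """Derive a person's roles from the relationships they take part in."""
    return {
        ROLE_BY_TYPE[rel_type]
        for who, rel_type, _movie in relationships
        if who == person and rel_type in ROLE_BY_TYPE
    }


def movies_for(person, rel_type, relationships):
    """The relationship also keeps *which* movie, which a roles list would lose."""
    return [m for who, t, m in relationships if who == person and t == rel_type]


print(roles_of("Person A", RELATIONSHIPS))
print(movies_for("Person A", "DIRECTED", RELATIONSHIPS))
```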

kgr-1-kg-helps-rag kg-fundamentals

Why does a knowledge graph enhance RAG beyond vector similarity search?

Why

The graph adds a retrieval mode embeddings lack: traversal of typed relationships, which reaches entities connected to a result (a company’s investors, a chunk’s neighbouring sections) rather than merely similar text. It isn’t about compressing embeddings, it doesn’t swap semantic search for keyword matching, and it doesn’t remove the LLM — generation still happens; the graph just supplies richer, connection-aware context.

  1. It stores embeddings more compactly than a dedicated vector database
  2. It lets retrieval traverse typed relationships to reach connected entities
  3. It replaces the embedding model with exact keyword matching
  4. It removes the need for an LLM at generation time

Correct: It lets retrieval traverse typed relationships to reach connected entities

kgr-1-kg-vs-relational kg-fundamentals

Which workload most favours a knowledge graph over a relational database?

Why

The graph wins on relationship complexity — multi-hop questions over interconnected entities, where each hop would be another SQL self-JOIN but is a cheap traversal in the graph. Tabular sales data with aggregations and primary-key lookups fits a relational database perfectly, and “semantically similar documents” is a vector-search task — neither needs graph traversal. Choose the KG for connection-heavy questions, not merely because relationships exist.

  1. 10M rows of tabular sales data queried with simple aggregations
  2. A stable schema queried almost entirely by primary key
  3. Finding semantically similar documents by embedding distance
  4. Multi-hop questions over highly interconnected entities (who-supplies-whom, many hops)

Correct: Multi-hop questions over highly interconnected entities (who-supplies-whom, many hops)

kgr-1-kg-vs-relational-scenario kg-fundamentals

Give a scenario where you would choose a knowledge graph over a relational database, and justify it on schema flexibility, relationship complexity, and query pattern.

Decide on relationship complexity, not data size. For a fraud-detection system over accounts, transactions, devices, and shared addresses — where the valuable questions are multi-hop ("which accounts are linked through a shared device to a known-fraud account, within 3 hops?") — a knowledge graph wins: each hop is a cheap traversal, where the relational version is a pile of self-JOINs that explode with depth. Schema flexibility helps too (new link types appear as fraud evolves), and native vector+graph lets you combine similarity with traversal. Conversely, if the data were a flat ledger queried by account id with no multi-hop questions, a relational database would be simpler. Justify on: relationship complexity (multi-hop → graph), schema flexibility (evolving link types → graph), query pattern (traversal vs lookup). Common wrong answer: "it has 50M rows, so use a relational DB for scale" — size isn't the axis; relationship complexity is.
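To make the "within 3 hops" question concrete, here is a minimal breadth-first search with a hop limit over a toy link map. The account, device, and address identifiers are invented for illustration, and a graph database would express this as a bounded variable-length pattern rather than hand-written search.

```python
from collections import deque

# Hypothetical undirected links between accounts, devices, and addresses.
LINKS = {
    "acct:1": {"device:A"},
    "device:A": {"acct:1", "acct:2"},
    "acct:2": {"device:A", "addr:Z"},
    "addr:Z": {"acct:2", "acct:FRAUD"},
    "acct:FRAUD": {"addr:Z"},
}


def within_hops(start, target, links, max_hops):
    """Return the hop count from start to target if it is <= max_hops, else None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in links.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None


print(within_hops("acct:2", "acct:FRAUD", LINKS, max_hops=3))
print(within_hops("acct:1", "acct:FRAUD", LINKS, max_hops=3))
```

Each extra hop here is one more loop over neighbours; the relational equivalent would add another self-JOIN per hop.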

kgr-1-traversal-vs-vector kg-fundamentals

Explain why a knowledge graph enhances RAG beyond vector similarity search, using a “who invested in NetApp?”-style question to make the point concrete.

Vector search retrieves text that is *semantically similar* to the query — great for "find me passages about NetApp's business." But the question "who invested in NetApp?" needs to *follow connections*: from the NetApp chunk to its filing, to the companies that filed ownership, to the investment firms. Embeddings can't traverse those typed relationships, so a similar-text match never reaches the investors. A knowledge graph adds traversal-based retrieval: retrieve a chunk by vector search, then walk relationships to expand context with connected entities. The two are complementary — similarity for "what's like this," traversal for "what's connected to this." Common wrong answer: "vector search can find the investors if the embeddings are good enough" — no embedding encodes a multi-hop ownership chain; that requires traversal.

Chapter 2

kgr-2-coactor-direction cypher-querying

Which pattern finds Tom Hanks’s co-actors — people who acted in the same movie?

Why

Both actors point into the shared movie, so the arrows must converge on m: forward from Tom (-[:ACTED_IN]->) and backward from the co-actor (<-[:ACTED_IN]-). The variant with both arrows pointing the same way would require the movie to itself act in a movie; the one starting (tom)<-[:ACTED_IN]-(m) says a movie acted in Tom (wrong direction); and the single-hop (tom)-[:ACTED_IN]-(coActor) skips the shared movie entirely. The convergent two-hop through the movie is the co-actor pattern.

  1. (tom)-[:ACTED_IN]->(m)-[:ACTED_IN]->(coActor)
  2. (tom)<-[:ACTED_IN]-(m)-[:ACTED_IN]->(coActor)
  3. (tom)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor)
  4. (tom)-[:ACTED_IN]-(coActor)

Correct: (tom)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor)

kgr-2-connection cypher-querying

Explain how the LangChain Neo4j Graph class connects a Python application to Neo4j: the three parameters it needs and the method used to run Cypher.

LangChain's `Neo4jGraph` class wraps a Neo4j driver. You construct it with three parameters — a connection `url` (the Neo4j URI), a `username`, and a `password` — loaded from environment variables or a `.env` file, never hard-coded. It then exposes a `query()` method: pass a Cypher string (optionally with a `params` dict for parameterised queries) and it returns the result as a list of Python dictionaries, handling connection pooling and authentication for you. That's the bridge from Python application code to the graph. Common wrong answer: "pass the credentials inline in the query string" — credentials belong in the connection (from the environment), not in Cypher.
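A minimal sketch of that connection flow. The environment variable names (`NEO4J_URI`, `NEO4J_USERNAME`, `NEO4J_PASSWORD`) and the helper functions are assumptions for illustration, and the import path for `Neo4jGraph` depends on the installed LangChain version, so treat it as one plausible layout rather than the required one.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read the three connection parameters from the environment."""
    # Variable names are an assumption; use whatever your deployment defines.
    return {
        "url": env["NEO4J_URI"],
        "username": env["NEO4J_USERNAME"],
        "password": env["NEO4J_PASSWORD"],
    }


def connect(settings):
    """Open the connection; needs LangChain installed and a reachable Neo4j."""
    # Imported lazily so the rest of this sketch runs without the dependency.
    from langchain_community.graphs import Neo4jGraph
    return Neo4jGraph(**settings)


def find_person(kg, name):
    """Run a parameterised Cypher query; returns a list of dictionaries."""
    return kg.query(
        "MATCH (p:Person {name: $name}) RETURN p.name AS name",
        params={"name": name},
    )
```

Typical use would be `kg = connect(load_neo4j_settings())` followed by `find_person(kg, "Tom Hanks")`; the credentials travel through the connection and the name through `params`, so neither is spliced into the Cypher text.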

kgr-2-create-vs-merge cypher-querying

A bulk-load pipeline retries a batch and re-runs a relationship insert. Which clause is correct, and why?

Why

MERGE checks for the relationship and creates it only if it’s missing, so re-processing the same batch leaves the graph unchanged, which is the idempotency a retry-prone pipeline needs. CREATE does skip the existence check, but that’s exactly why it duplicates on every retry; LIMIT bounds returned rows, not writes, so it doesn’t prevent duplicate creation; and delete-then-create is wasteful and races with concurrent readers. MERGE is the production default for relationships.

  1. CREATE — it's faster because it skips the existence check
  2. MERGE — it creates the relationship only if absent, so retries don't duplicate
  3. CREATE with a LIMIT 1 to avoid making duplicates
  4. DELETE then CREATE on every run to stay consistent

Correct: MERGE — it creates the relationship only if absent, so retries don't duplicate

kgr-2-delete-node cypher-querying

You run a plain DELETE on a Person node that still has an ACTED_IN relationship. What happens, and what’s the fix?

Why

A plain DELETE on a node that still has relationships fails — Neo4j won’t leave dangling relationships, so it protects referential integrity. The fix is to delete the relationships first, or use DETACH DELETE to remove the node and its relationships in one step. Neo4j does not auto-remove the relationships, leave them orphaned, or soft-delete — it refuses the operation outright.

  1. The node and all its relationships are removed automatically
  2. The node is deleted but its relationships are left orphaned
  3. The DELETE fails; remove the relationships first, or use DETACH DELETE
  4. Neo4j converts it to a soft-deleted node flagged as inactive

Correct: The DELETE fails; remove the relationships first, or use DETACH DELETE

kgr-2-detach-delete cypher-querying

You need to remove a Person node that still has relationships. Show how (and why a plain DELETE fails).

A plain `DELETE` on a node that still has relationships fails, because Neo4j won't leave dangling relationships (referential integrity). Two fixes: delete the relationships first (`MATCH (p:Person {name: $name})-[r]-() DELETE r`, then delete the node), or — simpler — use `MATCH (p:Person {name: $name}) DETACH DELETE p`, which removes the node and all its relationships in one atomic step. Use DETACH DELETE whenever you're dropping a connected node. Common wrong answer: "DELETE removes the node and silently drops its edges" — it doesn't; it refuses the operation rather than orphan relationships.
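The rule can be illustrated with a toy in-memory graph. `TinyGraph` is a hypothetical class written for this sketch; it models the behaviour described above and says nothing about how Neo4j implements it internally.

```python
class TinyGraph:
    """Toy model of the DELETE rule; not Neo4j's implementation."""

    def __init__(self):
        self.nodes = set()
        self.relationships = set()  # (start, type, end) triples

    def delete(self, node):
        # Plain DELETE: refuse while any relationship still touches the node.
        if any(node in (start, end) for start, _type, end in self.relationships):
            raise ValueError(f"cannot delete {node!r}: it still has relationships")
        self.nodes.discard(node)

    def detach_delete(self, node):
        # DETACH DELETE: drop the node's relationships first, then the node.
        self.relationships = {
            rel for rel in self.relationships if node not in (rel[0], rel[2])
        }
        self.nodes.discard(node)


graph = TinyGraph()
graph.nodes.update({"person", "movie"})
graph.relationships.add(("person", "ACTED_IN", "movie"))

try:
    graph.delete("person")
    plain_delete_failed = False
except ValueError:
    plain_delete_failed = True  # the node is still connected, so DELETE refuses

graph.detach_delete("person")
```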

kgr-2-execution-order cypher-querying

What is the order of the four stages a Cypher query passes through?

Why

Parse → plan → execute → project. The string is first parsed into an AST; the optimiser then plans (choosing indexes and scan order) before anything runs; the engine executes the plan by traversing the graph; finally RETURN projects the requested fields. You can’t plan before parsing, can’t execute before planning, and projection is last — it shapes the output of a completed match. (EXPLAIN stops after planning; PROFILE runs through execution.)

  1. plan → parse → project → execute
  2. execute → parse → plan → project
  3. parse → execute → plan → project
  4. parse → plan → execute → project

Correct: parse → plan → execute → project

kgr-2-match-where cypher-querying

Write a Cypher query that returns the name and birth year of every Person born before 1960. Name which clause does the filtering and why it can’t go inside the node’s braces.

`MATCH (p:Person) WHERE p.born < 1960 RETURN p.name, p.born`. The `MATCH` binds every `Person` node, the `WHERE` filters to those born before 1960, and `RETURN` projects just the name and birth year (dot notation) rather than the whole node. A range like `< 1960` requires `WHERE`; inline `{born: 1960}` would only match the exact value 1960. Common wrong answer: putting the comparison inside the node braces, e.g. `(p:Person {born < 1960})` — inline property matching does equality only, not ranges.

kgr-2-merge-explain cypher-querying

Compare CREATE and MERGE for a graph-construction pipeline that may retry batches. Explain why MERGE gives idempotency and the one thing that can still make MERGE duplicate.

CREATE inserts unconditionally; MERGE checks whether the element already exists and creates it only if absent (MATCH-then-create in one atomic step). In a pipeline that may re-process a batch (API retries, replays, duplicate messages), CREATE produces duplicate nodes/relationships on each run (or errors against a uniqueness constraint), whereas MERGE leaves the graph unchanged on re-runs, giving idempotent, exactly-once semantics. The full pattern: a uniqueness constraint on the key + MERGE on that stable identifier + `ON CREATE SET` (initial properties) and `ON MATCH SET` (volatile updates like a timestamp). Crucially, MERGE on a *volatile* property creates a new element each run, because the changed value no longer matches the existing element, so always MERGE on the stable id. Common wrong answer: "use CREATE and dedupe later". That lets duplicates into the graph and pushes the cost downstream; MERGE prevents them at write time.
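The difference under a retry can be simulated with two small functions. This is a toy model over Python lists, with hypothetical helper names and sample records; it mirrors the CREATE and MERGE semantics described above rather than calling a database.

```python
def create(store, key, props):
    """CREATE-like write: always appends, so a retried batch duplicates."""
    store.append({"id": key, **props})


def merge(store, key, on_create, on_match):
    """MERGE-like write: match on the stable id, create only if absent."""
    for element in store:
        if element["id"] == key:
            element.update(on_match)  # plays the role of ON MATCH SET
            return element
    element = {"id": key, **on_create}  # plays the role of ON CREATE SET
    store.append(element)
    return element


batch = [("c1", "Alice"), ("c2", "Bob")]  # made-up sample records
created, merged = [], []

for attempt in range(2):  # the second pass simulates a retried batch
    for key, name in batch:
        create(created, key, {"name": name})
        merge(merged, key, on_create={"name": name}, on_match={"seen_again": True})

print(len(created), len(merged))
```

The merge only stays idempotent because it matches on `id`; matching on a value that changes between runs would fall through to the create branch every time.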

kgr-2-multihop-design cypher-querying

Design a Cypher query that finds every actor who appeared in a movie directed by a given person. Explain the arrow directions and how you’d keep the result set bounded.

Two relationship hops share the movie node: `MATCH (d:Person {name: $director})-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(actor:Person) RETURN actor.name, m.title`. The `DIRECTED` arrow points forward from the director into the movie; the `ACTED_IN` arrow points backward from the actor into the same movie, so the shared `Movie` is the join. To bound the result on a high-degree director, add a `LIMIT` or filter (e.g. by `released` year), since a director with many films each with many actors can return a large cross-product. Common wrong answer: pointing both arrows the same direction. The pattern then asks for an actor the *movie* acted in, which has no match.

kgr-2-naming cypher-querying

Why should Cypher variable names be meaningful, given the engine runs the query identically regardless of the names?

The engine ignores variable names — it runs `(n1)-[:ACTED_IN]->(n2)` exactly as it runs `(tom)-[:ACTED_IN]->(movie)` — so meaningful names cost nothing at runtime. Their entire value is for humans: a pattern named for what each variable holds (`tom`, `coActor`, `m` for Movie) reads as its own documentation, while `n1/n2/n3` force the reader to reconstruct the intent, and the cost compounds in multi-hop queries. Readable Cypher is a maintainability investment paid back every time someone (often you, months later) revisits the query. Common wrong answer: "meaningful names make the query run faster" — names have no effect on execution; the benefit is purely readability.

kgr-2-neo4jgraph cypher-querying

How does LangChain’s Neo4jGraph class let a Python application run Cypher?

Why

Neo4jGraph wraps a Neo4j connection built from a url, username, and password (loaded from the environment) and exposes a query() method that sends a Cypher string and returns the result as Python dictionaries. It does not transpile Cypher to SQL (it talks to Neo4j directly), does not embed an in-memory graph (it connects to a real database), and does not author Cypher from natural language — that’s Module 7’s LLM-generated-Cypher topic, a separate capability.

  1. It wraps a connection (url/user/password), exposing query() for Cypher
  2. It transpiles Cypher into SQL for a relational backend
  3. It embeds the graph in memory so no database is needed
  4. It generates the Cypher from natural language for you

Correct: It wraps a connection (url/user/password), exposing query() for Cypher

kgr-2-readable-query cypher-querying

Rewrite this query for readability and explain what improved: MATCH (a)-[:ACTED_IN]->(b)<-[:DIRECTED]-(c) RETURN a.name, c.name.

Rewrite `MATCH (a)-[:ACTED_IN]->(b)<-[:DIRECTED]-(c) RETURN a.name, c.name` with names that say what each variable is: `MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)<-[:DIRECTED]-(director:Person) RETURN actor.name, director.name`. Adding the `Person`/`Movie` labels also lets the planner narrow the search, and the structure now reads as "an actor's movie, also directed by a director." The intent is far clearer, and the labels can improve the plan. Common wrong answer: leaving the single-letter variables and adding a comment instead. A comment drifts out of date, whereas self-describing variable names stay correct because they're part of the query.

kgr-2-trace-execution cypher-querying

Trace the execution of MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title LIMIT 5 through the four stages. Say what the engine does at each.

For `MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title LIMIT 5`: Parse — the string becomes an AST capturing the pattern (a `Person` linked by `ACTED_IN` to a `Movie`) and the projection. Plan — the optimiser uses the `Person`/`Movie` labels (and any index) to choose a starting point and an expansion strategy. Execute — the engine finds `Person` nodes, traverses their outgoing `ACTED_IN` relationships to `Movie` nodes, binding `p` and `m`, and stops early once 5 matches satisfy `LIMIT`. Project — it returns only `p.name` and `m.title` for those 5 rows. The flow is pattern-spec → match → projection, with the plan deciding *how* the match is done. Common wrong answer: "it scans every Person then filters at the end" — the planner uses labels/indexes and the `LIMIT` to avoid scanning the whole graph.
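The early stop at `LIMIT` can be pictured as a lazy, row-at-a-time pipeline. The sketch below is a toy model built from a Python generator and made-up data; the real engine's operators differ, and clauses such as `ORDER BY` would force it to look at more rows before returning.

```python
from itertools import islice

# Hypothetical data: 1,000 people, each with one ACTED_IN relationship.
ACTED_IN = {f"person-{i}": [f"movie-{i % 10}"] for i in range(1000)}
visited = []  # records which Person nodes the toy "engine" touched


def match_rows(acted_in):
    """Execute stage: lazily expand each person to their movies."""
    for person, movies in acted_in.items():
        visited.append(person)
        for movie in movies:
            # Project stage: keep only the two requested fields.
            yield {"p.name": person, "m.title": movie}


# LIMIT 5: stop pulling rows from the pipeline once five have been produced.
rows = list(islice(match_rows(ACTED_IN), 5))

print(len(rows), "rows from", len(visited), "visited people")
```

Only a handful of the 1,000 people are ever visited, which is the point of the "common wrong answer" above.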

kgr-2-where-vs-match cypher-querying

Which Cypher query returns all movies released after 2000?

Why

A range comparison needs a WHERE clause: MATCH (m:Movie) WHERE m.released > 2000 RETURN m. Inline {...} property matching only does equality, so {released > 2000} isn’t valid syntax. Cypher has no RETURN ... IF construct, and the pattern clause is MATCH, not FILTER. Use inline {} for equality and WHERE for ranges and boolean logic.

  1. MATCH (m:Movie {released > 2000}) RETURN m
  2. MATCH (m:Movie) WHERE m.released > 2000 RETURN m
  3. MATCH (m:Movie) RETURN m IF m.released > 2000
  4. FILTER (m:Movie) WHERE m.released > 2000 RETURN m

Correct: MATCH (m:Movie) WHERE m.released > 2000 RETURN m

Chapter 3

kgr-3-colocation text-for-rag

What is the defining advantage of storing embeddings inside Neo4j rather than in a separate vector database?

Why

Co-location means embeddings live as node properties in the same store as the graph, so a single Cypher query can do vector search and traverse relationships — retrieve a chunk, then follow its connections — in one round trip, with one system to run. It doesn’t make embedding computation faster (that’s still the API), doesn’t remove the need for an embedding model, and doesn’t make the similarity math more accurate — the win is the combined query.

  1. Neo4j computes embeddings faster than the OpenAI API
  2. It removes the need for an embedding model entirely
  3. Vector similarity search is inherently more accurate inside Neo4j
  4. One query can combine vector similarity with graph traversal in a single round trip

Correct: One query can combine vector similarity with graph traversal in a single round trip

kgr-3-cosine-dims text-for-rag

Your embeddings are 1536-dimensional (OpenAI) but the vector index was created with 768 dimensions. What happens, and why is cosine the recommended similarity function?

Why

The index dimension must equal the embedding dimension, so a 768-dim index can’t match 1536-dim vectors — the query fails with a dimension-mismatch error (there’s no automatic projection or padding). Cosine is recommended because it compares vector direction and normalises away magnitude, which is the right notion of similarity for embeddings (and the choice OpenAI documents). It isn’t a speed-vs-accuracy trade, and the index never silently changes its configured function.

  1. It works fine; Neo4j projects the 1536-dim vectors down to 768 automatically to fit
  2. The query errors on the dimension mismatch; cosine is recommended because it normalises away magnitude
  3. It works but runs slower, and cosine is chosen mainly because it computes faster than Euclidean
  4. The index silently switches to Euclidean distance to accommodate the larger vectors

Correct: The query errors on the dimension mismatch; cosine is recommended because it normalises away magnitude

kgr-3-cosine-explain text-for-rag

Why is cosine similarity recommended for OpenAI embeddings, what does a score of ~0.89 mean, and why must the index dimension match the model?

Cosine similarity measures the angle between two vectors and ignores their magnitude ($\cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$), so it captures *direction*, the right notion of semantic closeness for embeddings, and is OpenAI's recommended function; for normalised vectors it also ranks identically to Euclidean. A cosine score of ~0.89 on normalised embeddings is a strong semantic match (with ~0.75 moderate, below 0.5 weak), though scores are relative to the model and corpus. And the index's declared dimension must equal the model's output (1536 for OpenAI); a mismatch makes the query fail with a dimension error. Common wrong answer: "a higher dimension always means better search". More dimensions cost more storage/time and only help if the model actually uses them well.
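Both claims (magnitude is ignored, and normalised vectors rank the same under Euclidean distance) can be checked numerically. The vectors below are arbitrary small examples rather than real embeddings, and the second check relies on the identity $\lVert a-b\rVert^2 = 2 - 2\cos\theta$ for unit vectors.

```python
import math


def cosine(a, b):
    """cos(theta) = (A . B) / (|A| |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 3.0]  # arbitrary example vectors, not real embeddings
b = [2.0, 1.0, 0.0]

# Scaling a vector changes its magnitude but not its direction.
same_direction = cosine(a, [10 * x for x in a])

# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine,
# so sorting by distance and sorting by cosine give the same ranking.
unit_distance_sq = euclidean(normalise(a), normalise(b)) ** 2
identity_rhs = 2 - 2 * cosine(a, b)

print(same_direction, unit_distance_sq, identity_rhs)
```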

kgr-3-create-index text-for-rag

Write the Cypher to create a vector index on a Chunk node’s textEmbedding property (1536 dimensions, cosine), and say which status from SHOW INDEXES means it’s ready to query.

``CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS FOR (c:Chunk) ON (c.textEmbedding) OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}``. The dimension (1536) must match the embedding model's output exactly, and `cosine` is the recommended similarity function for OpenAI embeddings. `IF NOT EXISTS` makes creation idempotent. It is ready to query when `SHOW INDEXES` reports its status as **ONLINE**; while it is `POPULATING`, searches return incomplete or empty results. Common wrong answer: querying immediately after creating it. You must wait for ONLINE, or the results are wrong.
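Waiting for ONLINE can be automated with a small polling loop. This sketch assumes a `run_query` callable that takes a Cypher string and returns rows as dictionaries (for example a bound `kg.query`), and it assumes the readiness column is exposed as `state`, as in recent Neo4j versions; `wait_until_online` itself is a hypothetical helper.

```python
import time


def wait_until_online(run_query, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll SHOW INDEXES until the named index reports ONLINE.

    Returns True once the index is ONLINE, False on FAILED or timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        rows = run_query("SHOW INDEXES YIELD name, state")
        # Filter in Python so the sketch does not depend on SHOW ... WHERE support.
        state = next((r["state"] for r in rows if r["name"] == index_name), None)
        if state == "ONLINE":
            return True
        if state == "FAILED" or time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)  # still POPULATING (or not yet visible): try again
```

With a live connection this might be called as `wait_until_online(kg.query, "chunk_embeddings")` before the first similarity search.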

kgr-3-encode-search-flow text-for-rag

Describe the four stages of a Neo4j vector similarity search (encode → search → score → return) and the one consistency rule that makes the results meaningful.

Four stages: encode — turn the user's question into a vector with the same embedding model that produced the stored vectors; search — `db.index.vector.queryNodes` finds the top-K nodes whose vectors are nearest the query vector; score — each result carries a cosine similarity (0–1), higher = closer; return — the matched nodes' text and metadata become the LLM's context. The one consistency rule that makes it work: query and corpus must be embedded by the *same* model into the *same* space. Parameterise the question, key, and topK so the plan caches and secrets stay out of the string. Common wrong answer: embedding the query with a different model than the stored text — the vectors then live in different spaces and the scores are noise.
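The four stages fit in one parameterised query. The sketch below is one plausible way to write it: the index name `chunk_embeddings`, the `text` property, and the Python helper names are assumptions, while `genai.vector.encode` and `db.index.vector.queryNodes` are the procedures named above. `kg` stands for any object with a `query(query, params=...)` method, such as a `Neo4jGraph`.

```python
VECTOR_SEARCH = """
WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS questionEmbedding
CALL db.index.vector.queryNodes($indexName, $topK, questionEmbedding)
YIELD node, score
RETURN node.text AS text, score
"""


def search_params(question, api_key, index_name="chunk_embeddings", top_k=5):
    """Everything that varies goes in params, never into the query text."""
    return {
        "question": question,      # encode: embedded with the same provider as the corpus
        "openAiApiKey": api_key,   # the secret stays out of the Cypher string
        "indexName": index_name,   # search: which vector index to query
        "topK": top_k,             # search: how many nearest neighbours to return
    }


def vector_search(kg, question, api_key, **options):
    """Return rows of {text, score}, ready to be used as LLM context."""
    return kg.query(VECTOR_SEARCH, params=search_params(question, api_key, **options))
```

Because the query text never changes between calls, the same cached plan can serve every question.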

kgr-3-encode-store text-for-rag

Write a Cypher query that generates an OpenAI embedding for each Chunk node’s text and stores it on the node. Name the two procedures involved and where the API key goes.

Match the nodes, encode their text, and store the vector on each node: `MATCH (c:Chunk) WHERE c.text IS NOT NULL WITH c, genai.vector.encode(c.text, "OpenAI", {token: $openAiApiKey}) AS v CALL db.create.setNodeVectorProperty(c, "textEmbedding", v)`. The `WHERE ... IS NOT NULL` skips chunks with no text; `genai.vector.encode` returns the embedding (text, provider, config with the token as a parameter); `setNodeVectorProperty` writes it onto the node so it's co-located with the graph. Pass `$openAiApiKey` via `params`, never in the string. Then verify with `size(c.textEmbedding)` = 1536. Common wrong answer: returning the vector to Python and writing it back in a second round trip — `setNodeVectorProperty` stores it in the same query.

kgr-3-genai-encode text-for-rag

What does Neo4j’s genai.vector.encode function do?

Why

genai.vector.encode calls an external provider (e.g. OpenAI) from within Cypher, taking the text, provider name, and a config map (with the API token), and returns the embedding as a float array — which you then store with db.create.setNodeVectorProperty. Creating the index is CREATE VECTOR INDEX; the search is db.index.vector.queryNodes; and it calls a hosted model, it doesn’t train one. It is the generate-an-embedding step, not the index or the search.

  1. It calls an external embedding API from inside Cypher, returning the embedding
  2. It creates the vector index on a node property
  3. It performs the nearest-neighbour search and returns scored nodes
  4. It trains a custom embedding model on your graph data

Correct: It calls an external embedding API from inside Cypher, returning the embedding

kgr-3-params-security text-for-rag

Rewrite this insecure call to use a query parameter, and name two benefits: kg.query(f'... {{ token: "{api_key}" }} ...').

Rewrite `kg.query(f'... {{ token: "{api_key}" }} ...')` as `kg.query("... { token: $openAiApiKey } ...", params={"openAiApiKey": api_key})`. The f-string version interpolates the secret into the query text, where it can leak into logs, error messages, and the query cache (and opens an injection hole for any user-derived value). The parameterised version keeps the key out of the string entirely. Two benefits: security (no leakage, no injection) and performance (Neo4j caches the execution plan for the parameterised query and reuses it across calls). Caveat: DDL like `CREATE INDEX` can't take parameters, so sanitise those manually. Common wrong answer: "log the query for debugging, key and all" — that's exactly how keys leak.
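The leak is easy to see by building both query strings and checking where the secret ends up. The key below is a made-up placeholder, and the surrounding Cypher is a shortened stand-in for the elided parts of the original call.

```python
API_KEY = "sk-example-not-a-real-key"  # placeholder secret for the demonstration

# Unsafe: the f-string bakes the secret into the query text itself.
unsafe_query = (
    f'RETURN genai.vector.encode("hello", "OpenAI", {{ token: "{API_KEY}" }}) AS v'
)

# Safe: the query text only names a parameter; the value travels separately.
safe_query = 'RETURN genai.vector.encode("hello", "OpenAI", { token: $openAiApiKey }) AS v'
safe_params = {"openAiApiKey": API_KEY}

# Anything that logs or caches the query text sees the key only in the unsafe form.
print("key in unsafe query text:", API_KEY in unsafe_query)
print("key in safe query text:  ", API_KEY in safe_query)
```

The safe pair would then be sent as `kg.query(safe_query, params=safe_params)`.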

kgr-3-params-why text-for-rag

Why pass an API key into a Cypher query as a parameter ($openAiApiKey) rather than an f-string?

Why

A parameter keeps the secret out of the query string — so it can’t leak into logs, error messages, or the query cache, and there’s no injection surface — and it lets Neo4j reuse a cached execution plan across calls. Typing speed is irrelevant; you should not store API keys in the graph; and you technically can build a query string with interpolation (that’s exactly the unsafe thing to avoid), so it isn’t that Cypher forbids it.

  1. f-strings are slower to type than parameters
  2. Parameters let you store the API key inside the graph for reuse
  3. Parameters keep the key out of the query string and let Neo4j cache the plan
  4. Cypher doesn't support string interpolation, so parameters are the only option

Correct: Parameters keep the key out of the query string and let Neo4j cache the plan

kgr-3-score-interpret text-for-rag ↑ in bank

A cosine similarity search returns a flat band of scores (0.78–0.82) with no clear winner. What does this most likely indicate?

Why

A flat band means no single stored item stands out — usually an ambiguous query or a corpus without a strong match — so you read the distribution (a clear winner like 0.92 vs 0.75 would signal confidence) and respond with a threshold or hybrid search. It is not evidence of a corrupt index or broken cosine math (those tend to give zero/garbage results, not a tidy band), and a flat band is the opposite of high-quality discrimination.

  1. The vector index is corrupt and must be rebuilt
  2. The query is ambiguous or the corpus lacks a strong match
  3. The embeddings are unusually high quality
  4. Cosine similarity is malfunctioning

Correct: The query is ambiguous or the corpus lacks a strong match

kgr-3-score-threshold text-for-rag ↑ in bank

A similarity search for “renewable energy initiatives” returns five chunks scored 0.91, 0.88, 0.84, 0.82, 0.79 on related-but-varying topics. Which are relevant, what threshold would you set, and how would you raise precision without changing the embedding model?

Read relevance from the score gap, not absolute values. If a "renewable energy" search returns solar 0.91, wind 0.88, carbon-credit 0.84, then employee-volunteering 0.82 and recycling 0.79, the top three are squarely on-topic while the bottom two are corporate-responsibility near-misses — semantically adjacent but not energy. A threshold around 0.83 keeps the genuine matches (0.84 and up) and drops the rest; tune it by need (≈0.78 for high recall, ≈0.88+ for high precision), and calibrate against *this* corpus since scores are relative. To improve precision without changing the embedding model: add a cross-encoder re-rank, chunk more granularly, pre-filter by metadata, or go hybrid (vector + keyword). Common wrong answer: applying a fixed threshold like 0.8 across corpora — a good cutoff for taglines may be wrong for filings.
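The threshold choices above can be sketched as a plain filter. The scores and topic labels are the ones from this answer; the helper name `keep_above` is hypothetical, and the cutoffs are the illustrative values quoted above, not calibrated ones.

```python
# Scores from the worked example; topic labels follow the answer above.
hits = [
    ("solar", 0.91),
    ("wind", 0.88),
    ("carbon-credit", 0.84),
    ("employee-volunteering", 0.82),
    ("recycling", 0.79),
]


def keep_above(hits, threshold):
    """Keep (label, score) pairs at or above the threshold, best first."""
    kept = [h for h in hits if h[1] >= threshold]
    return sorted(kept, key=lambda h: h[1], reverse=True)


balanced = keep_above(hits, 0.83)   # drops the two near-misses
recall = keep_above(hits, 0.78)     # keeps all five
precision = keep_above(hits, 0.88)  # keeps only solar and wind
```

The same three cutoffs applied to a different corpus could keep or drop entirely different proportions, which is why the threshold has to be calibrated per corpus.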

kgr-3-search-query text-for-rag ↑ in bank

Write a complete vector similarity-search query: encode a user $question, search a chunk_embeddings index for the top-K matches, and return the chunk text with scores.

Encode the question with the *same* model, search the index, and project text + score: `WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS q CALL db.index.vector.queryNodes('chunk_embeddings', $topK, q) YIELD node, score RETURN node.text, score ORDER BY score DESC`. `queryNodes` takes the index name, a top-K count, and the query vector, and yields each matched node with its similarity score. All inputs — `$question`, `$openAiApiKey`, `$topK` — are query parameters. Crucially the question must be embedded with the same model used for the stored vectors, or the comparison is meaningless. Common wrong answer: hard-coding topK or the question in the string instead of parameterising them.
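To make the shape of the result concrete, here is a brute-force cosine top-K in plain Python. It is a stand-in for illustration only: it is not how the Neo4j index works internally (the index uses approximate search and may scale scores differently), and the toy vectors and texts are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, stored, k):
    """Return the k (text, score) pairs most similar to query_vec, best first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors (made up for illustration, not real embeddings).
stored = [
    ("solar farm expansion", [1.0, 0.0, 0.0]),
    ("wind turbine contracts", [0.8, 0.6, 0.0]),
    ("office relocation", [0.0, 0.0, 1.0]),
]
results = top_k([1.0, 0.0, 0.0], stored, k=2)
```

The dimension check mirrors the same-model rule: vectors from different models generally differ in length or in meaning, and in either case the comparison is not valid.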

kgr-3-vector-index text-for-rag ↑ in bank

Which statement about creating and using a Neo4j vector index is correct?

Why

Two hard requirements: the index’s declared dimension must equal the embedding model’s output (1536 for OpenAI models such as text-embedding-ada-002), and the index must reach ONLINE status, because a POPULATING index returns incomplete or empty results. Neo4j does not pad or truncate to reconcile a dimension mismatch (a mismatch raises a dimension error at query time), status very much matters, and a vector index does require a similarity function (e.g. cosine) in its config.

  1. The index dimension can differ from the model's output; Neo4j pads or truncates vectors to fit
  2. You can query the index the moment it's created; whether it's ONLINE or POPULATING is irrelevant
  3. A vector index doesn't need a similarity function specified in its options
  4. The dimension must match the model's output exactly, and the index must be ONLINE before querying

Correct: The dimension must match the model's output exactly, and the index must be ONLINE before querying

kgr-3-vs-vectordb text-for-rag ↑ in bank

When should you use Neo4j as your vector store versus a dedicated vector database alongside the graph? Give the deciding factor.

Co-locate in Neo4j when your retrieval genuinely *combines* similarity with traversal — the whole graph-RAG thesis: one query finds the nearest chunk and follows its relationships to pull in connected context, in a single round trip and one system to operate. Choose a dedicated vector DB (Pinecone, Weaviate, Milvus) alongside the graph when the vector workload is large and standalone and you need advanced ANN algorithms (IVF-PQ, ScaNN), frequent embedding-model swaps, or independent scaling of the vector tier — capabilities a single-engine setup may not match. The deciding question is whether the *queries* mix vector and graph, not whether the data merely has both. Common wrong answer: "always co-locate, fewer systems is always better" — fewer systems is simpler, but a heavy standalone vector workload can outgrow Neo4j's ANN support.

Chapter 4

kgr-4-10k-sections kg-construction ↑ in bank

Which sections of an SEC Form 10-K are used to construct the knowledge graph, and what does each contain?

Four sections drive 10-K knowledge-graph construction: Item 1 (business description), Item 1A (risk factors), Item 7 (management's discussion and analysis), and Item 7A (quantitative and qualitative market-risk disclosures). They are extracted from the raw XML into JSON along with the CIK and CUSIP identifiers. Common wrong answer: naming the financial statements / balance sheet — those are the numeric tables (largely Item 8); the course builds the graph from the narrative text sections, which embed well for semantic retrieval.

kgr-4-chunk-count kg-construction ↑ in bank

A 10-K section is 50,000 characters. Using chunk_size=2000 and chunk_overlap=200, about how many chunks result? Show the reasoning, not just the number.

Each chunk advances by chunk_size minus overlap, so the count is roughly the section length divided by that effective advance. For a 50,000-character section at chunk_size 2000 and overlap 200, the advance is 2000 − 200 = 1800. The formula is ceil((L − O) / (S − O)) for length L, size S, overlap O, which gives ceil((50,000 − 200) / 1800) = ceil(27.67) = 28 chunks. Treat this as an estimate: a splitter that breaks on separators rather than at exact character counts can land a few chunks either side. Common wrong answer: dividing by the full chunk size (50,000 / 2000 = 25), which ignores overlap; overlap shrinks the advance and so slightly raises the count.
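The arithmetic can be checked with a short function. This is a sketch of the ceil((L − O) / (S − O)) estimate for an idealised fixed-size sliding window; the function name `estimate_chunks` is hypothetical, and a real splitter that breaks on separators will not match it exactly.

```python
import math


def estimate_chunks(length: int, size: int, overlap: int) -> int:
    """Estimate chunk count for a fixed-size sliding window with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    if length <= size:
        return 1
    return math.ceil((length - overlap) / (size - overlap))


with_overlap = estimate_chunks(50_000, 2_000, 200)  # ceil(49,800 / 1,800)
no_overlap = estimate_chunks(50_000, 2_000, 0)      # ceil(50,000 / 2,000)
```

Here `with_overlap` is 28 and `no_overlap` is 25, the gap the common wrong answer misses.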

kgr-4-chunk-size-tradeoff kg-construction ↑ in bank

You raise the chunk size from 2,000 to 8,000 characters. What is the most likely effect on retrieval quality?

Why

A larger chunk blends more topics into one vector, so the embedding becomes generic and matches specific queries less precisely — the classic size/precision tradeoff. Bigger chunks reduce the count (fewer, larger pieces), so embedding cost falls rather than rises; more context does not reliably improve precision here; and chunk size does not push all cosine scores uniformly higher.

  1. Retrieval precision improves because each chunk now carries far more context
  2. The chunk count rises sharply, which mainly raises embedding cost
  3. Embeddings get more generic, so retrieval loses precision on specific queries
  4. Cosine similarity scores become uniformly higher for every query

Correct: Embeddings get more generic, so retrieval loses precision on specific queries

kgr-4-chunkseqid kg-construction ↑ in bank

Why is chunkSeqId stored in each chunk’s metadata?

Why

chunkSeqId preserves a chunk’s position within its section, so Module 5 can wire chunks into a NEXT linked list for adjacent-context retrieval. The unique key MERGE dedupes on is chunkId, not chunkSeqId; similarity scores are computed at query time, not stored; and the filing/company are captured by formId/cik, not the sequence id.

  1. It records each chunk's order within its section, enabling NEXT links in later modules
  2. It is the unique key that MERGE uses to prevent duplicate chunk nodes on reload
  3. It caches the chunk's cosine similarity score so results can be ranked faster
  4. It identifies which company and filing the chunk was originally extracted from

Correct: It records each chunk's order within its section, enabling NEXT links in later modules

kgr-4-cik-cusip kg-construction ↑ in bank

In an SEC Form 10-K pipeline, what role do CIK and CUSIP play?

Why

CIK (Central Index Key) and CUSIP are identifiers: the CIK keys a company across all its SEC filings, and the CUSIP (its first 6 characters, stored as cusip6) identifies the issuer across datasets, so they become the linking keys when the graph gains relationships. They are not text sections (those are Item 1/1A/7/7A), not vector-index parameters (dimensions and similarity function), and not LangChain chain types.

  1. Company and security identifiers that join a filing to the same company across datasets
  2. The two main text sections of the filing that get chunked and embedded for retrieval
  3. Vector-index settings that fix the embedding dimensionality and similarity function
  4. LangChain chain types that wrap retrieval and LLM generation into one call

Correct: Company and security identifiers that join a filing to the same company across datasets

kgr-4-constraint-why kg-construction ↑ in bank

Why create a uniqueness constraint on Chunk.chunkId before bulk-loading with MERGE?

Why

The constraint does two jobs: it guarantees no duplicate chunkId, and it creates an implicit index so each MERGE existence check is an index lookup rather than a full scan of every Chunk node — turning quadratic loading into roughly linear. It has nothing to do with encryption or score normalisation, and ON CREATE SET (not the constraint) is what limits property writes to first creation.

  1. It encrypts the chunkId to hide its value from query logs and caches
  2. It converts cosine scores into a 0 to 1 range for the vector index
  3. It blocks duplicate chunks and adds an index, so each MERGE is a lookup not a scan
  4. It lets MERGE rewrite properties on every run rather than only the first

Correct: It blocks duplicate chunks and adds an index, so each MERGE is a lookup not a scan

kgr-4-from-existing-graph kg-construction ↑ in bank

You have already embedded and indexed your chunks in Neo4j. Which Neo4jVector constructor connects to them, and what currency caveat applies to its import?

Use `from_existing_graph` when the chunks are already embedded and indexed in Neo4j (as in this pipeline) — it attaches to that index instead of re-embedding. Currency caveat: the import moved to the dedicated langchain-neo4j package (2024-11), so the current import is `from langchain_neo4j import Neo4jVector`, not `from langchain_community.vectorstores import Neo4jVector`. The connect/retrieve/generate behaviour is unchanged; only the package path differs. Common wrong answer: re-running an embedding constructor (from_documents), which would redundantly re-embed text you already indexed.

kgr-4-hallucination-debug kg-construction ↑ in bank

A RAG system over only NetApp’s 10-K answers “What is Apple’s revenue?” with a confident, specific figure attributed to Apple. Identify what went wrong, the root cause, and how to prevent it.

Diagnosis: entity mis-attribution plus invented facts. The graph holds only NetApp's filing, so a query about Apple retrieves NetApp's nearest chunks (highest cosine among what exists), and the LLM applies that context to Apple and fabricates a revenue figure. Root cause: no entity-alignment check and no refusal instruction in the prompt. Fix: a prompt that says "use ONLY this context", names the source companies, and instructs "if the context is not about the company in the question, say you don't have that data" — plus retrieval-score gating so out-of-scope questions return no context at all. Note the limit: prompt engineering reduces but does not eliminate hallucination, so pair it with a score threshold.
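A minimal sketch of the two guards together, a refusal-aware prompt plus a retrieval-score gate. The template wording, the helper name `build_prompt`, the 0.85 cutoff, and the sample hits are all assumptions for illustration; the cutoff in particular would need calibrating against the real corpus.

```python
REFUSAL = "I don't have data on that company."

PROMPT_TEMPLATE = """Use ONLY the context below, which comes from filings by: {companies}.
If the context is not about the company in the question, reply exactly:
"{refusal}"

Context:
{context}

Question: {question}
"""


def build_prompt(question, hits, companies, min_score=0.85):
    """Return a grounded prompt, or None when no hit clears the score gate.

    hits is a list of (chunk_text, score) pairs; min_score is an assumed
    cutoff, not a recommended value.
    """
    kept = [text for text, score in hits if score >= min_score]
    if not kept:
        return None  # caller answers with REFUSAL and skips the LLM call
    return PROMPT_TEMPLATE.format(
        companies=", ".join(companies),
        refusal=REFUSAL,
        context="\n---\n".join(kept),
        question=question,
    )


# Made-up hits: a weak match for an out-of-scope question, a strong one in scope.
weak_hits = [("NetApp sells storage systems.", 0.74)]
gated = build_prompt("What is Apple's revenue?", weak_hits, ["NetApp"])
strong_hits = [("NetApp sells storage systems.", 0.93)]
prompt = build_prompt("What does NetApp sell?", strong_hits, ["NetApp"])
```

The gate handles the case where nothing relevant exists; the prompt instruction handles the case where a chunk scores well but is about a different company.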

kgr-4-idempotent-merge kg-construction ↑ in bank

How do you load chunk nodes so the pipeline can be re-run safely after a failure without creating duplicates? Name the pieces and sketch the Cypher.

Idempotency means re-running the pipeline yields the same graph, which retry-prone ingestion needs. Three pieces deliver it: a uniqueness constraint on the stable key (no duplicates, plus an implicit index), MERGE on that key (match-or-create instead of always insert), and ON CREATE SET to write properties only on first creation. Concretely: `MERGE (c:Chunk {chunkId: $p.chunkId}) ON CREATE SET c.text = $p.text, c.formId = $p.formId`, behind `CREATE CONSTRAINT ... FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE`. Common wrong answer: using CREATE — it inserts a new node every run, so a retry duplicates every chunk.

kgr-4-merge-volatile kg-construction ↑ in bank

A teammate’s loader uses MERGE (c:Chunk {chunkId: $id, loadedAt: timestamp()}) and the graph keeps growing duplicate chunks on every run. Diagnose the bug and fix it.

The bug is MERGE on a volatile property. `MERGE (c:Chunk {chunkId: $id, loadedAt: timestamp()})` includes a timestamp in the match pattern, so every run produces a different pattern, never matches the existing node, and creates a duplicate — silently corrupting the graph at scale. Fix: MERGE on the stable identifier only, then write volatile fields in ON CREATE SET / ON MATCH SET: `MERGE (c:Chunk {chunkId: $id}) ON CREATE SET c.loadedAt = timestamp() ON MATCH SET c.updatedAt = timestamp()`. Key rule: MERGE's match pattern must contain only stable keys. Common wrong answer: blaming MERGE itself or a missing constraint — MERGE is correct; the bug is putting a volatile value inside its match pattern.
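The failure mode can be reproduced without a database. The sketch below is an in-memory analogy, not Neo4j itself: a dict keyed on the match pattern stands in for MERGE, and a counter stands in for `timestamp()`. All names are hypothetical.

```python
import itertools

_clock = itertools.count(1)  # stand-in for timestamp(): differs on every call


def merge(store, match_props, on_create=None):
    """Match-or-create a node on the exact properties in match_props."""
    key = tuple(sorted(match_props.items()))
    if key not in store:
        store[key] = {**match_props, **(on_create or {})}
    return store[key]


def load_buggy(store, chunk_id):
    # Volatile value inside the match pattern: never matches an earlier run.
    merge(store, {"chunkId": chunk_id, "loadedAt": next(_clock)})


def load_fixed(store, chunk_id):
    # Stable key in the pattern; volatile value written only on creation.
    merge(store, {"chunkId": chunk_id}, on_create={"loadedAt": next(_clock)})


buggy, fixed = {}, {}
for _ in range(3):  # three pipeline runs over the same chunk
    load_buggy(buggy, "chunk-0001")
    load_fixed(fixed, "chunk-0001")
```

After three runs `buggy` holds three nodes for one chunk while `fixed` holds one, with its original `loadedAt` intact.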

kgr-4-metadata-fields kg-construction ↑ in bank

What metadata is stored with each chunk, and why store it rather than just the text and its embedding?

Each chunk record stores its text plus metadata that preserves origin, position, and identity: item (which section), chunkSeqId (position within the section), formId (the filing it belongs to), chunkId (its unique key), source (a link back to the SEC filing), and the entity ids cik and cusip6. The point is provenance — you can rebuild document structure and link chunks to entities later only if you captured these at ingest. Common wrong answer: storing just the text and embedding; that loses order and origin, so you can never reconstruct NEXT order or trace a chunk to its company.

kgr-4-neo4jvector-params kg-construction ↑ in bank

What does Neo4jVector.from_existing_graph need in order to connect LangChain to an existing Neo4j vector index? Name the key parameters.

`Neo4jVector.from_existing_graph` connects LangChain to an index you already built, so it needs to know where the vectors live: the index_name (the vector index, e.g. "form10kChunks"), the node_label (which nodes, e.g. "Chunk"), the text_node_properties (which property holds the text, e.g. ["text"]), and the embedding_node_property (where the vector is stored, e.g. "textEmbedding"), plus the connection url/username/password and an embeddings object. Common wrong answer: expecting it to re-embed the text or rebuild the index. When both already exist, as in this pipeline, from_existing_graph simply attaches to them; LangChain's implementation creates the index or fills in embeddings only where they are missing.
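A hedged sketch of the call, using the example names from this answer. The wrapper function `connect` is hypothetical, the import is deferred so the block loads without the package installed, and the connection values are left for the caller to supply.

```python
# Index and property names follow the examples in the answer above.
VECTOR_STORE_ARGS = {
    "index_name": "form10kChunks",
    "node_label": "Chunk",
    "text_node_properties": ["text"],
    "embedding_node_property": "textEmbedding",
}


def connect(embeddings, url, username, password):
    """Attach LangChain to the existing chunk index and embeddings."""
    # Imported lazily so this sketch loads without the package installed.
    from langchain_neo4j import Neo4jVector

    return Neo4jVector.from_existing_graph(
        embedding=embeddings,
        url=url,
        username=username,
        password=password,
        **VECTOR_STORE_ARGS,
    )
```

The `embeddings` object must wrap the same model that produced the stored vectors, since it is what encodes each incoming question.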

kgr-4-prevent-hallucination kg-construction ↑ in bank

A RAG system holding only NetApp’s filing answers a question about Apple using NetApp’s data. Which fix most directly addresses this?

Why

This is entity mis-attribution: the retriever returned NetApp’s nearest chunks and the prompt never told the model to decline. The direct fix is a prompt that restricts the model to the retrieved context and refuses when the entity is missing (“say you don’t know”), ideally paired with retrieval-score gating. Bigger chunks, higher dimensionality, and a different similarity function change retrieval mechanics but none of them stop the model from answering out-of-scope.

  1. Increase chunk size so each chunk carries more of the company's context
  2. Raise the vector-index dimensionality to make the embeddings more accurate
  3. Switch the index similarity function from cosine to Euclidean distance
  4. Instruct the prompt to use only the retrieved context and refuse missing entities

Correct: Instruct the prompt to use only the retrieved context and refuse missing entities

kgr-4-pure-vector-limits kg-construction ↑ in bank

After constructing the graph in this chapter, Neo4j holds embedded chunks but no relationships. Name three things this pure vector store cannot do, and what fixes them.

After this chapter the chunks are embedded but disconnected — Neo4j is functionally just a vector store. So you cannot follow a chunk back to its source filing, cannot retrieve the adjacent chunk for expanded context, and cannot traverse from a chunk to related entities (the company that filed it, other filings, investors). All three need relationships, which Module 5 adds as NEXT (chunk order), PART_OF (chunk to form), and SECTION links. Common wrong answer: "similarity search is broken" — vector search works fine; what is missing is *connection*, not retrieval.

kgr-4-rag-pipeline kg-construction ↑ in bank

Describe the three stages of the LangChain retrieval-QA chain over Neo4j, in order.

Three stages: connect, retrieve, generate. Connect — Neo4jVector wraps the graph as a standard vector store, running Cypher similarity search under the hood. Retrieve — the question is encoded with the same embedding model and matched against the chunk embeddings; the top-K nearest chunks come back. Generate — those chunks are stuffed into the LLM prompt as context, and the LLM answers grounded in them. Common wrong answer: putting generation before retrieval, or skipping retrieval — without the retrieved context the LLM is just answering from its parameters, which is the non-RAG baseline.

kgr-4-retrievalqa-stuff kg-construction ↑ in bank

In a "stuff" retrieval-QA chain, what does the retriever hand to the LLM?

Why

The "stuff" chain type “stuffs” the top-K retrieved chunks straight into the prompt as context, and the LLM answers from them. It does not reduce to a single chunk (that would be re-ranking), it does not pass Cypher (LLM-generated Cypher is Module 7’s approach), and it never hands over raw vectors — the LLM reads text, not embeddings.

  1. A single re-ranked chunk chosen as the single best match for the query
  2. The top-K retrieved chunks, concatenated into the prompt as context
  3. A generated Cypher query for the LLM to execute against the graph
  4. The raw embedding vectors for the LLM to decode back into text

Correct: The top-K retrieved chunks, concatenated into the prompt as context

kgr-4-when-graph-overkill kg-construction ↑ in bank

When is a pure vector store the right choice and when should you invest in graph relationships? Give the deciding factor, not just a preference.

The deciding question is whether the *queries* combine similarity with traversal, not whether the data has latent relationships. If retrieval is answered by semantic similarity alone — FAQ lookup, "find passages like this" — a pure vector store is simpler to operate and the graph adds overhead you never cash in. Add graph structure when retrieval needs connection: adjacent chunks, the parent document, the filing company, multi-hop context. This mirrors the Module 3 vector-DB rule (co-locate when queries mix vector and graph). Common wrong answer: "always use a graph because the data is connected" — relationships in the data only pay off if the queries actually traverse them.

Chapter 5

kgr-5-chunk-window-why kg-relationships ↑ in bank

What is a chunk window, and why does it improve RAG answers without changing the embedding model?

A chunk window returns the best-matching chunk plus its neighbours in the NEXT linked list, so the LLM sees continuous context instead of one isolated fragment. It matters because a single chunk often holds only part of an answer: a fact or detail can sit in the adjacent chunk that vector search didn't rank first. Expanding to a window of w = 2k + 1 chunks, with k neighbours on each side (the default k = 1 gives 3 chunks), recovers that surrounding context and improves answer completeness, all without changing the embedding model or re-running the search. Common wrong answer: "it improves the similarity score". The window doesn't change scores; it changes how much context reaches the LLM.

kgr-5-custom-retrieval kg-relationships ↑ in bank

What is a custom retrieval query in Neo4jVector, and how does it let one retrieval step combine vector search with graph traversal?

A custom retrieval query is Cypher you pass to Neo4jVector through the `retrieval_query` argument. By default the retriever returns the matched chunk's text verbatim; the custom query runs after the vector match and can traverse the graph before returning text to the LLM, here following NEXT in both directions to assemble a chunk window. That is exactly how vector search and graph traversal fuse into one retrieval step: the index finds the entry chunk, the Cypher tail expands it. Common wrong answer: thinking it replaces the vector search. It extends it, receiving the vector hit as input.

kgr-5-custom-retrieval-vars kg-relationships ↑ in bank

A custom retrieval query receives two variables from the vector index — name them — and sketch how it returns a chunk window to the LLM.

The custom query receives two values from the vector search: `node` (the matched Chunk) and `score` (its similarity). It typically starts `WITH node, score`, then binds a window path such as `MATCH window = (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)`, keeps the longest with `ORDER BY length(window) DESC LIMIT 1`, unwinds the window's nodes, and returns each chunk's text, the score, and a metadata map. The shape the chain expects back is text, score, metadata. Common wrong answer: returning only the single node's text, which throws away the window you just traversed.

kgr-5-missing-filter-bug kg-relationships ↑ in bank

A 10-chunk document (Introduction seq 0–4, Methods seq 0–4, sequence ids restarting per section) was loaded with NEXT but no section filter. How many edges should exist, what does the bug add, and how would you detect the bad edges?

Under per-section numbering (Introduction seq 0–4, Methods seq 0–4), the correct build with the `item` filter gives 4 edges per section = 8 total. Without the filter, the `chunkSeqId + 1` join matches by number across the whole form, so each chunk also links to the next-numbered chunk in the other section (Introduction seq-3 → Methods seq-4, and so on). That adds 8 spurious cross-section edges, for 16 in total, and forks the list. Detect them by querying for NEXT edges whose endpoints sit in different sections: `MATCH (a)-[:NEXT]->(b) WHERE a.item <> b.item RETURN a, b`; any rows are boundary bleed. Fix by deleting those edges and rebuilding with the `item` filter. Common wrong answer: assuming a single stray edge (one "last-of-A → first-of-B" link). The unfiltered join adds a cross-section edge at every sequence number that lines up, not just one.
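The edge counts can be verified with a small in-memory simulation of the join. This is a sketch in plain Python rather than Cypher; the form id `f1` and the helper name `next_edges` are made up for illustration.

```python
from itertools import product

# Ten chunks in one form; sequence ids restart at 0 in each section.
chunks = [
    {"formId": "f1", "item": item, "chunkSeqId": seq}
    for item in ("Introduction", "Methods")
    for seq in range(5)
]


def next_edges(chunks, filter_item):
    """Pair chunks whose sequence ids differ by one, as the join would."""
    edges = []
    for c1, c2 in product(chunks, repeat=2):
        if c1["formId"] != c2["formId"]:
            continue
        if filter_item and c1["item"] != c2["item"]:
            continue
        if c2["chunkSeqId"] == c1["chunkSeqId"] + 1:
            edges.append((c1, c2))
    return edges


correct = next_edges(chunks, filter_item=True)
buggy = next_edges(chunks, filter_item=False)
# Same check as the detection query: endpoints in different sections.
bleed = [(a, b) for a, b in buggy if a["item"] != b["item"]]
```

The filtered build yields 8 edges; the unfiltered one yields 16, of which 8 cross a section boundary.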

kgr-5-multihop-trace kg-relationships ↑ in bank

Trace the traversal from an arbitrary chunk to the first chunk of its own section. Which relationships do you follow, and in which direction?

Hop up, then down. From the chunk, follow PART_OF to its Form (`(c:Chunk)-[:PART_OF]->(f:Form)`), then follow SECTION from the form, filtered to the chunk's own section, down to that section's first chunk (`(f)-[s:SECTION]->(first:Chunk) WHERE s.f10kItem = c.item`). The full pattern is `(c:Chunk)-[:PART_OF]->(:Form)-[:SECTION]->(:Chunk)` with the SECTION's f10kItem matching the starting chunk's item. This is the two-hop trace the module's last objective asks for: chunk → form → section entry point. Common wrong answer: walking NEXT backward — it works only if no edge is missing and is O(n) in section length, whereas SECTION jumps in one hop.

kgr-5-next-cypher kg-relationships ↑ in bank

Write the Cypher that creates NEXT relationships between sequential chunks within a single form section. Which clauses keep the linked list from crossing section boundaries?

Match two chunks in the same section whose sequence ids differ by one, then MERGE a NEXT edge. The pattern is: match `(c1:Chunk), (c2:Chunk)` where `c1.formId = c2.formId`, `c1.item = c2.item`, and `c2.chunkSeqId = c1.chunkSeqId + 1`, then `MERGE (c1)-[:NEXT]->(c2)`. The two equality filters on formId and item are what keep the list inside one section; MERGE (not CREATE) makes it idempotent so re-running adds no duplicates. Common wrong answer: matching on formId and `chunkSeqId + 1` without the item filter. Because sequence ids restart in each section, that join links a chunk to the next-numbered chunk in every section of the form, not only its own.

kgr-5-next-direction kg-relationships ↑ in bank

A NEXT relationship is created between two chunks when which condition holds?

Why

NEXT links consecutive chunks within one section: same formId, same item, and chunkSeqId differing by exactly one. Matching across different sections is the bug the section filter exists to prevent; NEXT is about document order, not cosine similarity; and it stays within a single form, not across a company’s filings.

  1. They share a formId and item and their chunkSeqId differ by one
  2. They share a formId but come from two different sections
  3. They have the highest cosine similarity to each other
  4. They belong to the same company across different forms

Correct: They share a formId and item and their chunkSeqId differ by one

kgr-5-schema-design kg-relationships ↑ in bank

Design a graph schema for technical documentation (books contain chapters contain sections, each section has embedded chunks). Give the node and relationship types, and say which relationship serves each of: sequential reading, section jump, chapter jump, and similarity search.

Reuse the SEC pattern. Node types: Book, Chapter, Section, Chunk (with text + embedding). Membership runs upward with PART_OF (Chunk to Section, Section to Chapter, Chapter to Book), and order runs along NEXT (Chunk to Chunk within a section). For direct jumps, add entry-point relationships — Book to Chapter and Chapter to Section — so a reader can jump by name without walking the whole hierarchy. Each navigation pattern maps to one mechanism: sequential reading follows NEXT, section/chapter jumps follow the entry-point edges, upward context follows PART_OF, and similarity uses the vector index on Chunk embeddings then PART_OF for context. Key design call: which entry-point edges to materialise — they trade a little storage for direct navigation.

kgr-5-section-filter kg-relationships ↑ in bank

Why is filtering by section (not just by form) necessary when creating NEXT relationships, and what concretely breaks if you omit it?

Section filtering is a correctness requirement because chunk sequence ids are assigned within a section (they restart at 0), and NEXT links chunks whose ids differ by one. Filter only on formId and the `chunkSeqId + 1` join matches by number across the whole form — so a chunk links to the next-numbered chunk in every section, not just its own, forking the linked list across section boundaries. A chunk window built at a boundary then pulls in unrelated context from an adjacent section, degrading answer quality. The fix: require matching formId AND item on both chunks when creating NEXT. Common wrong answer: picturing the bug as one stray edge — the unfiltered join adds a cross-section edge wherever sequence numbers line up.

kgr-5-star-notation kg-relationships ↑ in bank

In a chunk-window query, what does *0..1 in -[:NEXT*0..1]-> do?

Why

*0..1 is a variable-length path matching zero or one NEXT hop. The “zero” is the point: at the first or last chunk the missing side matches no hop, so the pattern still returns the available neighbours instead of failing. It is not a fixed single hop (that is the pattern that breaks at boundaries), not a 0–10 range, and not a score filter.

  1. It repeats the NEXT hop exactly once on each side of the target
  2. It matches between 0 and 10 NEXT hops to build a wide window
  3. It filters chunks whose similarity score falls between 0 and 1
  4. It matches zero or one NEXT hop, so the window still works at list boundaries

Correct: It matches zero or one NEXT hop, so the window still works at list boundaries

kgr-5-three-relationships kg-relationships ↑ in bank

Which option correctly maps the three relationship types added in Module 5 to their roles?

Why

NEXT orders chunks within a section (the linked list), PART_OF points a chunk up to its parent form (document membership), and SECTION points a form down to the first chunk of each section (the entry point, tagged f10kItem). The near-miss differs only in the SECTION clause — SECTION links to a section’s first chunk, not to a company; companies don’t enter until Module 6.

  1. NEXT orders chunks; PART_OF joins a chunk to its form; SECTION links a form to its company
  2. NEXT joins a form to its sections; PART_OF orders chunks; SECTION links a chunk to the next
  3. NEXT links a chunk to its company; PART_OF indexes sections; SECTION orders chunks by position
  4. NEXT orders chunks; PART_OF joins a chunk to its form; SECTION links a form to each section's first chunk

Correct: NEXT orders chunks; PART_OF joins a chunk to its form; SECTION links a form to each section's first chunk

kgr-5-trace-path kg-relationships ↑ in bank

You hold a chunk in the middle of a section and want that section’s first chunk. Which traversal gets you there?

Why

The chunk reaches its section’s entry point by hopping up to the form with PART_OF, then down with SECTION (which points at the seq-0 chunk): (:Chunk)-[:PART_OF]->(:Form)-[:SECTION]->(:Chunk). Following NEXT forward lands on the last chunk, not the first; PART_OF goes to a form not a company; and a vector search on the title is neither reliable nor structural.

  1. Chunk to Form via PART_OF, then Form to the section's first chunk via SECTION
  2. Follow NEXT forward from the chunk to the end of the section
  3. Chunk to Company via PART_OF, then Company to Form via SECTION
  4. Run a vector search for the section title and take the top hit

Correct: Chunk to Form via PART_OF, then Form to the section's first chunk via SECTION

kgr-5-tradeoff-hybrid kg-relationships ↑ in bank

Graph-augmented retrieval beat pure vector retrieval on quality but cost more per query. Give the deciding factor for choosing each, and design a hybrid that captures most of the gain at lower cost.

Graph augmentation bought a real quality gain in the support-RAG comparison — roughly 78 to 89 percent accuracy and 62 to 84 percent completeness — but at higher latency and about double the monthly cost. The deciding factor is whether questions actually span chunk or document boundaries and whether answer quality drives value; for self-contained FAQ lookups the pure vector path is enough. The hybrid captures most of the gain cheaply: run vector search first, then trigger the graph window only on a cheap signal — a low top similarity score or a multi-entity question — so the expensive traversal fires on the ~30 to 40 percent of queries that need it, keeping average cost near the vector baseline. Common wrong answer: "always add the graph because accuracy is higher" — you pay for traversal you don't always use.
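The hybrid can be sketched as a cheap router plus a blended-cost estimate. Everything numeric here is an assumption for illustration: the 0.85 score floor, the one-entity limit, and the 2x relative cost of a graph-augmented query (taken loosely from "about double" above). Function names are hypothetical.

```python
def needs_graph_window(top_score, entity_count,
                       score_floor=0.85, max_entities=1):
    """Cheap trigger: expand to a graph window only when vector search looks weak.

    score_floor and max_entities are assumed values for illustration;
    both would need tuning on real traffic.
    """
    return top_score < score_floor or entity_count > max_entities


def average_cost(trigger_rate, vector_cost=1.0, graph_cost=2.0):
    """Expected per-query cost when only a fraction of queries hit the graph."""
    return (1 - trigger_rate) * vector_cost + trigger_rate * graph_cost


confident_faq = needs_graph_window(top_score=0.93, entity_count=1)
weak_match = needs_graph_window(top_score=0.71, entity_count=1)
multi_entity = needs_graph_window(top_score=0.93, entity_count=3)
blended = average_cost(0.35)  # midpoint of the 30 to 40 percent range
```

Under these assumptions the blended cost comes to roughly 1.35 times the vector-only cost, against 2 times for always-on graph augmentation.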

kgr-5-var-length-meaning kg-relationships ↑ in bank

What does a variable-length path like *0..1 match, and why is it needed at the start and end of a chunk linked list?

A variable-length path matches a *range* of hops instead of a fixed count: `*0..1` matches zero or one NEXT hop, `*0..2` zero to two, `*1..3` one to three. It is needed at linked-list boundaries because a fixed two-hop window (before to middle to after) has nothing to match at the first chunk (no predecessor) or last chunk (no successor), so it returns zero rows. With `*0..1` the missing side matches zero hops — the endpoint collapses onto the target — so the query still returns the neighbours that do exist. Pair it with `ORDER BY length(window) DESC LIMIT 1` to keep the longest window available at each position. Common wrong answer: reading `*0..1` as "exactly one hop" — the zero is the point, and it is what lets the pattern survive at the list ends.
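
The boundary behaviour can be simulated on a plain Python list standing in for the NEXT linked list. This is only an analogy under that assumption (the function names are made up for the example); it does not run Cypher.

```python
def fixed_window(chunks, i, k=1):
    """Exactly k NEXT hops on each side; no match at the list ends."""
    if i - k < 0 or i + k >= len(chunks):
        return []  # the pattern has nothing to bind to, so zero rows
    return [chunks[i - k:i + k + 1]]


def variable_window(chunks, i, k=1):
    """Between 0 and k hops on each side, like NEXT*0..k around the target."""
    rows = []
    for before in range(k + 1):
        for after in range(k + 1):
            if i - before >= 0 and i + after < len(chunks):
                rows.append(chunks[i - before:i + after + 1])
    return rows


def longest_window(chunks, i, k=1):
    """Mimic ORDER BY length(window) DESC LIMIT 1."""
    return max(variable_window(chunks, i, k), key=len)
```

For `chunks = ["c0", "c1", "c2", "c3"]`, `fixed_window(chunks, 0)` is empty, while `longest_window(chunks, 0)` still returns `["c0", "c1"]` because the missing predecessor side matched zero hops.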

kgr-5-vector-plus-graph kg-relationships ↑ in bank

In the NetApp example, what did adding a chunk window reveal that single-chunk retrieval missed?

Why

The single best-matching chunk gave the high-level answer (“enterprise storage and data management”); the window followed NEXT into the adjacent chunk and surfaced the specific Keystone product detail. Vector search found the entry point, graph traversal added the surrounding context — the window does not change the score, add a model, or speed anything up.

  1. A higher cosine similarity score for the top-matching chunk
  2. A second embedding model that improved overall recall
  3. The Keystone product detail held in an adjacent chunk
  4. A faster response by skipping the graph traversal step

Correct: The Keystone product detail held in an adjacent chunk

kgr-5-window-comparison kg-relationships ↑ in bank

Compare the RAG answer for “What is NetApp’s primary business?” with and without a chunk window. What specifically does the window add, and why?

Without a window, retrieval returns the single best-matching chunk, so the answer is whatever that one fragment holds — for "What is NetApp's primary business?" that was "enterprise storage and data management." With a window, the query follows NEXT to the adjacent chunks and the LLM also sees the Keystone product detail that lived next door, producing a more complete answer. The lesson generalises: vector search picks the entry point, graph traversal supplies the surrounding context that a lone chunk omits. Common wrong answer: "the windowed result is more accurate because the score is higher" — accuracy/completeness rises from added context, not a changed score.

kgr-5-window-size kg-relationships ↑ in bank

A chunk window takes k hops in each direction. For k = 2, how many chunks does a full (non-boundary) window contain?

Why

The window size is w = 2k + 1: the target plus k neighbours on each side. For k = 2 that is 2(2) + 1 = 5 chunks. The “4” answer forgets to count the target itself, “3” is the k = 1 default, and “3k + 1” is not the window formula.

  1. 4 chunks, with k chunks on each side of the target
  2. 5 chunks — the target plus two on each side
  3. 3 chunks, the same as the default window
  4. 7 chunks, following a 3k plus 1 rule

Correct: 5 chunks — the target plus two on each side
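
As a quick check of the formula, here is a small sketch. `full_window_size` is the w = 2k + 1 rule from the rationale; `clipped_window_size` is an added illustration (not from the course) of how many chunks remain when a list end cuts the window short.

```python
def full_window_size(k: int) -> int:
    """Target chunk plus k neighbours on each side."""
    return 2 * k + 1


def clipped_window_size(position: int, length: int, k: int) -> int:
    """Chunks actually available when the window is cut off by a list end."""
    start = max(0, position - k)
    end = min(length - 1, position + k)
    return end - start + 1
```

`full_window_size(2)` is 5, and in a ten-chunk section the first chunk only gets `clipped_window_size(0, 10, 2)`, which is 3.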

Chapter 6

kgr-6-company-id kg-expansion ↑ in bank

Which property is the unique identifier when creating Company nodes?

Why

Company nodes are MERGE’d on cusip6 — the issuer identifier that both datasets carry, which is exactly why it can bridge Company to Form. managerCik keys Manager nodes, companyName is an unstable display string (poor key), and formId identifies a filing, not a company.

  1. managerCik, the investment firm's Central Index Key
  2. cusip6, the six-character company identifier
  3. companyName, the company's display name
  4. formId, the filing's unique identifier

Correct: cusip6, the six-character company identifier

kgr-6-complete-schema kg-expansion ↑ in bank

List the complete graph schema after Module 6: all node types, all relationship types, and the indexes.

Four node types: Chunk (text + embedding), Form (the 10-K), Company, Manager. Five relationship types: Chunk NEXT Chunk (linked list), Chunk PART_OF Form (membership), Form SECTION Chunk (section entry), Company FILED Form (filing, via CUSIP), Manager OWNS_STOCK_IN Company (investment, with shares/value/quarter). Two indexes: a vector index on Chunk.textEmbedding and a full-text index on Manager.managerName. Together these support semantic search (vector), keyword search (full-text), sequential and hierarchical document navigation (NEXT/PART_OF/SECTION), and cross-entity traversal (FILED/OWNS_STOCK_IN). Common wrong answer: forgetting the indexes. They are part of the schema's capability, not an afterthought.

kgr-6-cusip-breaks kg-expansion ↑ in bank

Automatic CUSIP-based linking between Company and Form nodes would fail in which situation?

Why

Linking on a shared key works only if both datasets actually share that key. If one identifies companies by CUSIP and the other by ISIN or ticker, there is no common value to MERGE on, so you need an entity-resolution layer. Shared CUSIPs are exactly what makes linking work; node creation order is irrelevant to MERGE; and a manager holding several companies is normal, not a failure.

  1. When both datasets store the same CUSIP 6 identifier
  2. When the Company node is created before the Form node
  3. When one manager owns stock in several companies at once
  4. When the two datasets use different id schemes (CUSIP versus ticker)

Correct: When the two datasets use different id schemes (CUSIP versus ticker)

kgr-6-cusip-bridge kg-expansion ↑ in bank

How does the CUSIP identifier let you link the independently created Company nodes to the existing Form nodes?

CUSIP 6 is a universal company identifier that appears in *both* the Form 10-K data (on Form nodes) and the Form 13 data (on Company nodes), so the two independently built node sets connect with no manual matching: `MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f)`. This is the canonical cross-dataset linking pattern — a shared, stable, universal key lets MERGE join records that were never created together. Common wrong answer: matching on company name, which is unstandardised and produces both missed and wrong matches.
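
The same shared-key join can be sketched outside the database. This is a plain-Python analogy of the MATCH/WHERE/MERGE above, with records as dictionaries; the identifier values are placeholders, not real CUSIPs or form ids.

```python
def link_filed(companies, forms):
    """Return (cusip6, formId) pairs for every company and form sharing a cusip6."""
    forms_by_cusip = {}
    for form in forms:
        forms_by_cusip.setdefault(form["cusip6"], []).append(form["formId"])
    edges = set()  # a set keeps reruns idempotent, as MERGE does
    for company in companies:
        for form_id in forms_by_cusip.get(company["cusip6"], []):
            edges.add((company["cusip6"], form_id))
    return edges


# Placeholder records, built independently of each other.
companies = [{"cusip6": "AAA111", "companyName": "Example Co"},
             {"cusip6": "BBB222", "companyName": "Other Co"}]
forms = [{"formId": "form-001", "cusip6": "AAA111"},
         {"formId": "form-002", "cusip6": "CCC333"}]

filed_edges = link_filed(companies, forms)
```

Only the pair with a common `cusip6` is linked; the company and form with no counterpart are left unconnected rather than guessed at.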

kgr-6-entity-resolution kg-expansion ↑ in bank

CUSIP-based linking is automatic here. What breaks when you integrate two datasets that identify companies differently, and how do you handle it?

CUSIP linking works because it is a universal identifier shared by both datasets. When two sources use *different* schemes (one CUSIP, another ISIN or ticker, or only company-name strings) there is no common value to MERGE on. A join on the mismatched keys silently links nothing (missed matches, false negatives), and falling back to raw name strings adds wrong matches (false positives) on top. The fix is an entity-resolution layer: prefer a canonical id where one exists (DUNS, SEC CIK); otherwise fuzzy-match names, verify with secondary attributes (address, industry), and store a canonical node with an `aliases` property for the variants. Entity resolution is usually the hardest part of KG integration; the join is easy once the keys agree. Common wrong answer: "just match on name", which is precisely what fails.
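
A minimal sketch of the name-matching fallback, using only the standard library's `difflib`. The suffix list, the 0.85 threshold, and the record layout with an `aliases` list are assumptions for illustration; a real layer would also check secondary attributes before accepting a match.

```python
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "plc"}


def normalise(name: str) -> str:
    """Lower-case, drop punctuation and legal suffixes."""
    words = name.lower().replace(",", " ").replace(".", " ").split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def resolve(name, canonical, threshold=0.85):
    """Return the canonical id whose name or alias best matches, or None."""
    target = normalise(name)
    best_id, best_score = None, 0.0
    for canonical_id, record in canonical.items():
        for candidate in [record["name"], *record["aliases"]]:
            score = SequenceMatcher(None, target, normalise(candidate)).ratio()
            if score > best_score:
                best_id, best_score = canonical_id, score
    return best_id if best_score >= threshold else None
```

Returning `None` below the threshold is the deliberate design choice here: an unresolved record can be reviewed later, whereas a wrong link quietly corrupts every traversal that crosses it.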

kgr-6-four-hop kg-expansion ↑ in bank

Which hop sequence reaches the investment managers, starting from a text chunk?

Why

The path crosses both datasets: PART_OF up to the Form, FILED across to the Company (traversed backward from the form), then OWNS_STOCK_IN to each Manager (backward from the company). PART_OF goes to a form not a company; NEXT links chunks, not chunk-to-manager; and SECTION points a form at its first chunk, not at companies.

  1. Chunk to Company via PART_OF, then Company to Manager via SECTION edges
  2. Chunk to Form (PART_OF), Form to Company (FILED), Company to Manager (OWNS_STOCK_IN)
  3. Chunk to Manager via NEXT, then Manager to Company via FILED edges
  4. Chunk to Form via SECTION, then Form to Manager via OWNS_STOCK_IN

Correct: Chunk to Form (PART_OF), Form to Company (FILED), Company to Manager (OWNS_STOCK_IN)

kgr-6-fulltext-vs-vector kg-expansion ↑ in bank

How does a full-text index differ from a vector index, and when would you use each?

A full-text index matches *strings* — exact, partial, and fuzzy keyword search with relevance scoring (e.g. find managers matching "Royal Bank"). A vector index matches *meaning* — embeddings whose semantics are close, even when the words differ. Use full-text when the query is a name or keyword whose spelling matters; use vector when you want conceptually similar content regardless of wording. They are complementary, and Neo4j adds a third — property indexes for exact lookup by a known id. Common wrong answer: "full-text is just a slower vector search" — they answer different questions (spelling vs meaning), not the same question at different speeds.

kgr-6-graph-to-text kg-expansion ↑ in bank

What is graph-to-text context generation, and why is it needed in a graph-RAG pipeline? Give an example sentence.

Graph-to-text context generation converts structured traversal results — relationship triples of subject, predicate, object — into natural-language sentences the LLM can read as context. For an OWNS_STOCK_IN result you might emit "Royal Bank of Canada owns 1,200,000 shares of NetApp." The reason is simple: the traversal produces precise, connected facts, but an LLM consumes text, so the sentence form is how those facts enter the prompt. It is the consume direction of the graph-LLM loop; Module 7 inverts it by having the LLM generate the query. Common wrong answer: passing raw rows or JSON — workable, but prose is what the model reasons over most reliably.
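
A minimal sketch of that conversion, assuming each traversal row arrives as a dictionary with `manager`, `shares`, and `company` keys (the key names are hypothetical, chosen for the example):

```python
def holding_to_sentence(row: dict) -> str:
    """Render one OWNS_STOCK_IN result as a sentence the LLM can read."""
    return (f"{row['manager']} owns {row['shares']:,} shares of "
            f"{row['company']}.")


rows = [{"manager": "Royal Bank of Canada", "shares": 1200000,
         "company": "NetApp"}]

# One sentence per row, joined into the context block for the prompt.
context = "\n".join(holding_to_sentence(r) for r in rows)
```

The `:,` format spec inserts the thousands separators, so the row above becomes the example sentence from the answer.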

kgr-6-investment-rels kg-expansion ↑ in bank

Which two relationships bridge the Form 10-K document data and the Form 13 investment data, and how is each created?

Two relationship types bridge the datasets. FILED connects Company to Form, created by matching on the shared cusip6 — `MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f)`. OWNS_STOCK_IN connects Manager to Company, created per CSV row by matching both endpoints on their keys and MERGE-ing the edge with shares, value, and quarter — and because it is MERGE'd on `reportCalendarOrQuarter`, the same manager-company pair can hold one edge per reporting period. FILED is the document bridge (10-K data); OWNS_STOCK_IN is the investment bridge (Form 13 data). Common wrong answer: a single edge per manager-company pair, which would overwrite quarter-by-quarter history.
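
Why the quarter belongs in the edge's identity can be shown with a dictionary standing in for the edge store. This is an analogy, not Neo4j's MERGE semantics in full; the ids and dates are placeholders.

```python
def merge_holding(edges, manager_cik, cusip6, quarter, shares, value,
                  key_on_quarter=True):
    """Upsert one holding; the dictionary key plays the role of the MERGE pattern."""
    if key_on_quarter:
        key = (manager_cik, cusip6, quarter)
    else:
        key = (manager_cik, cusip6)
    edges[key] = {"quarter": quarter, "shares": shares, "value": value}
    return edges


rows = [
    ("mgr-1", "AAA111", "2023-03-31", 100, 1000.0),
    ("mgr-1", "AAA111", "2023-06-30", 150, 1800.0),
    ("mgr-1", "AAA111", "2023-06-30", 150, 1800.0),  # same row loaded twice
]

per_quarter, per_pair = {}, {}
for row in rows:
    merge_holding(per_quarter, *row)
    merge_holding(per_pair, *row, key_on_quarter=False)
```

`per_quarter` ends with two edges (the duplicate row collapses, the two quarters do not), while `per_pair` ends with one edge holding only the latest quarter.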

kgr-6-multihop-compose kg-expansion ↑ in bank

Write a Cypher query that finds the top 5 managers by total NetApp investment value and then returns each one’s name with the text of the filing’s business section (Item 1). What lets you compose the two steps?

Compose with a WITH clause acting as a pipeline stage. First aggregate to get the top managers for the one company: `MATCH (mgr:Manager)-[owns:OWNS_STOCK_IN]->(com:Company {cusip6: $cusip6}) WITH mgr, com, sum(owns.value) AS total ORDER BY total DESC LIMIT 5`, where `$cusip6` is a query parameter holding NetApp's CUSIP 6 (without that filter the sum runs over every company a manager holds). Then continue the traversal from that company down to document content: `MATCH (com)-[:FILED]->(f:Form)-[:SECTION {f10kItem: "item1"}]->(c:Chunk) RETURN mgr.managerName, total, c.text`. The WITH clause carries the top-5 result into the second match, the way a SQL subquery would, but reads as a sequential narrative. Common wrong answer: trying to filter the aggregate in a WHERE before WITH. Aggregation results must pass through WITH first.

kgr-6-node-schema kg-expansion ↑ in bank

Design the Company and Manager node schemas for the Form 13 data: which identifier keys each node, and why those choices?

Two node types, each keyed by a stable identifier. Company: keyed by `cusip6` (with companyName, cusip), because CUSIP is the universal id that also appears on the Form nodes — making it the cross-dataset link. Manager: keyed by `managerCik` (with managerName, managerAddress), because the CIK uniquely identifies a filing firm. Both are created with MERGE on those keys for idempotency. The design point: choose the key that is both stable *and* shared with the data you intend to link to — that is why Company uses CUSIP (shared with forms) rather than companyName. Common wrong answer: keying Company on companyName, which is an unstable, non-unique string.

kgr-6-owns-properties kg-expansion ↑ in bank

What does the OWNS_STOCK_IN relationship carry as properties?

Why

The holding’s facts live on the edge: shares, value, and reportCalendarOrQuarter (the quarter is part of its identity, since a manager can hold the same stock across periods). Names and addresses belong on the Manager node; the linking ids key the nodes, not the edge; and embeddings sit on Chunk nodes, not investment edges.

  1. The company name and the manager's mailing address
  2. The CUSIP and CIK linking identifiers for both ends
  3. Share count, monetary value, and the reporting quarter
  4. The vector embedding of the holding's description

Correct: Share count, monetary value, and the reporting quarter

kgr-6-question-alignment kg-expansion ↑ in bank

The investment-enhanced chain dramatically improved the answer to “Who are NetApp’s investors?” but not to “Tell me about NetApp.” Explain why, and state the design lesson.

Because graph context helps only when the question needs it. "Who are NetApp's investors?" is answered by the OWNS_STOCK_IN data, so the investment chain returns specific names and share counts while the plain chain is vague. "Tell me about NetApp" is a general business question the investor data does not address, so the LLM ignores the extra context and both chains answer about the same — except the investment chain paid extra latency and tokens to fetch context that went unused. The lesson: expand context to match the expected question distribution; unmatched expansion is noise, not value. Common wrong answer: "more context is always better" — irrelevant context is dead weight the model discards.

kgr-6-roi-calc kg-expansion ↑ in bank

An FAQ system is at 72% satisfaction (target 85%). Of its failures, 40% are product-mismatch and 35% are missing-comparison. If a graph schema cuts each of those modes by 60%, does it reach the target? Show the arithmetic.

Work in shares of all answers. The failure rate is 100 − 72 = 28 percent. Mode 1 is 28 × 0.40 = 11.2 percent of answers; Mode 2 is 28 × 0.35 = 9.8 percent. Cutting each by 60 percent removes 0.6 × 11.2 = 6.72 and 0.6 × 9.8 = 5.88 points, so the new failure rate is 28 − 6.72 − 5.88 = 15.4 percent and satisfaction rises to 84.6 percent. That is just short of the 85 percent target, so strictly the answer is no, though the gap is only 0.4 points. The ceiling worth noting: only the 75 percent of failures from these two modes is addressable. Even eliminating both modes entirely would leave 28 × 0.25 = 7 percent of answers failing, a ceiling of 93 percent, and the remaining 25 percent (other causes) needs different fixes. Common wrong answer: applying the 60 percent reduction to the whole 28 percent failure rate (0.6 × 28 = 16.8 points, giving 88.8 percent), which overcounts the gain.
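
The arithmetic is easy to check in code. The sketch below uses only the numbers from the question; the function name is made up for the example.

```python
def satisfaction_after_fix(satisfaction, mode_shares, reduction):
    """Satisfaction (percent) after cutting each failure mode by `reduction`.

    mode_shares are fractions of the *failures*, not of all answers.
    """
    failure = 100.0 - satisfaction
    removed = sum(failure * share * reduction for share in mode_shares)
    return satisfaction + removed


after = satisfaction_after_fix(72.0, [0.40, 0.35], 0.60)    # the stated scenario
ceiling = satisfaction_after_fix(72.0, [0.40, 0.35], 1.00)  # both modes eliminated
naive = satisfaction_after_fix(72.0, [1.00], 0.60)          # the common wrong answer
```

`after` comes to 84.6, `ceiling` to 93.0, and `naive` to 88.8, so the mistaken calculation overstates the gain by about four points.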

kgr-6-schema-trace kg-expansion ↑ in bank

A user asks “Which investment managers hold the company described in this filing, and how much?” Trace the end-to-end path through the schema, and say why pure vector search could not answer it.

Trace it across both datasets. From a relevant chunk: PART_OF up to its Form; FILED backward from the form to the Company that filed it; OWNS_STOCK_IN backward from the company to each Manager, reading the holding's value and shares; optionally sort by value and take the top holders. So `(:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager)`, then a graph-to-text sentence per manager for the LLM. This is exactly the question vector search alone cannot answer: similarity finds *text about* NetApp, but only the relationships reach the *investors*, which live in a second dataset linked by CUSIP. Common wrong answer: expecting vector search to surface investors — they are not in the chunk text at all.

kgr-6-search-paradigms kg-expansion ↑ in bank

Neo4j offers three search paradigms in one engine. Which mapping is correct?

Why

Vector indexes match meaning (semantic similarity over embeddings), full-text indexes match spelling (exact/partial/fuzzy keyword search with relevance scores), and property indexes do exact lookup by a known identifier. The distractors swap these roles — e.g. vector does not do keyword matching, and full-text does not index embeddings.

  1. Vector for meaning, full-text for keywords, property for exact lookup
  2. Vector for keywords, full-text for meaning, property for fuzzy match
  3. Full-text for embeddings, vector for strings, property for relevance
  4. Property for meaning, vector for exact lookup, full-text for ids

Correct: Vector for meaning, full-text for keywords, property for exact lookup

kgr-6-triple-to-sentence kg-expansion ↑ in bank

Why convert graph-traversal results into natural-language sentences before passing them to the LLM?

Why

An LLM consumes text, so a relationship triple (manager–owns–company) is turned into a readable sentence the model can reason over as context — that is how structured graph facts enter a RAG prompt. It is not about storage compression, encryption, or building another embedding index.

  1. To compress the result set and save storage space in the graph
  2. To encrypt sensitive investment values before they are displayed
  3. So the LLM can read the relationship facts as context it reasons over
  4. To generate fresh embeddings for a second vector index

Correct: So the LLM can read the relationship facts as context it reasons over

Chapter 7

kgr-7-address-node graph-rag-chat ↑ in bank

What properties does an Address node carry, and how does Neo4j store geospatial data?

An Address node holds city, state, and a geospatial `location` property: a latitude/longitude point created with `point({latitude: x, longitude: y})`. Neo4j stores geospatial data as this native point type, which `point.distance()` consumes to compute distances in metres. Address nodes connect to Company and Manager nodes via LOCATED_AT, so the graph can answer "what's near X?" questions. They are themselves an instance of Extract-Enhance-Expand: extract from CSV, enhance with the point, expand with LOCATED_AT. Common wrong answer: storing latitude and longitude as two plain numbers. They belong in a single point value so point.distance() and a point index can use them.

kgr-7-chain-steps graph-rag-chat ↑ in bank

What is the correct order of the GraphCypherQAChain pipeline?

Why

The chain runs question → LLM generates Cypher → Neo4j executes it → LLM synthesises a natural-language answer from the rows. Generation must precede execution; the chain uses LLM-generated Cypher (not a vector-search pipeline), and it does not produce embeddings.

  1. Execute Cypher, generate Cypher, ask the question, answer
  2. Question, generate Cypher, execute it, synthesise the answer
  3. Question, vector search, rank chunks, answer
  4. Generate embeddings, search, traverse, then generate Cypher

Correct: Question, generate Cypher, execute it, synthesise the answer

kgr-7-cypher-chain-pipeline graph-rag-chat ↑ in bank

Describe the four steps of the GraphCypherQAChain pipeline, in order, and one thing it needs to generate reliable Cypher.

Four steps. (1) Question — the user asks in natural language. (2) Generate — the LLM writes Cypher from the injected schema and few-shot examples. (3) Execute — Neo4j runs the generated query. (4) Answer — the LLM synthesises the returned rows into a natural-language reply. The chain (GraphCypherQAChain) needs the schema and a good few-shot prompt for reliable generation, and should validate the query before executing in production. Common wrong answer: putting execution before generation, or assuming it is vector retrieval — this pipeline generates and runs Cypher, it does not embed and search.
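
The order of the steps, with validation placed before execution, can be sketched with plain callables. This is not LangChain's GraphCypherQAChain API; every function below is a hypothetical stand-in so the control flow can be read (and run) without an LLM or a database.

```python
def run_cypher_qa(question, generate_cypher, validate, execute, synthesise):
    """Run the steps in order; every callable is a caller-supplied stand-in."""
    cypher = generate_cypher(question)     # step 2: the LLM writes Cypher
    problems = validate(cypher)            # guard before touching the database
    if problems:
        raise ValueError(f"refusing to run generated Cypher: {problems}")
    rows = execute(cypher)                 # step 3: the database runs it
    return synthesise(question, rows)      # step 4: the LLM turns rows into prose


# Stub wiring, purely for illustration.
calls = []


def fake_generate(question):
    calls.append("generate")
    return "MATCH (m:Manager) RETURN m.managerName LIMIT 1"


def fake_validate(cypher):
    calls.append("validate")
    return []  # an empty list means no problems were found


def fake_execute(cypher):
    calls.append("execute")
    return [{"m.managerName": "Example Manager"}]


def fake_synthesise(question, rows):
    calls.append("answer")
    return f"Found {len(rows)} manager(s)."


answer = run_cypher_qa("Name one manager.", fake_generate, fake_validate,
                       fake_execute, fake_synthesise)
```

Because validation sits between generation and execution, a query that fails the check never reaches the database.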

kgr-7-distance-units graph-rag-chat ↑ in bank

To find entities within 25 km, what value do you compare point.distance() against?

Why

point.distance() returns metres, so a 25 km radius is written as < 25000. The off-by-1000 bug (writing 25, as if the unit were km, instead of 25000) is the most common geospatial mistake; the function never returns kilometres or centimetres.

  1. 25, since the function returns kilometres
  2. 0.025, converting the radius to a fraction
  3. 2500, since the function returns centimetres
  4. 25000, since the function returns metres

Correct: 25000, since the function returns metres
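
The unit trap is easy to reproduce outside Neo4j. The sketch below uses a haversine great-circle distance as a stand-in for `point.distance()`; that is an approximation chosen for the example, not Neo4j's implementation, and the coordinates are synthetic points on the equator.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_km(d_m, radius_km):
    """Compare a distance in metres against a radius given in kilometres."""
    return d_m < radius_km * 1000


d = distance_m(0.0, 0.0, 0.0, 0.09)  # two points roughly 10 km apart
correct = within_km(d, 25)           # compares against 25000
buggy = d < 25                       # the off-by-1000 bug: 25 metres
```

`correct` is `True` and `buggy` is `False` for the same pair of points, which is exactly how the bug shows up in practice: the query runs but returns almost nothing.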

kgr-7-eee-phases graph-rag-chat ↑ in bank

Describe the three phases of the Extract-Enhance-Expand pattern, with one course example of each.

Three phases, repeated for every data source. Extract — pull source data into nodes (split text into Chunks; build Company, Manager, Address nodes from CSV). Enhance — add indexes or computed properties (vector embeddings on chunks, a full-text index on manager names, geospatial points on addresses). Expand — connect the new nodes into the existing graph with relationships (NEXT and PART_OF for chunks, FILED and OWNS_STOCK_IN for the investment data, LOCATED_AT for addresses). It is the repeatable framework the whole course used to grow the graph incrementally. Common wrong answer: collapsing Enhance into Extract — adding the index is a distinct phase, and it is what makes the nodes searchable.

kgr-7-eee-stage graph-rag-chat ↑ in bank

Creating a full-text index on manager names is which phase of Extract-Enhance-Expand?

Why

Adding an index (or computed property) to nodes that already exist is the Enhance phase — like vector embeddings or geospatial points. Extract creates the nodes, Expand connects them with relationships, and validation isn’t one of the three phases.

  1. Enhance — it adds an index to existing nodes
  2. Extract — it pulls new nodes from the source data
  3. Expand — it connects nodes with a new relationship
  4. A separate validation phase outside the pattern

Correct: Enhance — it adds an index to existing nodes

kgr-7-fewshot-design graph-rag-chat ↑ in bank

Design a 3-example few-shot prompt that teaches an LLM to answer single-hop, multi-hop, and aggregation questions over a graph. What principle governs the choice of examples?

Inject the schema, instruct "use only these types; do not hallucinate", then give three examples that each teach a *different* query shape. For a university graph: (1) single-hop with a filter — "Who teaches Machine Learning?" maps to matching a Professor TEACHES a Course with that title; (2) multi-hop — "What courses do Dr. Smith's students take?" maps to Professor ADVISES Student ENROLLED_IN Course; (3) aggregation — "Which departments have the most courses?" maps to Course BELONGS_TO Department with count and ORDER BY. The design rule: one example per capability (filter, traverse, aggregate), because the LLM generalises the *shape*, not the specific question. Common wrong answer: three near-identical filter examples — they teach one pattern and waste the few-shot budget.
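
One way such a prompt could be assembled is sketched below. The schema text and the three Cypher examples are illustrative assumptions for the university graph (in particular the direction of ADVISES and ENROLLED_IN and the `name`/`title` properties); the point is the structure of schema, instruction, and one example per query shape.

```python
SCHEMA = """Node labels: Professor, Student, Course, Department
Relationship types:
(:Professor)-[:TEACHES]->(:Course)
(:Professor)-[:ADVISES]->(:Student)
(:Student)-[:ENROLLED_IN]->(:Course)
(:Course)-[:BELONGS_TO]->(:Department)"""

# One example per capability: filter, traverse, aggregate.
EXAMPLES = [
    ("Who teaches Machine Learning?",
     'MATCH (p:Professor)-[:TEACHES]->(:Course {title: "Machine Learning"}) '
     'RETURN p.name'),
    ("What courses do Dr. Smith's students take?",
     'MATCH (:Professor {name: "Dr. Smith"})-[:ADVISES]->(:Student)'
     '-[:ENROLLED_IN]->(c:Course) RETURN DISTINCT c.title'),
    ("Which departments have the most courses?",
     'MATCH (c:Course)-[:BELONGS_TO]->(d:Department) '
     'RETURN d.name, count(c) AS courses ORDER BY courses DESC'),
]


def build_prompt(question: str) -> str:
    """Assemble instruction, schema, few-shot examples, and the new question."""
    shots = "\n\n".join(f"Question: {q}\nCypher: {c}" for q, c in EXAMPLES)
    return (
        "Generate a Cypher query for the question below.\n"
        "Use only the node labels and relationship types in the schema; "
        "do not hallucinate.\n\n"
        f"Schema:\n{SCHEMA}\n\n"
        f"Examples:\n{shots}\n\n"
        f"Question: {question}\nCypher:"
    )
```

Ending the prompt on `Cypher:` leaves the model to complete just the query, in the same format the examples used.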

kgr-7-fewshot-diversity graph-rag-chat ↑ in bank

Is it better to give a Cypher-generation prompt ten similar examples or three diverse ones? Explain the trade-off.

Prefer a few diverse examples. Two or three that span *different* query shapes — a property filter, an aggregation, a multi-hop traversal — teach the LLM more than ten variations of one shape, because the model generalises patterns and redundant examples add no new pattern. Past two or three diverse examples, returns diminish quickly while prompt length (and cost) grows. So the selection rule is coverage of distinct shapes, not quantity. Common wrong answer: "more examples always help" — beyond covering the distinct shapes, extra similar examples mostly add tokens, not capability.

kgr-7-fewshot-why graph-rag-chat ↑ in bank

Why inject the graph schema and a “do not hallucinate” instruction into a Cypher-generation prompt?

Why

The schema plus the instruction constrain the LLM to real relationship types and properties, preventing schema hallucination — Cypher that looks valid but references things the graph doesn’t have. It doesn’t change execution speed, doesn’t remove the need for examples, and doesn’t bypass running the query.

  1. To stop the LLM inventing relationship types that aren't in the graph
  2. To make the generated Cypher run faster at query time
  3. To reduce the number of few-shot examples needed to zero
  4. To let the LLM skip executing the query against Neo4j

Correct: To stop the LLM inventing relationship types that aren't in the graph

kgr-7-four-paradigms graph-rag-chat ↑ in bank

Name the four retrieval paradigms the finished knowledge graph supports, with a question type suited to each.

Four, each suited to a different question. Vector search — semantic similarity for conceptual questions ("tell me about cloud storage"). Full-text search — keyword/string lookup for entities ("find Royal Bank of Canada"). Graph traversal — relationship following for structural questions ("who are NetApp's investors?"). Geospatial search — distance with point.distance() for location questions ("companies within 10 km of San Jose"). No single paradigm answers every question; the finished graph's power is holding all four in one engine, and complex questions compose them. Common wrong answer: treating vector search as a catch-all — it finds meaning, not exact names, relationships, or distances.

kgr-7-geospatial-query graph-rag-chat ↑ in bank

Write a Cypher query that finds companies within 20 km of San Jose, sorted nearest-first. What unit trap must you avoid, and what index keeps it fast?

Match the reference location, match candidate companies through LOCATED_AT, filter by distance, and sort. For example: `MATCH (sj:Address {city: "San Jose"}) MATCH (com:Company)-[:LOCATED_AT]->(a:Address) WHERE point.distance(sj.location, a.location) < 20000 RETURN com.companyName, point.distance(sj.location, a.location)/1000 AS km ORDER BY km ASC`. The threshold is 20000 because `point.distance()` returns metres, and 20 km is 20000 m. For performance, a point index on the location property turns the scan into an index range query. Common wrong answer: writing `< 20`, which asks for entities within 20 metres and returns almost nothing.

kgr-7-located-at graph-rag-chat ↑ in bank

What role does the LOCATED_AT relationship play, and how is it used in a query that finds managers near a company?

LOCATED_AT connects a Company or Manager to its Address node, so any geographic question first traverses it to reach coordinates. To find managers near a company: find the company with a full-text search or an exact match, traverse `(:Company)-[:LOCATED_AT]->(:Address)` for its location, traverse `(:Manager)-[:LOCATED_AT]->(:Address)` for candidate locations, then filter with `point.distance(companyAddr.location, mgrAddr.location) < radius`. Without LOCATED_AT the entities have no location to compare, so distance queries are impossible; it is the Expand step that makes geography first-class. Common wrong answer: putting coordinates directly on the Company node. Modelling Address as its own node lets multiple entities share a location and keeps the point in one indexed place.

kgr-7-paradigm-match graph-rag-chat ↑ in bank

“Find investment firms within 10 km of Apple’s headquarters.” Which retrieval approach does this question need?

Why

It composes paradigms: full-text (or exact lookup) to resolve “Apple” to its node and address, then a geospatial point.distance() filter for firms within 10 km. No single paradigm suffices — vector handles meaning not distance, traversal alone can’t compute proximity, and full-text alone can’t filter by radius.

  1. Vector search alone, for semantic similarity
  2. Graph traversal alone, following relationships
  3. Full-text search alone, to match the firm names
  4. Full-text to find Apple, then geospatial distance to nearby firms

Correct: Full-text to find Apple, then geospatial distance to nearby firms

kgr-7-paradigm-select graph-rag-chat ↑ in bank

For each question, name the retrieval paradigm(s) and why: (a) “What are the risk factors for cloud-computing companies?” (b) “Find Goldman Sachs.” (c) “Which investors are within 50 miles of NetApp?”

Decompose each into paradigm-matched steps. (a) "Risk factors for cloud-computing companies": vector search for chunks about risk factors, then graph traversal (PART_OF to Form to Company) to confirm they belong to cloud companies. (b) "Find Goldman Sachs": full-text search; it is an entity lookup by name, and vector search would wrongly surface semantically similar firms such as Morgan Stanley. (c) "Investors within 50 miles of NetApp": full-text to find NetApp, LOCATED_AT to its address, geospatial point.distance under 80467 m (50 miles converted to metres) for nearby managers, then OWNS_STOCK_IN to confirm they invest. The skill is reading a question as a sequence of retrieval steps. Common wrong answer: forcing one paradigm. Most real questions combine two or more.
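
The 80467 m figure in (c) is the mile-to-metre conversion, which can be checked directly. The helper name is made up for the example; the conversion factor is the standard definition of the international mile.

```python
METRES_PER_MILE = 1609.344  # exact, by definition of the international mile


def miles_to_metres(miles: float) -> float:
    """Convert a radius in miles to the metres that point.distance() returns."""
    return miles * METRES_PER_MILE


radius_m = miles_to_metres(50)
```

`radius_m` is 80467.2, so a threshold of 80467 is the 50-mile radius to the nearest metre.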

kgr-7-prevent-hallucination graph-rag-chat ↑ in bank

An LLM generates Cypher referencing a relationship type that doesn’t exist in your graph. Which fix most directly addresses this?

Why

Schema hallucination is fixed by grounding the model in the real schema and checking its output: inject the actual node/relationship types, instruct “use only these”, and validate before executing. Raising temperature makes invention more likely, removing the schema removes the guardrail, and the search paradigm is unrelated to the generation step.

  1. Raise the LLM's temperature for more creative queries
  2. Remove the schema from the prompt to simplify it
  3. Inject the real schema and validate the query before running it
  4. Switch the retrieval from full-text search to vector search

Correct: Inject the real schema and validate the query before running it

kgr-7-progressive-fewshot graph-rag-chat ↑ in bank

Why does adding one “What does company X do?” example to the few-shot prompt unlock document navigation for many companies, not just X?

Because the example demonstrates a reusable query *shape*, not a one-off answer. "What does X do?" shows full-text find the company, then traverse FILED to its Form and SECTION to the Item-1 chunk, and return that text. The LLM generalises this shape to any company name, so the single example unlocks document navigation broadly. That is progressive few-shot: each added example teaches one new capability (city filter, then geospatial distance, then document navigation) the model applies to unseen questions. Common wrong answer: assuming it only answers about that one company — the model learns the pattern, not the instance.

kgr-7-schema-hallucination graph-rag-chat ↑ in bank

What is schema hallucination in LLM-generated Cypher, how would you detect it, and how do you prevent it?

Schema hallucination is when an LLM generates Cypher that references relationship types or properties the graph does not have — syntactically valid, plausible-looking, but invalid against the real schema (so it errors or returns nothing). Catch it by reading the generated query against the schema before trusting the answer; with verbose mode on, inspect the Cypher the chain produced. Prevent it by injecting the actual schema, instructing "use only these types; do not hallucinate", giving few-shot examples that use the correct relationships, and validating the query before executing in production. Common wrong answer: "raise the temperature for better queries" — higher temperature increases invention, making hallucination worse.
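
A pre-execution check for invented relationship types can be sketched with a regular expression. This is a deliberately rough heuristic, not a Cypher parser: it only looks at relationship types inside square brackets, ignores labels and properties, and could misread unusual syntax such as list slices with variable bounds. The allowed set is the relationship list from this guide's schema.

```python
import re

ALLOWED_RELATIONSHIPS = {"NEXT", "PART_OF", "SECTION", "FILED",
                         "OWNS_STOCK_IN", "LOCATED_AT"}

# Matches [:TYPE], [r:TYPE], [:TYPE {..}], [:TYPE*0..1] and [:A|B] forms.
REL_PATTERN = re.compile(
    r"\[\s*\w*\s*:\s*([A-Za-z_]\w*(?:\s*\|\s*:?\s*[A-Za-z_]\w*)*)")


def unknown_relationship_types(cypher, allowed=ALLOWED_RELATIONSHIPS):
    """Return relationship types used in the query that the schema lacks."""
    found = set()
    for group in REL_PATTERN.findall(cypher):
        for name in re.split(r"\s*\|\s*:?\s*", group):
            found.add(name)
    return sorted(found - set(allowed))
```

An empty result means every relationship type in the query exists in the schema; anything returned is a candidate hallucination to reject or send back to the model before the query runs.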