Practice questions
108 questions across 7 domains. Try each before revealing the answer.
Take a scored practice exam: a random form sampled from the bank below, with a per-domain score readout.
KG Fundamentals
Give two real-world applications of knowledge graphs outside RAG, and state what the graph encodes in each.
Show answer
Two of: Web search — knowledge cards (a celebrity's birth date, films, related people) are graph look-ups that aggregate a node's properties and its relationships into a summary panel; the graph encodes entities (people, companies, places) linked by typed relationships. E-commerce / recommendations — "customers who bought X also bought Y" traverses purchase relationships, and multi-hop preferences fall out of the graph; it encodes users, products, and purchase/view edges. Financial analysis — investments, corporate structures, and filings form a natural graph of companies, managers, and ownership relationships (the SEC dataset this guide uses). In each, the graph encodes *entities and the typed relationships between them*. Common wrong answer: "knowledge graphs are only useful for RAG" — they predate RAG and power search, recommendations, and fraud detection.
Write the Cypher text notation for: Professor “Dr. Smith” teaches “Machine Learning 101”, which belongs to the “CS” department. Name the labels and relationship types you used.
Show answer
Write nodes in parentheses (label after a colon, properties in braces), relationships in square brackets with a typed arrow. For "Dr. Smith teaches Machine Learning 101, which belongs to the CS department": `(smith:Professor {name: "Dr. Smith"})-[:TEACHES]->(ml101:Course {title: "Machine Learning 101"})-[:BELONGS_TO]->(cs:Department {name: "CS"})`. Node labels: `Professor`, `Course`, `Department`; relationship types: `TEACHES`, `BELONGS_TO` (both directed). Properties like `name`/`title` sit inside the node braces; a relationship could carry its own property too (e.g. `[:TEACHES {term: "Fall"}]`). Common wrong answer: putting the relationship type in parentheses or omitting the arrow direction — Cypher needs `-[:TYPE]->` for a directed, typed relationship.
Which Cypher pattern correctly expresses “a Person ACTED_IN a Movie”?
Why
Cypher puts nodes in parentheses and relationships in square brackets with a typed
arrow: (p:Person)-[:ACTED_IN]->(m:Movie). The variant that swaps the brackets (nodes in [],
relationship in ()) inverts the notation; the =>{...}=> form isn’t Cypher syntax; and
wrapping the relationship type in plain parentheses omits the square brackets that mark a
relationship. The bracket convention is what makes the pattern read like the graph it matches.
Show answer
Correct: (p:Person)-[:ACTED_IN]->(m:Movie)
Define the four core components of a knowledge graph (node, relationship, property, label) and give one example of each from a movie dataset.
Show answer
Node — a data record representing an entity, e.g. a `Movie` node (graph-theory: vertex). Relationship — a directed, typed connection between two nodes, e.g. `ACTED_IN` from a Person to a Movie (graph-theory: edge, but richer). Property — a key-value pair on a node or relationship, e.g. a Movie's `title` or `released` year, or a relationship's `since`. Label — a tag grouping similar nodes, e.g. `Person` or `Movie`. Together: labelled nodes carry properties and are joined by typed, directed relationships that also carry properties. Common wrong answer: "a relationship is just an edge with no extra information" — relationships carry a type, direction, and their own properties, which is the whole reason the term is preferred.
Model a small knowledge graph for an e-commerce “customers who bought X also bought Y” recommender: name the node labels, the relationship type(s), and the traversal that produces a recommendation.
Show answer
Nodes (labels + properties): `Customer` (name, id), `Product` (title, category, price). One relationship type: `(:Customer)-[:PURCHASED {date}]->(:Product)`. The recommendation "customers who bought X also bought Y" is a multi-hop traversal: from product X, go *back* along `PURCHASED` to the customers who bought it, then *forward* along their other `PURCHASED` relationships to the products Y they also bought, ranked by frequency — i.e. `(x:Product)<-[:PURCHASED]-(c:Customer)-[:PURCHASED]->(y:Product)`. No separate "co-purchase" table is needed; the pattern is the recommendation. You could enrich it with a `VIEWED` relationship or `Category` nodes for content-based fallbacks. Common wrong answer: precomputing a co-purchase table — that's the relational workaround the graph traversal replaces, and it goes stale as purchases arrive.
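The two-hop traversal above can be mimicked in plain Python to see why no co-purchase table is needed. This is a minimal sketch over an in-memory list of purchase pairs; the sample data and the names `purchases` and `also_bought` are illustrative assumptions, not part of Neo4j or any library.

```python
from collections import Counter, defaultdict

# Hypothetical PURCHASED relationships as (customer, product) pairs.
purchases = [
    ("ann", "laptop"), ("ann", "mouse"), ("ann", "dock"),
    ("bob", "laptop"), ("bob", "mouse"),
    ("cat", "laptop"), ("cat", "bag"),
    ("dan", "phone"), ("dan", "bag"),
]

def also_bought(purchases, product_x):
    """Rank products Y reached by X <-PURCHASED- customer -PURCHASED-> Y."""
    bought_by = defaultdict(set)   # product -> customers (backward hop)
    basket = defaultdict(set)      # customer -> products (forward hop)
    for customer, product in purchases:
        bought_by[product].add(customer)
        basket[customer].add(product)

    counts = Counter()
    for customer in bought_by[product_x]:
        for product_y in basket[customer]:
            if product_y != product_x:
                counts[product_y] += 1
    # Sort by frequency, then by name, so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(also_bought(purchases, "laptop"))
```

Because the ranking is recomputed from the purchase pairs on every call, a new purchase changes the next recommendation without any table to refresh.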
What distinguishes a knowledge-graph “relationship” from a graph-theory “edge”?
Why
A relationship carries a type, a direction, and its own properties — it’s a data
record, not a bare connection, which is exactly why the KG vocabulary prefers the term. It does
not require both endpoints to share a label (relationships routinely connect different labels,
like Person→Movie). Relationships are directed (not undirected), and the difference is
substantive, not just a rebrand — the richness is what enables traversal-based retrieval.
Show answer
Correct: A relationship is a rich record with a type, direction, and its own properties
Explain the “identity through relationships” pattern and give the case where it breaks down.
Show answer
"Identity through relationships" means a node's role is encoded by the typed relationships it participates in, not by extra labels or a type field. A `Person` is an actor because it has an `ACTED_IN` relationship, a director because it has `DIRECTED` — the same node, two roles, no `Actor`/`Director` labels. This keeps the schema clean and avoids "label explosion" (a new label for every role). Where it breaks: when a role needs role-specific properties — an actor's per-film salary vs a director's budget — those must then live on the relationship, which can get unwieldy at scale, so sometimes a role does warrant its own modelling. Common wrong answer: "add an Actor label and a Director label" — that duplicates information the relationships already encode and multiplies labels.
A Person is both an actor and a director. Following “identity through relationships,” how do
you model this?
Why
The role comes from the relationships: one Person node that has both an ACTED_IN and a
DIRECTED relationship is, by that fact, both an actor and a director — no role labels needed.
Adding Actor/Director labels invites “label explosion”; splitting into two nodes
duplicates the same person; and a roles property throws away the connections (you’d lose
which movie they acted in or directed) that the relationships encode.
Show answer
Correct: Keep one Person node; its `ACTED_IN` and `DIRECTED` relationships define both roles
Why does a knowledge graph enhance RAG beyond vector similarity search?
Why
The graph adds a retrieval mode embeddings lack: traversal of typed relationships, which reaches entities connected to a result (a company’s investors, a chunk’s neighbouring sections) rather than merely similar text. It isn’t about compressing embeddings, it doesn’t swap semantic search for keyword matching, and it doesn’t remove the LLM — generation still happens; the graph just supplies richer, connection-aware context.
Show answer
Correct: It lets retrieval traverse typed relationships to reach connected entities
Which workload most favours a knowledge graph over a relational database?
Why
The graph wins on relationship complexity — multi-hop questions over interconnected entities, where each hop would be another SQL self-JOIN but is a cheap traversal in the graph. Tabular sales data with aggregations and primary-key lookups fits a relational database perfectly, and “semantically similar documents” is a vector-search task — neither needs graph traversal. Choose the KG for connection-heavy questions, not merely because relationships exist.
Show answer
Correct: Multi-hop questions over highly interconnected entities (who-supplies-whom, many hops)
Give a scenario where you would choose a knowledge graph over a relational database, and justify it on schema flexibility, relationship complexity, and query pattern.
Show answer
Decide on relationship complexity, not data size. For a fraud-detection system over accounts, transactions, devices, and shared addresses — where the valuable questions are multi-hop ("which accounts are linked through a shared device to a known-fraud account, within 3 hops?") — a knowledge graph wins: each hop is a cheap traversal, where the relational version is a pile of self-JOINs that explode with depth. Schema flexibility helps too (new link types appear as fraud evolves), and native vector+graph lets you combine similarity with traversal. Conversely, if the data were a flat ledger queried by account id with no multi-hop questions, a relational database would be simpler. Justify on: relationship complexity (multi-hop → graph), schema flexibility (evolving link types → graph), query pattern (traversal vs lookup). Common wrong answer: "it has 50M rows, so use a relational DB for scale" — size isn't the axis; relationship complexity is.
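The "within 3 hops" question can be sketched as a depth-limited breadth-first search. The example below is a toy in-memory version, assuming a hypothetical set of account/device/address links; the node names and the helper `within_hops` are made up for illustration and say nothing about how Neo4j executes a traversal internally.

```python
from collections import deque

# Hypothetical undirected links: accounts sharing devices or addresses.
links = [
    ("acct:A", "device:1"), ("device:1", "acct:B"),
    ("acct:B", "addr:9"), ("addr:9", "acct:C"),
    ("acct:C", "device:2"), ("device:2", "acct:D"),
]

def within_hops(links, start, max_hops):
    """Return {node: hop distance} for nodes reachable in <= max_hops."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]  # the starting node is not its own neighbour
    return dist

# Treat acct:A as the known-fraud account and look 3 hops out.
reachable = within_hops(links, "acct:A", 3)
linked_accounts = sorted(n for n in reachable if n.startswith("acct:"))
print(linked_accounts)
```

Each extra hop is one more pass of the same loop, whereas the relational version would need one more self-JOIN per hop.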
Explain why a knowledge graph enhances RAG beyond vector similarity search, using a “who invested in NetApp?”-style question to make the point concrete.
Show answer
Vector search retrieves text that is *semantically similar* to the query — great for "find me passages about NetApp's business." But the question "who invested in NetApp?" needs to *follow connections*: from the NetApp chunk to its filing, to the companies that filed ownership, to the investment firms. Embeddings can't traverse those typed relationships, so a similar-text match never reaches the investors. A knowledge graph adds traversal-based retrieval: retrieve a chunk by vector search, then walk relationships to expand context with connected entities. The two are complementary — similarity for "what's like this," traversal for "what's connected to this." Common wrong answer: "vector search can find the investors if the embeddings are good enough" — no embedding encodes a multi-hop ownership chain; that requires traversal.
Querying & Cypher
Which pattern finds Tom Hanks’s co-actors — people who acted in the same movie?
Why
Both actors point into the shared movie, so the arrows must converge on m: forward from
Tom (-[:ACTED_IN]->) and backward from the co-actor (<-[:ACTED_IN]-). The variant with both
arrows pointing the same way would require the movie to itself act in a movie; the one starting
(tom)<-[:ACTED_IN]-(m) says a movie acted in Tom (wrong direction); and the single-hop
(tom)-[:ACTED_IN]-(coActor) skips the shared movie entirely. The convergent two-hop through
the movie is the co-actor pattern.
Show answer
Correct: (tom)-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor)
Explain how LangChain’s Neo4jGraph class connects a Python application to Neo4j: the three parameters it needs and the method used to run Cypher.
Show answer
LangChain's `Neo4jGraph` class wraps a Neo4j driver. You construct it with three parameters — a connection `url` (the Neo4j URI), a `username`, and a `password` — loaded from environment variables or a `.env` file, never hard-coded. It then exposes a `query()` method: pass a Cypher string (optionally with a `params` dict for parameterised queries) and it returns the result as a list of Python dictionaries, handling connection pooling and authentication for you. That's the bridge from Python application code to the graph. Common wrong answer: "pass the credentials inline in the query string" — credentials belong in the connection (from the environment), not in Cypher.
A bulk-load pipeline retries a batch and re-runs a relationship insert. Which clause is correct, and why?
Why
MERGE checks for the relationship and creates it only if it’s missing, so re-processing the
same batch leaves the graph unchanged — the idempotency a retry-prone pipeline needs. CREATE
does skip the existence check, but that’s exactly why it duplicates on every retry; LIMIT
bounds returned rows, not writes, so it doesn’t prevent duplicate creation; and
delete-then-create is wasteful and races with concurrent readers. MERGE is the production
default for relationships.
Show answer
Correct: MERGE — it creates the relationship only if absent, so retries don't duplicate
You run a plain DELETE on a Person node that still has an ACTED_IN relationship. What
happens, and what’s the fix?
Why
A plain DELETE on a node that still has relationships fails — Neo4j won’t leave dangling
relationships, so it protects referential integrity. The fix is to delete the relationships
first, or use DETACH DELETE to remove the node and its relationships in one step. Neo4j does
not auto-remove the relationships, leave them orphaned, or soft-delete — it refuses the
operation outright.
Show answer
Correct: The DELETE fails; remove the relationships first, or use DETACH DELETE
You need to remove a Person node that still has relationships. Show how (and why a plain DELETE fails).
Show answer
A plain `DELETE` on a node that still has relationships fails, because Neo4j won't leave dangling relationships (referential integrity). Two fixes: delete the relationships first (`MATCH (p:Person {name: $name})-[r]-() DELETE r`, then delete the node), or — simpler — use `MATCH (p:Person {name: $name}) DETACH DELETE p`, which removes the node and all its relationships in one atomic step. Use DETACH DELETE whenever you're dropping a connected node. Common wrong answer: "DELETE removes the node and silently drops its edges" — it doesn't; it refuses the operation rather than orphan relationships.
What is the order of the four stages a Cypher query passes through?
Why
Parse → plan → execute → project. The string is first parsed into an AST; the optimiser
then plans (choosing indexes and scan order) before anything runs; the engine executes the
plan by traversing the graph; finally RETURN projects the requested fields. You can’t plan
before parsing, can’t execute before planning, and projection is last — it shapes the output of
a completed match. (EXPLAIN stops after planning; PROFILE runs through execution.)
Show answer
Correct: parse → plan → execute → project
Write a Cypher query that returns the name and birth year of every Person born before 1960. Name which clause does the filtering and why it can’t go inside the node’s braces.
Show answer
`MATCH (p:Person) WHERE p.born < 1960 RETURN p.name, p.born`. The `MATCH` binds every `Person` node, the `WHERE` filters to those born before 1960, and `RETURN` projects just the name and birth year (dot notation) rather than the whole node. A range like `< 1960` requires `WHERE`; inline `{born: 1960}` would only match the exact value 1960. Common wrong answer: putting the comparison inside the node braces, e.g. `(p:Person {born < 1960})` — inline property matching does equality only, not ranges.
Compare CREATE and MERGE for a graph-construction pipeline that may retry batches. Explain why MERGE gives idempotency and the one thing that can still make MERGE duplicate.
Show answer
CREATE inserts unconditionally; MERGE checks whether the element already exists and creates it only if absent (MATCH-then-create in one atomic step). In a pipeline that may re-process a batch (API retries, replays, duplicate messages), CREATE produces duplicate nodes/relationships on each run (or errors against a uniqueness constraint), whereas MERGE leaves the graph unchanged on re-runs, which makes the write idempotent: a retried batch has the same effect as running it once. The full pattern: a uniqueness constraint on the key + MERGE on that stable identifier + `ON CREATE SET` (initial properties) and `ON MATCH SET` (volatile updates like a timestamp). Crucially, MERGE on a *volatile* property creates a new element each run, because the changed value never matches what is stored, so always MERGE on the stable id. Common wrong answer: "use CREATE and dedupe later". That lets duplicates into the graph and pushes the cost downstream; MERGE prevents them at write time.
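The difference shows up clearly in a toy model. The sketch below uses a hypothetical in-memory `ToyGraph` class (a stand-in written for this example, not a Neo4j or LangChain API) to replay the same batch twice with CREATE-like, MERGE-like, and MERGE-on-a-volatile-key behaviour.

```python
class ToyGraph:
    """In-memory stand-in for a node store; not a Neo4j API."""

    def __init__(self):
        self.nodes = []

    def create(self, **props):
        # CREATE-like: always inserts, even if an identical node exists.
        self.nodes.append(dict(props))

    def merge(self, key, on_create=None, on_match=None):
        # MERGE-like: match on the key properties, insert only if absent.
        for node in self.nodes:
            if all(node.get(k) == v for k, v in key.items()):
                node.update(on_match or {})
                return node
        node = {**key, **(on_create or {})}
        self.nodes.append(node)
        return node

batch = [{"id": "c1", "name": "NetApp"}, {"id": "c2", "name": "Acme"}]

created, merged, volatile = ToyGraph(), ToyGraph(), ToyGraph()
for run in (1, 2):  # the second pass simulates a retried batch
    for row in batch:
        created.create(**row)
        merged.merge({"id": row["id"]},
                     on_create={"name": row["name"]},
                     on_match={"last_seen_run": run})
        # Anti-pattern: the match key includes a value that changes per run.
        volatile.merge({"id": row["id"], "loaded_in_run": run})

print(len(created.nodes), len(merged.nodes), len(volatile.nodes))
```

Only the store that merges on the stable `id` ends the retry with one node per record; the other two double in size.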
Design a Cypher query that finds every actor who appeared in a movie directed by a given person. Explain the arrow directions and how you’d keep the result set bounded.
Show answer
Two relationship hops share the movie node: `MATCH (d:Person {name: $director})-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(actor:Person) RETURN actor.name, m.title`. The `DIRECTED` arrow points forward from the director into the movie; the `ACTED_IN` arrow points backward from the actor into the same movie, so the shared `Movie` is the join. To bound the result on a high-degree director, add a `LIMIT` or filter (e.g. by `released` year), since a director with many films each with many actors can return a large cross-product. Common wrong answer: pointing both arrows the same direction. The pattern then asks for an actor the *movie* acted in, which has no match.
Why should Cypher variable names be meaningful, given the engine runs the query identically regardless of the names?
Show answer
The engine ignores variable names — it runs `(n1)-[:ACTED_IN]->(n2)` exactly as it runs `(tom)-[:ACTED_IN]->(movie)` — so meaningful names cost nothing at runtime. Their entire value is for humans: a pattern named for what each variable holds (`tom`, `coActor`, `m` for Movie) reads as its own documentation, while `n1/n2/n3` force the reader to reconstruct the intent, and the cost compounds in multi-hop queries. Readable Cypher is a maintainability investment paid back every time someone (often you, months later) revisits the query. Common wrong answer: "meaningful names make the query run faster" — names have no effect on execution; the benefit is purely readability.
How does LangChain’s Neo4jGraph class let a Python application run Cypher?
Why
Neo4jGraph wraps a Neo4j connection built from a url, username, and password (loaded from the
environment) and exposes a query() method that sends a Cypher string and returns the result
as Python dictionaries. It does not transpile Cypher to SQL (it talks to Neo4j directly), does
not embed an in-memory graph (it connects to a real database), and does not author Cypher from
natural language — that’s Module 7’s LLM-generated-Cypher topic, a separate capability.
Show answer
Correct: It wraps a connection (url/user/password), exposing query() for Cypher
Rewrite this query for readability and explain what improved: MATCH (a)-[:ACTED_IN]->(b)<-[:DIRECTED]-(c) RETURN a.name, c.name.
Show answer
Rewrite `MATCH (a)-[:ACTED_IN]->(b)<-[:DIRECTED]-(c) RETURN a.name, c.name` with names that say what each variable is: `MATCH (actor:Person)-[:ACTED_IN]->(movie:Movie)<-[:DIRECTED]-(director:Person) RETURN actor.name, director.name`. The structure now reads as "an actor's movie, also directed by a director." On the movie dataset the results are the same and the intent is far clearer; adding the `Person`/`Movie` labels also lets the planner narrow the search, so they can improve the plan. Common wrong answer: leaving the single-letter variables and adding a comment instead. A comment drifts out of date, whereas self-describing variable names stay correct because they're part of the query.
Trace the execution of MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title LIMIT 5
through the four stages. Say what the engine does at each.
Show answer
For `MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title LIMIT 5`: Parse — the string becomes an AST capturing the pattern (a `Person` linked by `ACTED_IN` to a `Movie`) and the projection. Plan — the optimiser uses the `Person`/`Movie` labels (and any index) to choose a starting point and an expansion strategy. Execute — the engine finds `Person` nodes, traverses their outgoing `ACTED_IN` relationships to `Movie` nodes, binding `p` and `m`, and stops early once 5 matches satisfy `LIMIT`. Project — it returns only `p.name` and `m.title` for those 5 rows. The flow is pattern-spec → match → projection, with the plan deciding *how* the match is done. Common wrong answer: "it scans every Person then filters at the end" — the planner uses labels/indexes and the `LIMIT` to avoid scanning the whole graph.
Which Cypher query returns all movies released after 2000?
Why
A range comparison needs a WHERE clause: MATCH (m:Movie) WHERE m.released > 2000 RETURN m.
Inline {...} property matching only does equality, so {released > 2000} isn’t valid
syntax. Cypher has no RETURN ... IF construct, and the pattern clause is MATCH, not
FILTER. Use inline {} for equality and WHERE for ranges and boolean logic.
Show answer
Correct: MATCH (m:Movie) WHERE m.released > 2000 RETURN m
Preparing Text for RAG
What is the defining advantage of storing embeddings inside Neo4j rather than in a separate vector database?
Why
Co-location means embeddings live as node properties in the same store as the graph, so a single Cypher query can do vector search and traverse relationships — retrieve a chunk, then follow its connections — in one round trip, with one system to run. It doesn’t make embedding computation faster (that’s still the API), doesn’t remove the need for an embedding model, and doesn’t make the similarity math more accurate — the win is the combined query.
Show answer
Correct: One query can combine vector similarity with graph traversal in a single round trip
Your embeddings are 1536-dimensional (OpenAI) but the vector index was created with 768 dimensions. What happens, and why is cosine the recommended similarity function?
Why
The index dimension must equal the embedding dimension, so a 768-dim index can’t match 1536-dim vectors — the query fails with a dimension-mismatch error (there’s no automatic projection or padding). Cosine is recommended because it compares vector direction and normalises away magnitude, which is the right notion of similarity for embeddings (and the choice OpenAI documents). It isn’t a speed-vs-accuracy trade, and the index never silently changes its configured function.
Show answer
Correct: The query errors on the dimension mismatch; cosine is recommended because it normalises away magnitude
Why is cosine similarity recommended for OpenAI embeddings, what does a score of ~0.89 mean, and why must the index dimension match the model?
Show answer
Cosine similarity measures the angle between two vectors and ignores their magnitude ($\cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$), so it captures *direction*, the right notion of semantic closeness for embeddings, and is OpenAI's recommended function; for normalised vectors it also ranks identically to Euclidean. A cosine score of ~0.89 on normalised embeddings is a strong semantic match (with ~0.75 moderate, below 0.5 weak), though scores are relative to the model and corpus. And the index's declared dimension must equal the model's output (1536 for OpenAI); a mismatch makes the query fail with a dimension error. Common wrong answer: "a higher dimension always means better search". More dimensions cost more storage/time and only help if the model actually uses them well.
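The "ignores magnitude" property is easy to check by hand. The sketch below implements the cosine formula directly with the standard library; the three-element vectors are made-up values chosen so the arithmetic is easy to follow, not real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 2.0, 3.0]
same_direction = [10.0, 20.0, 30.0]   # the query scaled by 10
orthogonal = [3.0, 0.0, -1.0]         # dot product with the query is 0

print(round(cosine(query, same_direction), 6))
print(round(cosine(query, orthogonal), 6))
```

Scaling a vector by 10 leaves its score against the query at (numerically very close to) 1, while the orthogonal vector scores 0: only direction matters.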
Write the Cypher to create a vector index on a Chunk node’s textEmbedding property (1536
dimensions, cosine), and say which status from SHOW INDEXES means it’s ready to query.
Show answer
``CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS FOR (c:Chunk) ON (c.textEmbedding) OPTIONS {indexConfig: {`vector.dimensions`: 1536, `vector.similarity_function`: 'cosine'}}``. The dimension (1536) must match the embedding model's output exactly, and `cosine` is the recommended similarity function for OpenAI embeddings. `IF NOT EXISTS` makes creation idempotent. It is ready to query when `SHOW INDEXES` reports its status (the `state` column) as **ONLINE**; while it is `POPULATING`, searches return incomplete or empty results. Common wrong answer: querying immediately after creating it. You must wait for ONLINE, or the results are wrong.
Describe the four stages of a Neo4j vector similarity search (encode → search → score → return) and the one consistency rule that makes the results meaningful.
Show answer
Four stages: encode — turn the user's question into a vector with the same embedding model that produced the stored vectors; search — `db.index.vector.queryNodes` finds the top-K nodes whose vectors are nearest the query vector; score — each result carries a cosine similarity (0–1), higher = closer; return — the matched nodes' text and metadata become the LLM's context. The one consistency rule that makes it work: query and corpus must be embedded by the *same* model into the *same* space. Parameterise the question, key, and topK so the plan caches and secrets stay out of the string. Common wrong answer: embedding the query with a different model than the stored text — the vectors then live in different spaces and the scores are noise.
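The search, score, and return stages can be imitated with an exhaustive scan. This is only a sketch: the vectors are hypothetical 3-dimensional stand-ins for real embeddings, `top_k` is a helper written for this example, and a real vector index would use an approximate nearest-neighbour structure rather than scoring every node. The raw cosine computed here can also be negative, whereas the answer above describes index scores on a 0 to 1 scale.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec, stored, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine(query_vec, vec), text) for text, vec in stored.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical 3-dimensional "embeddings"; real ones come from the model.
stored = {
    "solar farm expansion": [0.9, 0.1, 0.0],
    "wind turbine contracts": [0.8, 0.3, 0.1],
    "employee volunteering day": [0.1, 0.9, 0.2],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for the encoded question

for score, text in top_k(query_vec, stored, k=2):
    print(f"{score:.3f}  {text}")
```

The comparison is only meaningful because the query vector and the stored vectors are assumed to live in the same space, which is the same-model rule stated above.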
Write a Cypher query that generates an OpenAI embedding for each Chunk node’s text and
stores it on the node. Name the function and the procedure involved and where the API key goes.
Show answer
Match the nodes, encode their text, and store the vector on each node: `MATCH (c:Chunk) WHERE c.text IS NOT NULL WITH c, genai.vector.encode(c.text, "OpenAI", {token: $openAiApiKey}) AS v CALL db.create.setNodeVectorProperty(c, "textEmbedding", v)`. The `WHERE ... IS NOT NULL` skips chunks with no text; `genai.vector.encode` returns the embedding (text, provider, config with the token as a parameter); `setNodeVectorProperty` writes it onto the node so it's co-located with the graph. Pass `$openAiApiKey` via `params`, never in the string. Then verify with `size(c.textEmbedding)` = 1536. Common wrong answer: returning the vector to Python and writing it back in a second round trip — `setNodeVectorProperty` stores it in the same query.
What does Neo4j’s genai.vector.encode function do?
Why
genai.vector.encode calls an external provider (e.g. OpenAI) from within Cypher, taking the
text, provider name, and a config map (with the API token), and returns the embedding as a
float array — which you then store with db.create.setNodeVectorProperty. Creating the index is
CREATE VECTOR INDEX; the search is db.index.vector.queryNodes; and it calls a hosted model,
it doesn’t train one. It is the generate-an-embedding step, not the index or the search.
Show answer
Correct: It calls an external embedding API from inside Cypher, returning the embedding
Rewrite this insecure call to use a query parameter, and name two benefits: kg.query(f'... {{ token: "{api_key}" }} ...').
Show answer
Rewrite `kg.query(f'... {{ token: "{api_key}" }} ...')` as `kg.query("... { token: $openAiApiKey } ...", params={"openAiApiKey": api_key})`. The f-string version interpolates the secret into the query text, where it can leak into logs, error messages, and the query cache (and opens an injection hole for any user-derived value). The parameterised version keeps the key out of the string entirely. Two benefits: security (no leakage, no injection) and performance (Neo4j caches the execution plan for the parameterised query and reuses it across calls). Caveat: DDL like `CREATE INDEX` can't take parameters, so sanitise those manually. Common wrong answer: "log the query for debugging, key and all" — that's exactly how keys leak.
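The leak is visible without a database if you simply inspect the query text each approach produces. In this sketch the helper names `build_encode_call` and `build_encode_call_unsafe`, the sample text, and the placeholder key are all assumptions made for illustration.

```python
def build_encode_call(api_key):
    """Return (query, params) with the secret kept out of the query text."""
    query = (
        'RETURN genai.vector.encode($text, "OpenAI", '
        "{token: $openAiApiKey}) AS vector"
    )
    params = {"text": "hello", "openAiApiKey": api_key}
    return query, params

def build_encode_call_unsafe(api_key):
    """Anti-pattern: the f-string bakes the secret into the query text."""
    return (
        f'RETURN genai.vector.encode("hello", "OpenAI", '
        f'{{token: "{api_key}"}}) AS vector'
    )

fake_key = "sk-not-a-real-key"  # placeholder, not a real credential
safe_query, safe_params = build_encode_call(fake_key)
unsafe_query = build_encode_call_unsafe(fake_key)

print(fake_key in safe_query)    # is the secret in the parameterised text?
print(fake_key in unsafe_query)  # is the secret in the interpolated text?
```

With a live connection, the safe pair would be passed along as `kg.query(safe_query, params=safe_params)`; the query text stays identical across keys and inputs, which is what allows the plan to be reused.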
Why pass an API key into a Cypher query as a parameter ($openAiApiKey) rather than an f-string?
Why
A parameter keeps the secret out of the query string — so it can’t leak into logs, error messages, or the query cache, and there’s no injection surface — and it lets Neo4j reuse a cached execution plan across calls. Typing speed is irrelevant; you should not store API keys in the graph; and you technically can build a query string with interpolation (that’s exactly the unsafe thing to avoid), so it isn’t that Cypher forbids it.
Show answer
Correct: Parameters keep the key out of the query string and let Neo4j cache the plan
A cosine similarity search returns a flat band of scores (0.78–0.82) with no clear winner. What does this most likely indicate?
Why
A flat band means no single stored item stands out — usually an ambiguous query or a corpus without a strong match — so you read the distribution (a clear winner like 0.92 vs 0.75 would signal confidence) and respond with a threshold or hybrid search. It is not evidence of a corrupt index or broken cosine math (those tend to give zero/garbage results, not a tidy band), and a flat band is the opposite of high-quality discrimination.
Show answer
Correct: The query is ambiguous or the corpus lacks a strong match
A similarity search for “renewable energy initiatives” returns five chunks scored 0.91, 0.88, 0.84, 0.82, 0.79 on related-but-varying topics. Which are relevant, what threshold would you set, and how would you raise precision without changing the embedding model?
Show answer
Read relevance from the score gap, not absolute values. If a "renewable energy" search returns solar 0.91, wind 0.88, carbon-credit 0.84, then employee-volunteering 0.82 and recycling 0.79, the top three are squarely on-topic while the bottom two are corporate-responsibility near-misses — semantically adjacent but not energy. A threshold around 0.83 keeps the genuine matches (0.84 and up) and drops the rest; tune it by need (≈0.78 for high recall, ≈0.88+ for high precision), and calibrate against *this* corpus since scores are relative. To improve precision without changing the embedding model: add a cross-encoder re-rank, chunk more granularly, pre-filter by metadata, or go hybrid (vector + keyword). Common wrong answer: applying a fixed threshold like 0.8 across corpora — a good cutoff for taglines may be wrong for filings.
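The effect of each cutoff on this result set can be tabulated in a few lines. The topics and scores below are the ones from the answer above; `keep_above` is a helper name introduced for this sketch.

```python
results = [
    ("solar", 0.91), ("wind", 0.88), ("carbon-credit", 0.84),
    ("employee-volunteering", 0.82), ("recycling", 0.79),
]

def keep_above(results, threshold):
    """Keep the topics whose score is at or above the threshold."""
    return [topic for topic, score in results if score >= threshold]

for threshold in (0.78, 0.83, 0.88):
    print(threshold, keep_above(results, threshold))
```

The 0.78 cutoff returns all five chunks, 0.83 keeps the three energy topics, and 0.88 keeps only solar and wind, which is the recall/precision trade described above.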
Write a complete vector similarity-search query: encode a user $question, search a
chunk_embeddings index for the top-K matches, and return the chunk text with scores.
Show answer
Encode the question with the *same* model, search the index, and project text + score: `WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS q CALL db.index.vector.queryNodes('chunk_embeddings', $topK, q) YIELD node, score RETURN node.text, score ORDER BY score DESC`. `queryNodes` takes the index name, a top-K count, and the query vector, and yields each matched node with its similarity score. All inputs — `$question`, `$openAiApiKey`, `$topK` — are query parameters. Crucially the question must be embedded with the same model used for the stored vectors, or the comparison is meaningless. Common wrong answer: hard-coding topK or the question in the string instead of parameterising them.
Which statement about creating and using a Neo4j vector index is correct?
Why
Two hard requirements: the index’s declared dimension must equal the embedding model’s output
(1536 for OpenAI), and the index must reach ONLINE status — a POPULATING index returns
incomplete or empty results. Neo4j does not pad or truncate to reconcile a dimension mismatch
(a mismatch raises a dimension error at query time), status very much matters, and a vector index does require a
similarity function (e.g. cosine) in its config.
Show answer
Correct: The dimension must match the model's output exactly, and the index must be ONLINE before querying
When should you use Neo4j as your vector store versus a dedicated vector database alongside the graph? Give the deciding factor.
Show answer
Co-locate in Neo4j when your retrieval genuinely *combines* similarity with traversal — the whole graph-RAG thesis: one query finds the nearest chunk and follows its relationships to pull in connected context, in a single round trip and one system to operate. Choose a dedicated vector DB (Pinecone, Weaviate, Milvus) alongside the graph when the vector workload is large and standalone and you need advanced ANN algorithms (IVF-PQ, ScaNN), frequent embedding-model swaps, or independent scaling of the vector tier — capabilities a single-engine setup may not match. The deciding question is whether the *queries* mix vector and graph, not whether the data merely has both. Common wrong answer: "always co-locate, fewer systems is always better" — fewer systems is simpler, but a heavy standalone vector workload can outgrow Neo4j's ANN support.
KG Construction
Which sections of an SEC Form 10-K are used to construct the knowledge graph, and what does each contain?
Show answer
Four sections drive 10-K knowledge-graph construction: Item 1 (business description), Item 1A (risk factors), Item 7 (management's discussion and analysis), and Item 7A (quantitative and qualitative market-risk disclosures). They are extracted from the raw XML into JSON along with the CIK and CUSIP identifiers. Common wrong answer: naming the financial statements / balance sheet — those are the numeric tables (largely Item 8); the course builds the graph from the narrative text sections, which embed well for semantic retrieval.
A 10-K section is 50,000 characters. Using chunk_size=2000 and chunk_overlap=200, about how
many chunks result? Show the reasoning, not just the number.
Show answer
Each chunk advances by chunk_size minus overlap, so the count is roughly the total length divided by that effective advance. For a 50,000-character section at chunk_size 2000, overlap 200, the advance is 2000 − 200 = 1800, and 50,000 / 1800 ≈ 27.8, which rounds up to 28 chunks. The exact sliding-window formula is ceil((L − O) / (S − O)) for length L, size S, overlap O; here ceil(49,800 / 1800) = 28, the same answer (a splitter that breaks on separators may produce a slightly different count). Common wrong answer: dividing by the full chunk size (50,000 / 2000 = 25), which ignores overlap; overlap shrinks the advance and so slightly raises the count.
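The overlap arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes a plain fixed-size sliding window; the function name `estimate_chunks` is illustrative and not part of any library.

```python
import math


def estimate_chunks(length: int, chunk_size: int, overlap: int) -> int:
    """Estimate the chunk count for a fixed-size sliding window over `length` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if length <= 0:
        return 0
    if length <= chunk_size:
        return 1  # everything fits in a single chunk
    step = chunk_size - overlap  # effective advance per chunk
    return math.ceil((length - overlap) / step)


print(estimate_chunks(50_000, 2_000, 200))  # 28
print(math.ceil(50_000 / 2_000))            # 25, the figure that ignores overlap
```

A real text splitter that respects sentence or paragraph separators will not hit these numbers exactly, so treat the result as a sizing estimate.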
You raise the chunk size from 2,000 to 8,000 characters. What is the most likely effect on retrieval quality?
Why
A larger chunk blends more topics into one vector, so the embedding becomes generic and matches specific queries less precisely — the classic size/precision tradeoff. Bigger chunks reduce the count (fewer, larger pieces), so embedding cost falls rather than rises; more context does not reliably improve precision here; and chunk size does not push all cosine scores uniformly higher.
Show answer
Correct: Embeddings get more generic, so retrieval loses precision on specific queries
Why is chunkSeqId stored in each chunk’s metadata?
Why
chunkSeqId preserves a chunk’s position within its section, so Module 5 can wire chunks into
a NEXT linked list for adjacent-context retrieval. The unique key MERGE dedupes on is
chunkId, not chunkSeqId; similarity scores are computed at query time, not stored; and the
filing/company are captured by formId/cik, not the sequence id.
Show answer
Correct: It records each chunk's order within its section, enabling NEXT links in later modules
In an SEC Form 10-K pipeline, what role do CIK and CUSIP play?
Why
CIK (Central Index Key) and CUSIP are identifiers: the CIK keys a company across all its SEC filings, and the CUSIP (whose first 6 characters identify the issuer) identifies the company across datasets, so they become the linking keys when the graph gains relationships. They are not text sections (those are Item 1/1A/7/7A), not vector-index parameters (dimensions and similarity function), and not LangChain chain types.
Show answer
Correct: Company and security identifiers that join a filing to the same company across datasets
Why create a uniqueness constraint on Chunk.chunkId before bulk-loading with MERGE?
Why
The constraint does two jobs: it guarantees no duplicate chunkId, and it creates an implicit
index so each MERGE existence check is an index lookup rather than a full scan of every Chunk
node — turning quadratic loading into roughly linear. It has nothing to do with encryption or
score normalisation, and ON CREATE SET (not the constraint) is what limits property writes to
first creation.
Show answer
Correct: It blocks duplicate chunks and adds an index, so each MERGE is a lookup not a scan
You have already embedded and indexed your chunks in Neo4j. Which Neo4jVector constructor
connects to them, and what currency caveat applies to its import?
Show answer
Use `from_existing_graph` when the chunks are already embedded and indexed in Neo4j (as in this pipeline) — it attaches to that index instead of re-embedding. Currency caveat: the import moved to the dedicated langchain-neo4j package (2024-11), so the current import is `from langchain_neo4j import Neo4jVector`, not `from langchain_community.vectorstores import Neo4jVector`. The connect/retrieve/generate behaviour is unchanged; only the package path differs. Common wrong answer: re-running an embedding constructor (from_documents), which would redundantly re-embed text you already indexed.
A RAG system over only NetApp’s 10-K answers “What is Apple’s revenue?” with a confident, specific figure attributed to Apple. Identify what went wrong, the root cause, and how to prevent it.
Show answer
Diagnosis: entity mis-attribution plus invented facts. The graph holds only NetApp's filing, so a query about Apple retrieves NetApp's nearest chunks (highest cosine among what exists), and the LLM applies that context to Apple and fabricates a revenue figure. Root cause: no entity-alignment check and no refusal instruction in the prompt. Fix: a prompt that says "use ONLY this context", names the source companies, and instructs "if the context is not about the company in the question, say you don't have that data" — plus retrieval-score gating so out-of-scope questions return no context at all. Note the limit: prompt engineering reduces but does not eliminate hallucination, so pair it with a score threshold.
How do you load chunk nodes so the pipeline can be re-run safely after a failure without creating duplicates? Name the pieces and sketch the Cypher.
Show answer
Idempotency means re-running the pipeline yields the same graph, which retry-prone ingestion needs. Three pieces deliver it: a uniqueness constraint on the stable key (no duplicates, plus an implicit index), MERGE on that key (match-or-create instead of always insert), and ON CREATE SET to write properties only on first creation. Concretely: `MERGE (c:Chunk {chunkId: $p.chunkId}) ON CREATE SET c.text = $p.text, c.formId = $p.formId`, behind `CREATE CONSTRAINT ... FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE`. Common wrong answer: using CREATE — it inserts a new node every run, so a retry duplicates every chunk.
A teammate’s loader uses MERGE (c:Chunk {chunkId: $id, loadedAt: timestamp()}) and the graph
keeps growing duplicate chunks on every run. Diagnose the bug and fix it.
Show answer
The bug is MERGE on a volatile property. `MERGE (c:Chunk {chunkId: $id, loadedAt: timestamp()})` includes a timestamp in the match pattern, so every run produces a different pattern, never matches the existing node, and creates a duplicate — silently corrupting the graph at scale. Fix: MERGE on the stable identifier only, then write volatile fields in ON CREATE SET / ON MATCH SET: `MERGE (c:Chunk {chunkId: $id}) ON CREATE SET c.loadedAt = timestamp() ON MATCH SET c.updatedAt = timestamp()`. Key rule: MERGE's match pattern must contain only stable keys. Common wrong answer: blaming MERGE itself or a missing constraint — MERGE is correct; the bug is putting a volatile value inside its match pattern.
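The difference between the two MERGE patterns can be seen in a toy simulation. This sketch does not talk to Neo4j: `merge` is a hypothetical, simplified stand-in that keys nodes on their match properties, which is enough to show why a timestamp in the pattern defeats match-or-create.

```python
from typing import Optional


def merge(store: dict, match_props: dict, on_create: Optional[dict] = None) -> dict:
    """Toy stand-in for MERGE: reuse the node keyed by these match properties, else create it."""
    key = tuple(sorted(match_props.items()))
    if key not in store:
        # Properties in on_create are written only when the node is first created.
        store[key] = {**match_props, **(on_create or {})}
    return store[key]


def run_loader(store: dict, chunk_ids: list, run_time: int, volatile_in_pattern: bool) -> None:
    for cid in chunk_ids:
        if volatile_in_pattern:
            # Buggy: the timestamp is part of the match pattern.
            merge(store, {"chunkId": cid, "loadedAt": run_time})
        else:
            # Fixed: match on the stable key, set the timestamp on create only.
            merge(store, {"chunkId": cid}, on_create={"loadedAt": run_time})


buggy, fixed = {}, {}
for run_time in (1000, 2000, 3000):  # three pipeline runs at different times
    run_loader(buggy, ["c1", "c2"], run_time, volatile_in_pattern=True)
    run_loader(fixed, ["c1", "c2"], run_time, volatile_in_pattern=False)

print(len(buggy))  # 6: two new nodes on every run
print(len(fixed))  # 2: re-runs match the existing nodes
```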
What metadata is stored with each chunk, and why store it rather than just the text and its embedding?
Show answer
Each chunk record stores its text plus metadata that preserves origin, position, and identity: item (which section), chunkSeqId (position within the section), formId (the filing it belongs to), chunkId (its unique key), source (a link back to the SEC filing), and the entity ids cik and cusip6. The point is provenance — you can rebuild document structure and link chunks to entities later only if you captured these at ingest. Common wrong answer: storing just the text and embedding; that loses order and origin, so you can never reconstruct NEXT order or trace a chunk to its company.
What does Neo4jVector.from_existing_graph need in order to connect LangChain to an existing
Neo4j vector index? Name the key parameters.
Show answer
`Neo4jVector.from_existing_graph` connects LangChain to nodes already in the graph, so it needs to know where the vectors live: the index_name (the vector index, e.g. "form10kChunks"), the node_label (which nodes, e.g. "Chunk"), the text_node_properties (which property holds the text, e.g. ["text"]), and the embedding_node_property (where the vector is stored, e.g. "textEmbedding"), plus the connection url/username/password and an embeddings object. In this pipeline the index and embeddings already exist, so the call simply attaches to them. Common wrong answer: treating it as a bulk re-embedding step like from_documents. It reuses what is already stored; it only creates the index or computes embeddings for nodes where they are missing, and `from_existing_index` is the variant that strictly requires a pre-built index.
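The parameters listed above can be collected as keyword arguments. This is a hedged sketch: the URL and credentials are placeholders, the helper names `build_vector_store_kwargs` and `connect` are illustrative, and the `embeddings` object (for example an OpenAI embeddings instance) is assumed to be created elsewhere. The import is deferred so the snippet can be read without the `langchain-neo4j` package installed.

```python
def build_vector_store_kwargs(url: str, username: str, password: str) -> dict:
    """Gather the arguments that tell LangChain where the chunk vectors live."""
    return {
        "url": url,
        "username": username,
        "password": password,
        "index_name": "form10kChunks",               # the vector index
        "node_label": "Chunk",                       # which nodes to search
        "text_node_properties": ["text"],            # property holding the text
        "embedding_node_property": "textEmbedding",  # property holding the vector
    }


def connect(embeddings, **kwargs):
    """Attach to the existing graph; needs langchain-neo4j and a running database."""
    from langchain_neo4j import Neo4jVector  # imported lazily on purpose
    return Neo4jVector.from_existing_graph(embedding=embeddings, **kwargs)


# Placeholder connection details, not real credentials.
kwargs = build_vector_store_kwargs("bolt://localhost:7687", "neo4j", "password")
print(sorted(kwargs))
```

Calling `connect(embeddings, **kwargs)` would return a vector store whose `as_retriever()` can feed a retrieval-QA chain.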
A RAG system holding only NetApp’s filing answers a question about Apple using NetApp’s data. Which fix most directly addresses this?
Why
This is entity mis-attribution: the retriever returned NetApp’s nearest chunks and the prompt never told the model to decline. The direct fix is a prompt that restricts the model to the retrieved context and refuses when the entity is missing (“say you don’t know”), ideally paired with retrieval-score gating. Bigger chunks, higher dimensionality, and a different similarity function change retrieval mechanics but none of them stop the model from answering out-of-scope.
Show answer
Correct: Instruct the prompt to use only the retrieved context and refuse missing entities
After constructing the graph in this chapter, Neo4j holds embedded chunks but no relationships. Name three things this pure vector store cannot do, and what fixes them.
Show answer
After this chapter the chunks are embedded but disconnected — Neo4j is functionally just a vector store. So you cannot follow a chunk back to its source filing, cannot retrieve the adjacent chunk for expanded context, and cannot traverse from a chunk to related entities (the company that filed it, other filings, investors). All three need relationships, which Module 5 adds as NEXT (chunk order), PART_OF (chunk to form), and SECTION links. Common wrong answer: "similarity search is broken" — vector search works fine; what is missing is *connection*, not retrieval.
Describe the three stages of the LangChain retrieval-QA chain over Neo4j, in order.
Show answer
Three stages: connect, retrieve, generate. Connect — Neo4jVector wraps the graph as a standard vector store, running Cypher similarity search under the hood. Retrieve — the question is encoded with the same embedding model and matched against the chunk embeddings; the top-K nearest chunks come back. Generate — those chunks are stuffed into the LLM prompt as context, and the LLM answers grounded in them. Common wrong answer: putting generation before retrieval, or skipping retrieval — without the retrieved context the LLM is just answering from its parameters, which is the non-RAG baseline.
In a "stuff" retrieval-QA chain, what does the retriever hand to the LLM?
Why
The "stuff" chain type “stuffs” the top-K retrieved chunks straight into the prompt as context,
and the LLM answers from them. It does not reduce to a single chunk (that would be re-ranking),
it does not pass Cypher (LLM-generated Cypher is Module 7’s approach), and it never hands over raw
vectors — the LLM reads text, not embeddings.
Show answer
Correct: The top-K retrieved chunks, concatenated into the prompt as context
When is a pure vector store the right choice and when should you invest in graph relationships? Give the deciding factor, not just a preference.
Show answer
The deciding question is whether the *queries* combine similarity with traversal, not whether the data has latent relationships. If retrieval is answered by semantic similarity alone — FAQ lookup, "find passages like this" — a pure vector store is simpler to operate and the graph adds overhead you never cash in. Add graph structure when retrieval needs connection: adjacent chunks, the parent document, the filing company, multi-hop context. This mirrors the Module 3 vector-DB rule (co-locate when queries mix vector and graph). Common wrong answer: "always use a graph because the data is connected" — relationships in the data only pay off if the queries actually traverse them.
Adding Relationships
What is a chunk window, and why does it improve RAG answers without changing the embedding model?
Show answer
A chunk window returns the best-matching chunk plus its neighbours in the NEXT linked list, so the LLM sees continuous context instead of one isolated fragment. It matters because a single chunk often holds only part of an answer — a fact or detail can sit in the adjacent chunk that vector search didn't rank first. Expanding to a window (default 3 chunks, w = 2k + 1) recovers that surrounding context and improves answer completeness, all without changing the embedding model or re-running the search. Common wrong answer: "it improves the similarity score" — the window doesn't change scores; it changes how much context reaches the LLM.
What is a custom retrieval query in Neo4jVector, and how does it let one retrieval step combine vector search with graph traversal?
Show answer
A custom retrieval query is Cypher you pass to Neo4jVector through the `retrieval_query` argument. By default the retriever returns the matched chunk's text verbatim; the custom query runs after the vector match and can traverse the graph before returning text to the LLM, here following NEXT in both directions to assemble a chunk window. That is exactly how vector search and graph traversal fuse into one retrieval step: the index finds the entry chunk, the Cypher tail expands it. Common wrong answer: thinking it replaces the vector search. It extends it, receiving the vector hit as input.
A custom retrieval query receives two variables from the vector index — name them — and sketch how it returns a chunk window to the LLM.
Show answer
The custom query receives two values from the vector search: `node` (the matched Chunk) and `score` (its similarity). It typically starts `WITH node, score`, then matches a window such as `(:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)`, keeps the longest with `ORDER BY length(window) DESC LIMIT 1`, unwinds the window's nodes, and returns each chunk's text, the score, and a metadata map. The shape the chain expects back is text, score, metadata. Common wrong answer: returning only the single node's text — that throws away the window you just traversed.
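One way the described query might look, held as a Python string ready to pass as `retrieval_query`. This is a sketch under assumptions: the constant name is illustrative, `source` is assumed to be the metadata property worth returning, and the texts are joined with a plain `reduce` rather than any helper procedure.

```python
# Cypher appended after the vector search, which supplies `node` and `score`.
RETRIEVAL_QUERY_WINDOW = """
WITH node, score
MATCH window = (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window
ORDER BY length(window) DESC LIMIT 1
UNWIND nodes(window) AS chunk
WITH collect(chunk.text) AS texts, node, score
RETURN reduce(acc = "", t IN texts | acc + t + "\\n") AS text,
       score,
       node {.source} AS metadata
"""

# The chain expects exactly these three columns back.
EXPECTED_COLUMNS = ("text", "score", "metadata")
print(all(f"AS {col}" in RETRIEVAL_QUERY_WINDOW or col == "score"
          for col in EXPECTED_COLUMNS))
```

Note that `LIMIT 1` as written keeps a single longest window, which suits a top-1 vector hit; returning windows for several hits would need the limit applied per matched node.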
A 10-chunk document (Introduction seq 0–4, Methods seq 0–4, sequence ids restarting per section) was loaded with NEXT but no section filter. How many edges should exist, what does the bug add, and how would you detect the bad edges?
Show answer
Under per-section numbering (Introduction seq 0–4, Methods seq 0–4), the correct build with the `item` filter gives 4 edges per section = 8 total. Without the filter, the `chunkSeqId + 1` join matches by number across the whole form, so each chunk also links to the next-numbered chunk in the other section (Introduction seq-3 → Methods seq-4, and so on). That adds 8 spurious cross-section edges (two per sequence step, one in each direction between the sections), for 16 edges in total, and forks the list. Detect them by querying for NEXT edges whose endpoints sit in different sections: `MATCH (a)-[:NEXT]->(b) WHERE a.item <> b.item RETURN a, b`; any rows are boundary bleed. Fix by deleting those edges and rebuilding with the `item` filter. Common wrong answer: assuming a single stray edge (one "last-of-A → first-of-B" link). The unfiltered join adds a cross-section edge at every sequence number that lines up, not just one.
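The edge counts can be confirmed by simulating the join in plain Python. This sketch models chunks as dictionaries and mirrors the `chunkSeqId + 1` condition; the form id `f1` and the helper name are made up for illustration.

```python
from itertools import product

# Ten chunks in one form: two sections, sequence ids restarting at 0 in each.
chunks = [
    {"formId": "f1", "item": item, "chunkSeqId": seq}
    for item in ("introduction", "methods")
    for seq in range(5)
]


def build_next_edges(nodes: list, filter_on_item: bool) -> list:
    """Pair chunks the way the NEXT join does, with or without the section filter."""
    edges = []
    for c1, c2 in product(nodes, repeat=2):
        if c1["formId"] != c2["formId"]:
            continue
        if filter_on_item and c1["item"] != c2["item"]:
            continue
        if c2["chunkSeqId"] == c1["chunkSeqId"] + 1:
            edges.append((c1, c2))
    return edges


correct = build_next_edges(chunks, filter_on_item=True)
buggy = build_next_edges(chunks, filter_on_item=False)
# Same test as the detection query: endpoints in different sections.
cross_section = [(a, b) for a, b in buggy if a["item"] != b["item"]]

print(len(correct), len(buggy), len(cross_section))  # 8 16 8
```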
Trace the traversal from an arbitrary chunk to the first chunk of its own section. Which relationships do you follow, and in which direction?
Show answer
Hop up, then down. From the chunk, follow PART_OF to its Form (`(c:Chunk)-[:PART_OF]->(f:Form)`), then follow SECTION from the form, filtered to the chunk's own section, down to that section's first chunk (`(f)-[s:SECTION]->(first:Chunk) WHERE s.f10kItem = c.item`). The full pattern is `(c:Chunk)-[:PART_OF]->(:Form)-[:SECTION]->(:Chunk)` with the SECTION's f10kItem matching the starting chunk's item. This is the two-hop trace the module's last objective asks for: chunk → form → section entry point. Common wrong answer: walking NEXT backward — it works only if no edge is missing and is O(n) in section length, whereas SECTION jumps in one hop.
Write the Cypher that creates NEXT relationships between sequential chunks within a single form section. Which clauses keep the linked list from crossing section boundaries?
Show answer
Match two chunks in the same section whose sequence ids differ by one, then MERGE a NEXT edge. The pattern is: match `(c1:Chunk), (c2:Chunk)` where `c1.formId = c2.formId`, `c1.item = c2.item`, and `c2.chunkSeqId = c1.chunkSeqId + 1`, then `MERGE (c1)-[:NEXT]->(c2)`. The two equality filters on formId and item are what keep the list inside one section; MERGE (not CREATE) makes it idempotent so re-running adds no duplicates. Common wrong answer: joining on chunkSeqId + 1 without the item filter. Sequence ids restart in every section, so the join then links chunks across sections wherever the numbers line up.
A NEXT relationship is created between two chunks when which condition holds?
Why
NEXT links consecutive chunks within one section: same formId, same item, and chunkSeqId
differing by exactly one. Matching across different sections is the bug the section filter exists
to prevent; NEXT is about document order, not cosine similarity; and it stays within a single form,
not across a company’s filings.
Show answer
Correct: They share a formId and item and their chunkSeqId values differ by one
Design a graph schema for technical documentation (books contain chapters contain sections, each section has embedded chunks). Give the node and relationship types, and say which relationship serves each of: sequential reading, section jump, chapter jump, and similarity search.
Show answer
Reuse the SEC pattern. Node types: Book, Chapter, Section, Chunk (with text + embedding). Membership runs upward with PART_OF (Chunk to Section, Section to Chapter, Chapter to Book), and order runs along NEXT (Chunk to Chunk within a section). For direct jumps, add entry-point relationships — Book to Chapter and Chapter to Section — so a reader can jump by name without walking the whole hierarchy. Each navigation pattern maps to one mechanism: sequential reading follows NEXT, section/chapter jumps follow the entry-point edges, upward context follows PART_OF, and similarity uses the vector index on Chunk embeddings then PART_OF for context. Key design call: which entry-point edges to materialise — they trade a little storage for direct navigation.
Why is filtering by section (not just by form) necessary when creating NEXT relationships, and what concretely breaks if you omit it?
Show answer
Section filtering is a correctness requirement because chunk sequence ids are assigned within a section (they restart at 0), and NEXT links chunks whose ids differ by one. Filter only on formId and the `chunkSeqId + 1` join matches by number across the whole form — so a chunk links to the next-numbered chunk in every section, not just its own, forking the linked list across section boundaries. A chunk window built at a boundary then pulls in unrelated context from an adjacent section, degrading answer quality. The fix: require matching formId AND item on both chunks when creating NEXT. Common wrong answer: picturing the bug as one stray edge — the unfiltered join adds a cross-section edge wherever sequence numbers line up.
In a chunk-window query, what does *0..1 in -[:NEXT*0..1]-> do?
Why
*0..1 is a variable-length path matching zero or one NEXT hop. The “zero” is the point: at
the first or last chunk the missing side matches no hop, so the pattern still returns the
available neighbours instead of failing. It is not a fixed single hop (that is the pattern that
breaks at boundaries), not a 0–10 range, and not a score filter.
Show answer
Correct: It matches zero or one NEXT hop, so the window still works at list boundaries
Which option correctly maps the three relationship types added in Module 5 to their roles?
Why
NEXT orders chunks within a section (the linked list), PART_OF points a chunk up to its parent
form (document membership), and SECTION points a form down to the first chunk of each section
(the entry point, tagged f10kItem). The near-miss differs only in the SECTION clause — SECTION
links to a section’s first chunk, not to a company; companies don’t enter until Module 6.
Show answer
Correct: NEXT orders chunks; PART_OF joins a chunk to its form; SECTION links a form to each section's first chunk
You hold a chunk in the middle of a section and want that section’s first chunk. Which traversal gets you there?
Why
The chunk reaches its section’s entry point by hopping up to the form with PART_OF, then down with
SECTION (which points at the seq-0 chunk): (:Chunk)-[:PART_OF]->(:Form)-[:SECTION]->(:Chunk).
Following NEXT forward lands on the last chunk, not the first; PART_OF goes to a form not a
company; and a vector search on the title is neither reliable nor structural.
Show answer
Correct: Chunk to Form via PART_OF, then Form to the section's first chunk via SECTION
Graph-augmented retrieval beat pure vector retrieval on quality but cost more per query. Give the deciding factor for choosing each, and design a hybrid that captures most of the gain at lower cost.
Show answer
Graph augmentation bought a real quality gain in the support-RAG comparison — roughly 78 to 89 percent accuracy and 62 to 84 percent completeness — but at higher latency and about double the monthly cost. The deciding factor is whether questions actually span chunk or document boundaries and whether answer quality drives value; for self-contained FAQ lookups the pure vector path is enough. The hybrid captures most of the gain cheaply: run vector search first, then trigger the graph window only on a cheap signal — a low top similarity score or a multi-entity question — so the expensive traversal fires on the ~30 to 40 percent of queries that need it, keeping average cost near the vector baseline. Common wrong answer: "always add the graph because accuracy is higher" — you pay for traversal you don't always use.
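The hybrid's routing rule can be sketched as a small predicate. Everything here is an assumption for illustration: the 0.80 threshold, the function name, and the idea that an upstream step has already counted the entities in the question.

```python
def needs_graph_window(top_score: float, entity_count: int,
                       score_threshold: float = 0.80) -> bool:
    """Trigger the graph window on a weak top hit or a multi-entity question."""
    return top_score < score_threshold or entity_count > 1


# Hypothetical (top similarity score, entities in question) pairs.
queries = [(0.93, 1), (0.71, 1), (0.90, 2), (0.88, 1), (0.65, 3)]
routed = [needs_graph_window(score, entities) for score, entities in queries]

print(routed)                     # [False, True, True, False, True]
print(sum(routed) / len(routed))  # 0.6
```

In practice the threshold would be tuned on logged queries until the share routed to the graph matches the share that actually benefits from it.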
What does a variable-length path like *0..1 match, and why is it needed at the start and end of a
chunk linked list?
Show answer
A variable-length path matches a *range* of hops instead of a fixed count: `*0..1` matches zero or one NEXT hop, `*0..2` zero to two, `*1..3` one to three. It is needed at linked-list boundaries because a fixed two-hop window (before to middle to after) has nothing to match at the first chunk (no predecessor) or last chunk (no successor), so it returns zero rows. With `*0..1` the missing side matches zero hops — the endpoint collapses onto the target — so the query still returns the neighbours that do exist. Pair it with `ORDER BY length(window) DESC LIMIT 1` to keep the longest window available at each position. Common wrong answer: reading `*0..1` as "exactly one hop" — the zero is the point, and it is what lets the pattern survive at the list ends.
In the NetApp example, what did adding a chunk window reveal that single-chunk retrieval missed?
Why
The single best-matching chunk gave the high-level answer (“enterprise storage and data management”); the window followed NEXT into the adjacent chunk and surfaced the specific Keystone product detail. Vector search found the entry point, graph traversal added the surrounding context — the window does not change the score, add a model, or speed anything up.
Show answer
Correct: The Keystone product detail held in an adjacent chunk
Compare the RAG answer for “What is NetApp’s primary business?” with and without a chunk window. What specifically does the window add, and why?
Show answer
Without a window, retrieval returns the single best-matching chunk, so the answer is whatever that one fragment holds — for "What is NetApp's primary business?" that was "enterprise storage and data management." With a window, the query follows NEXT to the adjacent chunks and the LLM also sees the Keystone product detail that lived next door, producing a more complete answer. The lesson generalises: vector search picks the entry point, graph traversal supplies the surrounding context that a lone chunk omits. Common wrong answer: "the windowed result is more accurate because the score is higher" — accuracy/completeness rises from added context, not a changed score.
A chunk window takes k hops in each direction. For k = 2, how many chunks does a full
(non-boundary) window contain?
Why
The window size is w = 2k + 1: the target plus k neighbours on each side. For k = 2 that is
2(2) + 1 = 5 chunks. The “4” answer forgets to count the target itself, “3” is the k = 1 default,
and “3k + 1” is not the window formula.
Show answer
Correct: 5 chunks — the target plus two on each side
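The window formula and its behaviour at list boundaries can be sketched with sequence ids alone. This is a simplified model (the name `chunk_window` is illustrative): it treats a section as positions 0 to n − 1 and clips the window where no neighbour exists, which is what the zero-hop case achieves in Cypher.

```python
from typing import List


def chunk_window(target: int, k: int, section_length: int) -> List[int]:
    """Return the sequence ids in a window of k hops each side, clipped at the section ends."""
    if not 0 <= target < section_length:
        raise ValueError("target must be a valid position in the section")
    start = max(0, target - k)                    # no predecessor before position 0
    end = min(section_length - 1, target + k)     # no successor after the last chunk
    return list(range(start, end + 1))


print(chunk_window(5, 2, 10))  # [3, 4, 5, 6, 7], the full 2k + 1 = 5 window
print(chunk_window(0, 2, 10))  # [0, 1, 2], clipped at the start
print(chunk_window(9, 1, 10))  # [8, 9], clipped at the end
```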
Expanding the Graph
Which property is the unique identifier when creating Company nodes?
Why
Company nodes are MERGE’d on cusip6 — the issuer identifier that both datasets carry, which
is exactly why it can bridge Company to Form. managerCik keys Manager nodes, companyName is an
unstable display string (poor key), and formId identifies a filing, not a company.
Show answer
Correct: cusip6, the six-character company identifier
List the complete graph schema after Module 6: all node types, all relationship types, and the indexes.
Show answer
Four node types: Chunk (text + embedding), Form (the 10-K), Company, Manager. Five relationship types: Chunk NEXT Chunk (linked list), Chunk PART_OF Form (membership), Form SECTION Chunk (section entry), Company FILED Form (filing, via CUSIP), Manager OWNS_STOCK_IN Company (investment, with shares/value/quarter). Two indexes: a vector index on Chunk.textEmbedding and a full-text index on Manager.managerName. Together these support semantic search (vector), keyword search (full-text), sequential and hierarchical document navigation (NEXT/PART_OF/SECTION), and cross-entity traversal (FILED/OWNS_STOCK_IN). Common wrong answer: forgetting the indexes. They are part of the schema's capability, not an afterthought.
Automatic CUSIP-based linking between Company and Form nodes would fail in which situation?
Why
Linking on a shared key works only if both datasets actually share that key. If one identifies companies by CUSIP and the other by ISIN or ticker, there is no common value to MERGE on, so you need an entity-resolution layer. Shared CUSIPs are exactly what makes linking work; node creation order is irrelevant to MERGE; and a manager holding several companies is normal, not a failure.
Show answer
Correct: When the two datasets use different id schemes (CUSIP versus ticker)
How does the CUSIP identifier let you link the independently created Company nodes to the existing Form nodes?
Show answer
The six-character CUSIP prefix (`cusip6`) is a company identifier that appears in *both* the Form 10-K data (on Form nodes) and the Form 13 data (on Company nodes), so the two independently built node sets connect with no manual matching: `MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f)`. This is the canonical cross-dataset linking pattern: a shared, stable key lets MERGE join records that were never created together. Common wrong answer: matching on company name, which is unstandardised and produces both missed and wrong matches.
CUSIP-based linking is automatic here. What breaks when you integrate two datasets that identify companies differently, and how do you handle it?
Show answer
CUSIP linking works because it is a universal identifier shared by both datasets. When two sources use *different* schemes — one CUSIP, another ISIN or ticker, or only company-name strings — there is no common value to MERGE on, and automatic linking silently produces missed matches (false negatives) and wrong matches (false positives). The fix is an entity-resolution layer: prefer a canonical id where one exists (DUNS, SEC CIK); otherwise fuzzy-match names, verify with secondary attributes (address, industry), and store a canonical node with an `aliases` property for the variants. Entity resolution is usually the hardest part of KG integration — the join is easy once the keys agree. Common wrong answer: "just match on name" — that is precisely what fails.
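The name-matching step of an entity-resolution layer might look like the sketch below. It is an illustration only: the canonical ids, the suffix list, and the 0.85 threshold are made up, it uses the standard library's `difflib` for similarity, and it omits the secondary-attribute check (address, industry) that the answer recommends before accepting a match.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Legal suffixes to ignore when comparing names (an illustrative, incomplete list).
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "plc"}


def normalise(name: str) -> str:
    """Lower-case, strip punctuation, and drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def resolve(name: str, canonical: dict, threshold: float = 0.85) -> Optional[str]:
    """Return the id of the best-matching canonical name, or None if nothing is close enough."""
    target = normalise(name)
    best_id, best_score = None, 0.0
    for cid, cname in canonical.items():
        score = SequenceMatcher(None, target, normalise(cname)).ratio()
        if score > best_score:
            best_id, best_score = cid, score
    return best_id if best_score >= threshold else None


# Hypothetical canonical table keyed by a made-up id.
canonical = {"id-001": "NetApp, Inc.", "id-002": "Apple Inc."}

print(resolve("NETAPP INC", canonical))          # id-001
print(resolve("Pineapple Holdings", canonical))  # None
```

Returning `None` below the threshold is the important design choice: an unresolved name is a visible gap to fix, whereas a forced match silently becomes a wrong edge in the graph.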
Which hop sequence reaches the investment managers, starting from a text chunk?
Why
The path crosses both datasets: PART_OF up to the Form, FILED across to the Company (traversed backward from the form), then OWNS_STOCK_IN to each Manager (backward from the company). PART_OF goes to a form not a company; NEXT links chunks, not chunk-to-manager; and SECTION points a form at its first chunk, not at companies.
Show answer
Correct: Chunk to Form (PART_OF), Form to Company (FILED), Company to Manager (OWNS_STOCK_IN)
How does a full-text index differ from a vector index, and when would you use each?
Show answer
A full-text index matches *strings* — exact, partial, and fuzzy keyword search with relevance scoring (e.g. find managers matching "Royal Bank"). A vector index matches *meaning* — embeddings whose semantics are close, even when the words differ. Use full-text when the query is a name or keyword whose spelling matters; use vector when you want conceptually similar content regardless of wording. They are complementary, and Neo4j adds a third — property indexes for exact lookup by a known id. Common wrong answer: "full-text is just a slower vector search" — they answer different questions (spelling vs meaning), not the same question at different speeds.
What is graph-to-text context generation, and why is it needed in a graph-RAG pipeline? Give an example sentence.
Show answer
Graph-to-text context generation converts structured traversal results — relationship triples of subject, predicate, object — into natural-language sentences the LLM can read as context. For an OWNS_STOCK_IN result you might emit "Royal Bank of Canada owns 1,200,000 shares of NetApp." The reason is simple: the traversal produces precise, connected facts, but an LLM consumes text, so the sentence form is how those facts enter the prompt. It is the consume direction of the graph-LLM loop; Module 7 inverts it by having the LLM generate the query. Common wrong answer: passing raw rows or JSON — workable, but prose is what the model reasons over most reliably.
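A minimal sketch of that conversion, assuming each traversal row arrives as a dictionary with `managerName`, `companyName`, and `shares` keys (the row shape and function name are assumptions, and the figures are the illustrative ones from the example sentence):

```python
def holding_to_sentence(row: dict) -> str:
    """Render one OWNS_STOCK_IN result as a sentence the LLM can read."""
    return (
        f"{row['managerName']} owns {row['shares']:,} shares of "
        f"{row['companyName']}."
    )


# Illustrative row, not real filing data.
rows = [
    {"managerName": "Royal Bank of Canada", "companyName": "NetApp", "shares": 1_200_000},
]

# One sentence per row, joined into a context block for the prompt.
context = "\n".join(holding_to_sentence(r) for r in rows)
print(context)  # Royal Bank of Canada owns 1,200,000 shares of NetApp.
```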
Which two relationships bridge the Form 10-K document data and the Form 13 investment data, and how is each created?
Show answer
Two relationship types bridge the datasets. FILED connects Company to Form, created by matching on the shared cusip6 — `MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f)`. OWNS_STOCK_IN connects Manager to Company, created per CSV row by matching both endpoints on their keys and MERGE-ing the edge with shares, value, and quarter — and because it is MERGE'd on `reportCalendarOrQuarter`, the same manager-company pair can hold one edge per reporting period. FILED is the document bridge (10-K data); OWNS_STOCK_IN is the investment bridge (Form 13 data). Common wrong answer: a single edge per manager-company pair, which would overwrite quarter-by-quarter history.
Write a Cypher query that finds the top 5 managers by total NetApp investment value and then returns each one’s name with the text of the filing’s business section (Item 1). What lets you compose the two steps?
Show answer
Compose with a WITH clause acting as a pipeline stage. First aggregate to get the top managers for the one company, identified here by a `$cusip6` parameter: `MATCH (mgr:Manager)-[owns:OWNS_STOCK_IN]->(com:Company {cusip6: $cusip6}) WITH mgr, com, sum(owns.value) AS total ORDER BY total DESC LIMIT 5`. Then continue from that company down to document content: `MATCH (com)-[:FILED]->(f:Form)-[s:SECTION {f10kItem: "item1"}]->(c:Chunk) RETURN mgr.managerName, total, c.text`. The WITH clause carries the top-5 result (and the company) into the second match, the way a SQL subquery would, but reads as a sequential narrative. Without the company filter the first stage would rank managers by their holdings across every company, not NetApp alone. Common wrong answer: trying to filter the aggregate in a WHERE before WITH; aggregation results must pass through WITH first.
Design the Company and Manager node schemas for the Form 13 data: which identifier keys each node, and why those choices?
Show answer
Two node types, each keyed by a stable identifier. Company: keyed by `cusip6` (with companyName, cusip), because CUSIP is the universal id that also appears on the Form nodes — making it the cross-dataset link. Manager: keyed by `managerCik` (with managerName, managerAddress), because the CIK uniquely identifies a filing firm. Both are created with MERGE on those keys for idempotency. The design point: choose the key that is both stable *and* shared with the data you intend to link to — that is why Company uses CUSIP (shared with forms) rather than companyName. Common wrong answer: keying Company on companyName, which is an unstable, non-unique string.
What does the OWNS_STOCK_IN relationship carry as properties?
Why
The holding’s facts live on the edge: shares, value, and reportCalendarOrQuarter (the quarter is part of its identity, since a manager can hold the same stock across periods). Names and addresses belong on the Manager node; the linking ids key the nodes, not the edge; and embeddings sit on Chunk nodes, not investment edges.
Show answer
Correct: Share count, monetary value, and the reporting quarter
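The role of the quarter in the edge's identity can be illustrated with a small in-memory sketch. This is not Neo4j code: a Python dict stands in for MERGE, and the `merge_holding` helper and the sample rows (ids, dates, amounts) are hypothetical.

```python
# In-memory stand-in for MERGE-ing OWNS_STOCK_IN edges (not Neo4j code).
# The edge identity is (manager, company, quarter); shares and value are
# ordinary properties stored on the edge.

def merge_holding(edges, row):
    """Create or update one edge, keyed the way the MERGE pattern is."""
    key = (row["managerCik"], row["cusip6"], row["reportCalendarOrQuarter"])
    edges[key] = {"shares": row["shares"], "value": row["value"]}
    return edges

# Hypothetical rows: one manager holding the same stock in two quarters.
rows = [
    {"managerCik": "1000001", "cusip6": "123456",
     "reportCalendarOrQuarter": "2023-03-31", "shares": 100, "value": 6000.0},
    {"managerCik": "1000001", "cusip6": "123456",
     "reportCalendarOrQuarter": "2023-06-30", "shares": 150, "value": 9500.0},
]

edges = {}
for row in rows + rows:  # load everything twice to show idempotency
    merge_holding(edges, row)

# Keying on the manager-company pair alone keeps only the last quarter loaded.
pair_only = {}
for row in rows:
    pair_only[(row["managerCik"], row["cusip6"])] = row["shares"]
```

Loading the rows twice still leaves two edges, one per quarter, while the pair-only keying collapses the history into a single entry.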
The investment-enhanced chain dramatically improved the answer to “Who are NetApp’s investors?” but not to “Tell me about NetApp.” Explain why, and state the design lesson.
Show answer
Because graph context helps only when the question needs it. "Who are NetApp's investors?" is answered by the OWNS_STOCK_IN data, so the investment chain returns specific names and share counts while the plain chain is vague. "Tell me about NetApp" is a general business question the investor data does not address, so the LLM ignores the extra context and both chains answer about the same — except the investment chain paid extra latency and tokens to fetch context that went unused. The lesson: expand context to match the expected question distribution; unmatched expansion is noise, not value. Common wrong answer: "more context is always better" — irrelevant context is dead weight the model discards.
An FAQ system is at 72% satisfaction (target 85%). Of its failures, 40% are product-mismatch and 35% are missing-comparison. If a graph schema cuts each of those modes by 60%, does it reach the target? Show the arithmetic.
Show answer
Work in shares of all answers. The failure rate is 100 − 72 = 28 percent. Mode 1 is 28 × 0.40 = 11.2 percent of answers; Mode 2 is 28 × 0.35 = 9.8 percent. Cutting each by 60 percent removes 0.6 × 11.2 = 6.72 and 0.6 × 9.8 = 5.88 points, so the new failure rate is 28 − 6.72 − 5.88 = 15.4 percent and satisfaction rises to 84.6 percent. That is 0.4 points short of the 85 percent target, so strictly it does not reach it, though it comes very close. The ceiling worth noting: only the 75 percent of failures from these two modes (21 points) is addressable this way; the remaining 25 percent (7 points, other causes) needs different fixes, so even eliminating both modes entirely tops out at 93 percent. Common wrong answer: applying the 60 percent reduction to the whole 28 percent failure rate (which gives 88.8 percent), overcounting the gain.
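The arithmetic can be checked with a few lines of Python. The variable names are illustrative; the figures come straight from the question.

```python
# All quantities are percentage points of all answers.
satisfaction = 72.0
failure_rate = 100.0 - satisfaction  # 28 points of failures

# Share of failures attributed to each addressable mode.
mode_shares = {"product_mismatch": 0.40, "missing_comparison": 0.35}
reduction = 0.60  # the graph schema cuts each mode by 60 percent

removed = sum(failure_rate * share * reduction for share in mode_shares.values())
new_satisfaction = satisfaction + removed

# Upper bound if both modes were eliminated completely.
ceiling = satisfaction + sum(failure_rate * share for share in mode_shares.values())

# The common wrong answer applies the cut to the whole failure rate.
overcounted = satisfaction + failure_rate * reduction

print(round(new_satisfaction, 1), round(ceiling, 1), round(overcounted, 1))
```

Comparing `new_satisfaction` with `overcounted` shows how much the shortcut inflates the estimate.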
A user asks “Which investment managers hold the company described in this filing, and how much?” Trace the end-to-end path through the schema, and say why pure vector search could not answer it.
Show answer
Trace it across both datasets. From a relevant chunk: PART_OF up to its Form; FILED backward from the form to the Company that filed it; OWNS_STOCK_IN backward from the company to each Manager, reading the holding's value and shares; optionally sort by value and take the top holders. So `(:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager)`, then a graph-to-text sentence per manager for the LLM. This is exactly the question vector search alone cannot answer: similarity finds *text about* NetApp, but only the relationships reach the *investors*, which live in a second dataset linked by CUSIP. Common wrong answer: expecting vector search to surface investors — they are not in the chunk text at all.
Neo4j offers three search paradigms in one engine. Which mapping is correct?
Why
Vector indexes match meaning (semantic similarity over embeddings), full-text indexes match spelling (exact/partial/fuzzy keyword search with relevance scores), and property indexes do exact lookup by a known identifier. The distractors swap these roles — e.g. vector does not do keyword matching, and full-text does not index embeddings.
Show answer
Correct: Vector for meaning, full-text for keywords, property for exact lookup
Why convert graph-traversal results into natural-language sentences before passing them to the LLM?
Why
An LLM consumes text, so a relationship triple (manager–owns–company) is turned into a readable sentence the model can reason over as context — that is how structured graph facts enter a RAG prompt. It is not about storage compression, encryption, or building another embedding index.
Show answer
Correct: So the LLM can read the relationship facts as context it reasons over
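A minimal sketch of the graph-to-text step, assuming each traversal result arrives as a dict. The `holding_to_sentence` helper, the row shape, and the manager names and figures are hypothetical; the sentence wording is one possible template, not the course's exact phrasing.

```python
def holding_to_sentence(row):
    """Render one manager-owns-company row as a plain English sentence."""
    return (
        f"{row['managerName']} owns {row['shares']:,} shares of "
        f"{row['companyName']} at a value of ${row['value']:,.0f}."
    )

# Hypothetical rows, shaped like the result of an OWNS_STOCK_IN traversal.
rows = [
    {"managerName": "Example Capital LLC", "companyName": "NetApp",
     "shares": 12000, "value": 750000.0},
    {"managerName": "Sample Advisors Inc", "companyName": "NetApp",
     "shares": 3500, "value": 218750.0},
]

# One sentence per row, joined into a block that can be placed in the prompt.
context = "\n".join(holding_to_sentence(r) for r in rows)
```

The resulting `context` string is what would sit alongside the retrieved chunk text in the prompt.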
Graph-RAG Chat
What properties does an Address node carry, and how does Neo4j store geospatial data?
Show answer
An Address node holds city, state, and a geospatial `location` property: a latitude/longitude point created with `point({latitude: x, longitude: y})`. Neo4j stores geospatial data as this native point type, which `point.distance()` consumes to compute distances in metres. Company and Manager nodes connect to Address nodes via LOCATED_AT, so the graph can answer "what's near X?" questions. Address nodes are themselves an instance of Extract-Enhance-Expand: extract from CSV, enhance with the point, expand with LOCATED_AT. Common wrong answer: storing latitude and longitude as two plain numbers; they belong in a single point value so `point.distance()` and a point index can use them.
What is the correct order of the GraphCypherQAChain pipeline?
Why
The chain runs question → LLM generates Cypher → Neo4j executes it → LLM synthesises a natural-language answer from the rows. Generation must precede execution; the chain uses LLM-generated Cypher (not a vector-search pipeline), and it does not produce embeddings.
Show answer
Correct: Question, generate Cypher, execute it, synthesise the answer
Describe the four steps of the GraphCypherQAChain pipeline, in order, and one thing it needs to generate reliable Cypher.
Show answer
Four steps. (1) Question — the user asks in natural language. (2) Generate — the LLM writes Cypher from the injected schema and few-shot examples. (3) Execute — Neo4j runs the generated query. (4) Answer — the LLM synthesises the returned rows into a natural-language reply. The chain (GraphCypherQAChain) needs the schema and a good few-shot prompt for reliable generation, and should validate the query before executing in production. Common wrong answer: putting execution before generation, or assuming it is vector retrieval — this pipeline generates and runs Cypher, it does not embed and search.
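The ordering of the four steps can be sketched in plain Python. This is not the LangChain implementation: `run_cypher_qa` and the three `fake_*` stand-ins are hypothetical, written so the sketch runs without an LLM or a database.

```python
from typing import Callable, Dict, List

def run_cypher_qa(
    question: str,
    generate_cypher: Callable[[str], str],
    execute: Callable[[str], List[Dict]],
    synthesise: Callable[[str, List[Dict]], str],
) -> Dict:
    """Question -> generate Cypher -> execute it -> synthesise an answer."""
    cypher = generate_cypher(question)    # step 2: the LLM writes the query
    rows = execute(cypher)                # step 3: the database runs it
    answer = synthesise(question, rows)   # step 4: the LLM words the reply
    return {"cypher": cypher, "rows": rows, "answer": answer}

# Hypothetical stand-ins for the LLM and the database.
def fake_generate(question: str) -> str:
    return "MATCH (m:Manager) RETURN m.managerName AS name LIMIT 1"

def fake_execute(cypher: str) -> List[Dict]:
    return [{"name": "Example Capital LLC"}]

def fake_synthesise(question: str, rows: List[Dict]) -> str:
    return f"One manager in the graph is {rows[0]['name']}."

result = run_cypher_qa("Name one manager.", fake_generate, fake_execute, fake_synthesise)
```

Returning the generated Cypher alongside the answer makes it easy to inspect the query, which is where a validation step would slot in before execution.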
To find entities within 25 km, what value do you compare point.distance() against?
Why
point.distance() returns metres, so 25 km is < 25000. The off-by-1000 bug, writing 25 (km) instead of 25000, is the most common geospatial mistake; the function never returns kilometres or centimetres.
Show answer
Correct: 25000, since the function returns metres
Describe the three phases of the Extract-Enhance-Expand pattern, with one course example of each.
Show answer
Three phases, repeated for every data source. Extract — pull source data into nodes (split text into Chunks; build Company, Manager, Address nodes from CSV). Enhance — add indexes or computed properties (vector embeddings on chunks, a full-text index on manager names, geospatial points on addresses). Expand — connect the new nodes into the existing graph with relationships (NEXT and PART_OF for chunks, FILED and OWNS_STOCK_IN for the investment data, LOCATED_AT for addresses). It is the repeatable framework the whole course used to grow the graph incrementally. Common wrong answer: collapsing Enhance into Extract — adding the index is a distinct phase, and it is what makes the nodes searchable.
Creating a full-text index on manager names is which phase of Extract-Enhance-Expand?
Why
Adding an index (or computed property) to nodes that already exist is the Enhance phase — like vector embeddings or geospatial points. Extract creates the nodes, Expand connects them with relationships, and validation isn’t one of the three phases.
Show answer
Correct: Enhance — it adds an index to existing nodes
Design a 3-example few-shot prompt that teaches an LLM to answer single-hop, multi-hop, and aggregation questions over a graph. What principle governs the choice of examples?
Show answer
Inject the schema, instruct "use only these types; do not hallucinate", then give three examples that each teach a *different* query shape. For a university graph: (1) single-hop with a filter — "Who teaches Machine Learning?" maps to matching a Professor TEACHES a Course with that title; (2) multi-hop — "What courses do Dr. Smith's students take?" maps to Professor ADVISES Student ENROLLED_IN Course; (3) aggregation — "Which departments have the most courses?" maps to Course BELONGS_TO Department with count and ORDER BY. The design rule: one example per capability (filter, traverse, aggregate), because the LLM generalises the *shape*, not the specific question. Common wrong answer: three near-identical filter examples — they teach one pattern and waste the few-shot budget.
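One way to assemble such a prompt, as a sketch. The schema text, the `build_prompt` helper, the prompt wording, and the property names (`name`, `title`) are assumptions for the university example rather than a fixed template.

```python
SCHEMA = """Node labels: Professor, Student, Course, Department
Relationship types:
(:Professor)-[:TEACHES]->(:Course)
(:Professor)-[:ADVISES]->(:Student)
(:Student)-[:ENROLLED_IN]->(:Course)
(:Course)-[:BELONGS_TO]->(:Department)"""

# One example per query shape: filter, traverse, aggregate.
EXAMPLES = [
    ("Who teaches Machine Learning?",
     'MATCH (p:Professor)-[:TEACHES]->(c:Course {title: "Machine Learning"}) '
     "RETURN p.name"),
    ("What courses do Dr. Smith's students take?",
     'MATCH (:Professor {name: "Dr. Smith"})-[:ADVISES]->(:Student)'
     "-[:ENROLLED_IN]->(c:Course) RETURN DISTINCT c.title"),
    ("Which departments have the most courses?",
     "MATCH (c:Course)-[:BELONGS_TO]->(d:Department) "
     "RETURN d.name, count(c) AS courses ORDER BY courses DESC"),
]

def build_prompt(question: str) -> str:
    """Assemble schema, instruction, examples, and the new question."""
    shots = "\n\n".join(f"# {q}\n{cypher}" for q, cypher in EXAMPLES)
    return (
        "Task: generate a Cypher query for the question.\n"
        f"Schema:\n{SCHEMA}\n"
        "Use only the relationship types and properties in the schema. "
        "Do not use anything that is not listed.\n\n"
        f"Examples:\n{shots}\n\n"
        f"# {question}\n"
    )

prompt = build_prompt("Which students does Dr. Smith advise?")
```

The new question deliberately differs from all three examples: the model is expected to reuse the traversal shape, not to look up a stored answer.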
Is it better to give a Cypher-generation prompt ten similar examples or three diverse ones? Explain the trade-off.
Show answer
Prefer a few diverse examples. Two or three that span *different* query shapes — a property filter, an aggregation, a multi-hop traversal — teach the LLM more than ten variations of one shape, because the model generalises patterns and redundant examples add no new pattern. Past two or three diverse examples, returns diminish quickly while prompt length (and cost) grows. So the selection rule is coverage of distinct shapes, not quantity. Common wrong answer: "more examples always help" — beyond covering the distinct shapes, extra similar examples mostly add tokens, not capability.
Why inject the graph schema and a “do not hallucinate” instruction into a Cypher-generation prompt?
Why
The schema plus the instruction constrain the LLM to real relationship types and properties, preventing schema hallucination — Cypher that looks valid but references things the graph doesn’t have. It doesn’t change execution speed, doesn’t remove the need for examples, and doesn’t bypass running the query.
Show answer
Correct: To stop the LLM inventing relationship types that aren't in the graph
Name the four retrieval paradigms the finished knowledge graph supports, with a question type suited to each.
Show answer
Four, each suited to a different question. Vector search — semantic similarity for conceptual questions ("tell me about cloud storage"). Full-text search — keyword/string lookup for entities ("find Royal Bank of Canada"). Graph traversal — relationship following for structural questions ("who are NetApp's investors?"). Geospatial search — distance with point.distance() for location questions ("companies within 10 km of San Jose"). No single paradigm answers every question; the finished graph's power is holding all four in one engine, and complex questions compose them. Common wrong answer: treating vector search as a catch-all — it finds meaning, not exact names, relationships, or distances.
Write a Cypher query that finds companies within 20 km of San Jose, sorted nearest-first. What unit trap must you avoid, and what index keeps it fast?
Show answer
Match the reference location, match candidate companies through LOCATED_AT, filter by distance, and sort. For example: `MATCH (sj:Address {city: "San Jose"}) MATCH (com:Company)-[:LOCATED_AT]->(a:Address) WHERE point.distance(sj.location, a.location) < 20000 RETURN com.companyName, point.distance(sj.location, a.location)/1000 AS km ORDER BY km ASC`. The threshold is 20000 because `point.distance()` returns metres, so 20 km must be written as 20000, not 20. For performance, a point index on the location property turns the scan into an index range query. Common wrong answer: writing `< 20`, which asks for entities within 20 metres and returns almost nothing.
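The unit trap is easy to reproduce outside the database. The sketch below uses the haversine formula as a rough stand-in for a metre-valued distance function; it is an approximation for illustration, and Neo4j's own result may differ slightly. The company names are hypothetical and the coordinates are approximate city centres.

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in metres between two points."""
    r = 6_371_000  # mean Earth radius in metres (assumed constant)
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

SAN_JOSE = (37.3382, -121.8863)

# Hypothetical company locations.
companies = {
    "Company in Santa Clara": (37.3541, -121.9552),
    "Company in Sunnyvale": (37.3688, -122.0363),
    "Company in San Francisco": (37.7749, -122.4194),
}

def within(radius_m):
    """Companies closer than radius_m metres, nearest first, with km."""
    hits = [(name, distance_m(*SAN_JOSE, *loc)) for name, loc in companies.items()]
    hits.sort(key=lambda hit: hit[1])
    return [(name, round(d / 1000, 1)) for name, d in hits if d < radius_m]

nearby = within(20_000)  # 20 km expressed in metres
too_tight = within(20)   # the unit bug: this asks for 20 metres
```

With the threshold in metres the two South Bay entries come back nearest-first; with `20` the list is empty, which mirrors the "returns almost nothing" symptom.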
What role does the LOCATED_AT relationship play, and how is it used in a query that finds managers near a company?
Show answer
LOCATED_AT connects a Company or Manager to its Address node, so any geographic question first traverses it to reach coordinates. To find managers near a company: resolve the company by full-text search or an exact match, traverse `(:Company)-[:LOCATED_AT]->(:Address)` for its location, traverse `(:Manager)-[:LOCATED_AT]->(:Address)` for candidate locations, then filter with `point.distance(companyAddr.location, mgrAddr.location) < radius`, with the radius in metres. Without LOCATED_AT the entities have no location to compare, so distance queries are impossible; it is the Expand step that makes geography first-class. Common wrong answer: putting coordinates directly on the Company node. Modelling Address as its own node lets multiple entities share a location and keeps the point in one indexed place.
“Find investment firms within 10 km of Apple’s headquarters.” Which retrieval approach does this question need?
Why
It composes paradigms: full-text (or exact lookup) to resolve “Apple” to its node and address, then a geospatial point.distance() filter for firms within 10 km. No single paradigm suffices: vector handles meaning not distance, traversal alone can’t compute proximity, and full-text alone can’t filter by radius.
Show answer
Correct: Full-text to find Apple, then geospatial distance to nearby firms
For each question, name the retrieval paradigm(s) and why: (a) “What are the risk factors for cloud-computing companies?” (b) “Find Goldman Sachs.” (c) “Which investors are within 50 miles of NetApp?”
Show answer
Decompose each into paradigm-matched steps. (a) "Risk factors for cloud-computing companies" — vector search for chunks about risk factors, then graph traversal (PART_OF to Form to Company) to confirm they belong to cloud companies. (b) "Find Goldman Sachs" — full-text search; it is an entity lookup by name, and vector would wrongly surface similar names like Morgan Stanley. (c) "Investors within 50 miles of NetApp" — full-text to find NetApp, LOCATED_AT to its address, geospatial point.distance under 80467 m for nearby managers, then OWNS_STOCK_IN to confirm they invest. The skill is reading a question as a sequence of retrieval steps. Common wrong answer: forcing one paradigm — most real questions combine two or more.
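The 80467 m threshold in (c) is a unit conversion. A small helper makes the conversion explicit so the radius is never typed by hand; the function and constant names are illustrative.

```python
METRES_PER_MILE = 1609.344  # international mile, exact by definition
METRES_PER_KM = 1000

def miles_to_metres(miles: float) -> float:
    """Convert a radius in miles to the metres a distance filter expects."""
    return miles * METRES_PER_MILE

def km_to_metres(km: float) -> float:
    """Convert a radius in kilometres to metres."""
    return km * METRES_PER_KM

radius_c = miles_to_metres(50)  # threshold for question (c)
```

Passing `radius_c` as a query parameter, rather than embedding a literal, keeps the unit handling in one place.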
An LLM generates Cypher referencing a relationship type that doesn’t exist in your graph. Which fix most directly addresses this?
Why
Schema hallucination is fixed by grounding the model in the real schema and checking its output: inject the actual node/relationship types, instruct “use only these”, and validate before executing. Raising temperature makes invention more likely, removing the schema removes the guardrail, and the search paradigm is unrelated to the generation step.
Show answer
Correct: Inject the real schema and validate the query before running it
Why does adding one “What does company X do?” example to the few-shot prompt unlock document navigation for many companies, not just X?
Show answer
Because the example demonstrates a reusable query *shape*, not a one-off answer. "What does X do?" shows full-text find the company, then traverse FILED to its Form and SECTION to the Item-1 chunk, and return that text. The LLM generalises this shape to any company name, so the single example unlocks document navigation broadly. That is progressive few-shot: each added example teaches one new capability (city filter, then geospatial distance, then document navigation) the model applies to unseen questions. Common wrong answer: assuming it only answers about that one company — the model learns the pattern, not the instance.
What is schema hallucination in LLM-generated Cypher, how would you detect it, and how do you prevent it?
Show answer
Schema hallucination is when an LLM generates Cypher that references relationship types or properties the graph does not have — syntactically valid, plausible-looking, but invalid against the real schema (so it errors or returns nothing). Catch it by reading the generated query against the schema before trusting the answer; with verbose mode on, inspect the Cypher the chain produced. Prevent it by injecting the actual schema, instructing "use only these types; do not hallucinate", giving few-shot examples that use the correct relationships, and validating the query before executing in production. Common wrong answer: "raise the temperature for better queries" — higher temperature increases invention, making hallucination worse.
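A pre-execution check can be as simple as comparing the relationship types in the generated query against the real schema. The sketch below is a regex heuristic, not a Cypher parser: the `unknown_relationship_types` helper is hypothetical, the allowed set lists the relationship types named in this guide, and `INVESTS_IN` is an invented type used to trigger the check.

```python
import re

ALLOWED_REL_TYPES = {"FILED", "OWNS_STOCK_IN", "SECTION", "PART_OF", "NEXT", "LOCATED_AT"}

# Captures the type names in patterns such as -[r:OWNS_STOCK_IN]-> or
# -[:FILED|SECTION]->. Node labels sit in parentheses and are not matched.
REL_PATTERN = re.compile(r"\[\s*\w*\s*:\s*(\w+(?:\s*\|\s*:?\s*\w+)*)")

def unknown_relationship_types(cypher: str) -> set:
    """Return relationship types used in the query but absent from the schema."""
    used = set()
    for match in REL_PATTERN.finditer(cypher):
        for name in re.split(r"[|:\s]+", match.group(1)):
            if name:
                used.add(name)
    return used - ALLOWED_REL_TYPES

good = ("MATCH (m:Manager)-[o:OWNS_STOCK_IN]->(c:Company)-[:FILED]->(f:Form) "
        "RETURN m.managerName")
bad = "MATCH (m:Manager)-[:INVESTS_IN]->(c:Company) RETURN m.managerName"

good_issues = unknown_relationship_types(good)
bad_issues = unknown_relationship_types(bad)
```

A non-empty result is a reason to reject or regenerate the query instead of executing it. A production check would also cover node labels and property names, which this heuristic ignores.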