Chatting with the Knowledge Graph
The capstone: the Extract-Enhance-Expand pattern for growing graphs, Address nodes with geospatial points and point.distance() search, teaching an LLM to generate Cypher with few-shot prompts, the GraphCypherQAChain for end-to-end natural-language querying, schema-hallucination guards, and the four retrieval paradigms (vector, full-text, traversal, geospatial) the finished graph unites.
The pattern behind the whole course: Extract-Enhance-Expand
Extract-Enhance-Expand
KGR-7.1 Every module repeated one three-phase move:
- Extract — pull source data into nodes (text → Chunks; CSV → Company / Manager / Address nodes).
- Enhance — add indexes or computed properties (vector embeddings, full-text indexes, geospatial points).
- Expand — connect the new nodes to the existing graph (NEXT, PART_OF, FILED, OWNS_STOCK_IN, LOCATED_AT).
It’s a repeatable framework for growing any graph incrementally. When it breaks: Expand assumes the new data links to existing nodes by a shared identifier — when identifiers are absent or ambiguous (name-only matching), Expand needs an entity-resolution step the pattern doesn’t itself provide. [V] Verified
| Module | Extract | Enhance | Expand |
| --- | --- | --- | --- |
| M4 Documents | Chunk nodes | vector embeddings | (none) |
| M5 Structure | Form nodes | section index | NEXT, PART_OF, SECTION |
| M6 Investments | Company, Manager | full-text index | FILED, OWNS_STOCK_IN |
| M7 Geography | Address nodes | geospatial points | LOCATED_AT |
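The three phases, and the failure case where Expand finds no shared identifier, can be sketched on a toy in-memory graph. This is an illustration only, not the course's loading code: the `Graph` class, the `address_id` and `company_id` fields, the company name, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A toy in-memory graph: nodes keyed by id, relationships as triples."""
    nodes: dict = field(default_factory=dict)
    rels: list = field(default_factory=list)


def extract(graph, rows):
    """Extract: turn each source row into an Address node."""
    for row in rows:
        graph.nodes[row["address_id"]] = {"label": "Address", **row}


def enhance(graph):
    """Enhance: add a computed property (a lat/long point) to each Address."""
    for node in graph.nodes.values():
        if node["label"] == "Address":
            node["location"] = (node["latitude"], node["longitude"])


def expand(graph):
    """Expand: link each Address to an existing Company by shared identifier.

    Returns the address ids that could not be linked; those are the rows
    that would need a separate entity-resolution step.
    """
    unmatched = []
    for node_id, node in list(graph.nodes.items()):
        if node["label"] != "Address":
            continue
        company_id = node.get("company_id")
        if company_id in graph.nodes:
            graph.rels.append((company_id, "LOCATED_AT", node_id))
        else:
            unmatched.append(node_id)
    return unmatched


# Hypothetical data: one existing company, two incoming address rows.
graph = Graph(nodes={"c1": {"label": "Company", "companyName": "Example Corp"}})
rows = [
    {"address_id": "a1", "company_id": "c1", "city": "San Jose",
     "latitude": 37.33, "longitude": -121.89},
    {"address_id": "a2", "company_id": None, "city": "Santa Clara",
     "latitude": 37.35, "longitude": -121.95},
]
extract(graph, rows)
enhance(graph)
unmatched = expand(graph)
```

The second row has no identifier, so `expand` reports it instead of guessing a match. That is the gap the pattern leaves open.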
Name the three phases of Extract-Enhance-Expand and give one example of each from the course. KGR-7.1
Extract — create nodes from source data (e.g. split filing text into Chunk nodes, or build Manager nodes from the Form 13 CSV). Enhance — add indexes or computed properties (vector embeddings on chunks, a full-text index on manager names, geospatial points on addresses). Expand — connect the new nodes into the existing graph with relationships (NEXT/PART_OF for chunks, FILED/OWNS_STOCK_IN for the investment data, LOCATED_AT for addresses).
A fourth paradigm: geospatial search
The graph comes pre-expanded with Address nodes, linked to companies and managers via LOCATED_AT, each holding a geospatial point (latitude/longitude). That unlocks distance queries with point.distance():
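// Anchor on an Address in Santa Clara, then keep companies within 10 km of it
MATCH (sc:Address {city: "Santa Clara"})
MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)
WHERE point.distance(sc.location, comAddr.location) < 10000  // threshold in metres
RETURN com.companyName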
Write a query to find companies within 20 km of San Jose, and say what unit point.distance returns. KGR-7.2
MATCH (sj:Address {city: "San Jose"})
MATCH (com:Company)-[:LOCATED_AT]->(a:Address)
WHERE point.distance(sj.location, a.location) < 20000
RETURN com.companyName
For latitude/longitude (WGS-84) points, point.distance() returns metres, so 20 km is < 20000 (not 20). A point index on the location property keeps the range query fast.
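To make the metres-versus-kilometres point concrete, here is a minimal Python sketch of a haversine distance, which is how Neo4j documents point.distance() for 2D WGS-84 points. The Earth radius is an assumed mean value and may not match Neo4j's constant, so results can differ slightly from the database. The coordinates are approximate city centres used only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius; Neo4j's constant may differ


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_km(origin, other, km):
    """Mirror `point.distance(a, b) < km * 1000`: the comparison is in metres."""
    return haversine_m(*origin, *other) < km * 1000


# Approximate (latitude, longitude) pairs, for illustration only.
san_jose = (37.3382, -121.8863)
santa_clara = (37.3541, -121.9552)

correct = within_km(san_jose, santa_clara, 20)          # threshold of 20000 m
mistaken = haversine_m(*san_jose, *santa_clara) < 20    # threshold of 20 m
```

With these inputs `correct` is true and `mistaken` is false: writing `< 20` asks for points within 20 metres, which silently returns almost nothing.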
Teaching an LLM to write Cypher
The capstone capability: instead of hand-writing Cypher, give the LLM the schema plus a few examples and let it translate plain English into queries — a few-shot prompt turns the model into a “Cypher compiler”.
Task: Generate Cypher to query a graph database.
- Use ONLY the provided relationship types and properties.
- Do not hallucinate relationship types or properties.
Schema: {schema}
# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)
WHERE a.city = "San Francisco" RETURN mgr.managerName
Question: {question}
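That prompt drives a GraphCypherQAChain, which runs the full loop end to end: the plain-English question becomes a generated Cypher query, the query is executed against the graph, and the results come back as a natural-language answer.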
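The shape of that loop can be sketched in plain Python. This is not GraphCypherQAChain's implementation: `generate`, `run_query`, and `summarize` are hypothetical callables standing in for the Cypher-writing LLM call, the database call, and the answer-writing LLM call, and the stub return values are made up.

```python
CYPHER_PROMPT = """Task: Generate Cypher to query a graph database.
- Use ONLY the provided relationship types and properties.
- Do not hallucinate relationship types or properties.
Schema: {schema}
# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)
WHERE a.city = "San Francisco" RETURN mgr.managerName
Question: {question}"""


def answer_question(question, schema, generate, run_query, summarize):
    """Run the three-step loop: generate Cypher, execute it, phrase the answer."""
    prompt = CYPHER_PROMPT.format(schema=schema, question=question)
    cypher = generate(prompt)             # step 1: question -> Cypher
    rows = run_query(cypher)              # step 2: Cypher -> result rows
    return summarize(question, rows), cypher  # step 3: rows -> prose answer


# Stubs so the sketch runs without an LLM or a database (all hypothetical).
def fake_generate(prompt):
    return ('MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address) '
            'WHERE a.city = "Palo Alto" RETURN mgr.managerName')


def fake_run_query(cypher):
    return [{"mgr.managerName": "Example Capital"}]


def fake_summarize(question, rows):
    names = ", ".join(row["mgr.managerName"] for row in rows)
    return f"Matching firms: {names}"


answer, cypher = answer_question(
    "What investment firms are in Palo Alto?",
    "(:Manager)-[:LOCATED_AT]->(:Address)",
    fake_generate, fake_run_query, fake_summarize,
)
```

One practical caveat with `str.format` templates: a few-shot example containing a literal Cypher map such as `{city: "San Jose"}` would need its braces doubled, otherwise formatting fails.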
Progressive few-shot: diversity beats quantity
KGR-7.5 Each progressively added example unlocks a new query shape (city filter, then point.distance geospatial, then full-text + SECTION document navigation), and the LLM generalises each to unseen questions. Two or three diverse examples (filter, aggregate, traverse) cover far more than ten similar ones; past that, returns diminish fast. Choose for pattern coverage, not count. [V] Verified
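One way to act on "coverage, not count" is to tag each candidate example with its query shape and keep one per shape. The sketch below is an assumption about how such a selection could look, not the course's code. The `shape` tags and the Menlo Park question are hypothetical, and the document-navigation shape is left out because it depends on index names this page does not give.

```python
EXAMPLES = [
    {"shape": "city filter",
     "question": "What investment firms are in San Francisco?",
     "cypher": 'MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)\n'
               'WHERE a.city = "San Francisco" RETURN mgr.managerName'},
    {"shape": "city filter",  # same shape as above, so it adds little
     "question": "What investment firms are in Menlo Park?",
     "cypher": 'MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)\n'
               'WHERE a.city = "Menlo Park" RETURN mgr.managerName'},
    {"shape": "geospatial",
     "question": "What companies are near Santa Clara?",
     "cypher": 'MATCH (sc:Address {city: "Santa Clara"})\n'
               'MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)\n'
               'WHERE point.distance(sc.location, comAddr.location) < 10000\n'
               'RETURN com.companyName'},
]


def pick_diverse(examples):
    """Keep the first example of each query shape, preserving order."""
    seen = set()
    picked = []
    for example in examples:
        if example["shape"] not in seen:
            seen.add(example["shape"])
            picked.append(example)
    return picked


def render_examples(examples):
    """Format examples in the '# question' + Cypher style of the prompt."""
    return "\n".join(f"# {ex['question']}\n{ex['cypher']}" for ex in examples)


picked = pick_diverse(EXAMPLES)
few_shot_block = render_examples(picked)
```

Here three candidates collapse to two prompt examples, one per shape, which keeps the prompt short while still demonstrating both patterns.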
Why does adding one 'What does company X do?' example unlock document navigation for many other companies? KGR-7.5
The example demonstrates a query shape — full-text find the company, then traverse FILED → SECTION → Chunk to its business section — not a single hard-coded answer. The LLM generalises the pattern, so it can apply the same full-text-then-traverse structure to any company name. That’s progressive few-shot: each diverse example teaches one new capability the model reuses.
What is schema hallucination in LLM-generated Cypher, and how do you prevent it? KGR-7.6
Schema hallucination is when the LLM generates Cypher referencing relationship types or properties that don’t exist in the graph — plausible-looking but invalid. Prevent it by injecting the real schema into the prompt, instructing the model to use only those types and “do not hallucinate”, supplying few-shot examples that use the correct relationships, and validating the generated query before running it in production.
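The "validate before running" step can be as simple as checking every relationship type in the generated query against the known schema. The following is a rough sketch under stated assumptions: the allowed set is the relationship list named in this guide, and the regular expression only handles common forms such as `[:TYPE]`, `[r:TYPE]`, and `[:A|B]`. It is not a Cypher parser and can be fooled by brackets inside string literals.

```python
import re

# Relationship types named in this guide; a real check would read them
# from the live schema instead of hard-coding them.
ALLOWED_RELATIONSHIPS = {
    "NEXT", "PART_OF", "SECTION", "FILED", "OWNS_STOCK_IN", "LOCATED_AT",
}

# Matches "[", an optional variable name, ":", then one or more type names
# separated by "|" (each optionally prefixed with ":").
REL_PATTERN = re.compile(
    r"\[\s*\w*\s*:\s*([A-Za-z_]\w*(?:\s*\|\s*:?\s*[A-Za-z_]\w*)*)"
)


def unknown_relationships(cypher, allowed=ALLOWED_RELATIONSHIPS):
    """Return relationship types used in `cypher` that are not in `allowed`."""
    found = set()
    for match in REL_PATTERN.finditer(cypher):
        for name in match.group(1).split("|"):
            found.add(name.strip().lstrip(":").strip())
    return sorted(found - set(allowed))


valid = 'MATCH (m:Manager)-[:LOCATED_AT]->(a:Address) RETURN m.managerName'
invented = 'MATCH (c:Company)-[:HEADQUARTERED_IN]->(a:Address) RETURN c.companyName'

problems_valid = unknown_relationships(valid)
problems_invented = unknown_relationships(invented)
```

An empty result means every relationship type is known. A non-empty one is a reason to reject or regenerate the query rather than execute it. The same idea extends to property names, which this sketch does not check.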
Four retrieval paradigms, one engine
The finished graph unites four paradigms
KGR-7.7 No single retrieval method answers every question. The graph’s power is holding all four in one query engine:
- Vector — semantic similarity for conceptual questions (“about cloud storage”).
- Full-text — keyword/string lookup for entities (“find Royal Bank of Canada”).
- Graph traversal — relationship following for structure (“who invests in NetApp?”).
- Geospatial — distance for location (“companies within 10 km of San Jose”).
(The four listed here are retrieval methods. Module 6’s “three search paradigms” counted index types instead: vector, full-text, and property. Exact property lookup is folded into this list rather than counted separately, and traversal and geospatial are the new additions.) Complex questions compose them: “investors within 50 miles of NetApp” needs full-text + geospatial + traversal at once, with the 50 miles converted to roughly 80,467 metres for point.distance(). [V] Verified
| Paradigm | Method | Example question |
| --- | --- | --- |
| Vector | db.index.vector.queryNodes | “Tell me about cloud storage” |
| Full-text | db.index.fulltext.queryNodes | “Find Palo Alto Networks” |
| Traversal | MATCH path patterns | “Who invests in NetApp?” |
| Geospatial | point.distance() | “Companies within 10 km of San Jose” |
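The composed question mentioned above ("investors within 50 miles of NetApp") could be expressed as one parameterised query built from Python. This is a sketch under assumptions, not a query from the course: the full-text index name `companyNameIndex` is hypothetical, and the direction of OWNS_STOCK_IN (Manager to Company) is assumed.

```python
METRES_PER_MILE = 1609.344  # international mile, exact by definition


def miles_to_metres(miles):
    """point.distance() compares in metres, so convert before querying."""
    return miles * METRES_PER_MILE


COMPOSED_QUERY = """
// Full-text: resolve the company name to its best-matching node
CALL db.index.fulltext.queryNodes($index_name, $company_name)
YIELD node, score
WITH node AS com, score
ORDER BY score DESC LIMIT 1
// Traversal: the company's address, its investors, and their addresses
MATCH (com)-[:LOCATED_AT]->(comAddr:Address)
MATCH (mgr:Manager)-[:OWNS_STOCK_IN]->(com)
MATCH (mgr)-[:LOCATED_AT]->(mgrAddr:Address)
// Geospatial: keep investors within the distance threshold
WHERE point.distance(comAddr.location, mgrAddr.location) < $max_metres
RETURN mgr.managerName AS manager
"""


def build_params(company_name, miles, index_name="companyNameIndex"):
    """Bundle query parameters; `index_name` is a hypothetical index name."""
    return {
        "index_name": index_name,
        "company_name": company_name,
        "max_metres": miles_to_metres(miles),
    }


params = build_params("NetApp", 50)
```

Passing the name and threshold as parameters, rather than pasting them into the query text, keeps the query reusable for other companies and distances.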
Name the four retrieval paradigms and give one question type suited to each. KGR-7.7
Vector — conceptual/semantic questions (“what is NetApp’s business?”). Full-text — entity lookup by name (“find Royal Bank of Canada”). Graph traversal — relationship/structural questions (“who are NetApp’s investors?”). Geospatial — location/distance questions (“companies within 10 km of San Jose”). Most real questions combine two or more.
Summary — and the course in one arc
This module added geospatial search (point.distance(), in metres, over Address
nodes) and LLM-generated Cypher: a few-shot prompt plus the
GraphCypherQAChain turn a plain-English question into a generated query, an
execution, and a natural-language answer — guarded against schema hallucination
by injecting the real schema. The finished graph unites four retrieval paradigms
— vector, full-text, traversal, geospatial — a toolkit far richer than similarity
search alone, all grown by the same Extract-Enhance-Expand move.
Across seven chapters the guide built one system: graph fundamentals (M1) → Cypher (M2) → vector search in the graph (M3) → constructing a graph from documents (M4) → relationships and chunk windows (M5) → expanding with a second dataset (M6) → conversational, multi-paradigm querying (M7). The throughline: retrieval by connection, not just similarity.
- This is the retrieval guide of three — see also Fine-tuning & RL (training a model) and Evaluating AI Agents (proving a system works).