Part 1 Chapter 7 Last verified 2026-06-19

Chatting with the Knowledge Graph

The capstone: the Extract-Enhance-Expand pattern for growing graphs, Address nodes with geospatial points and point.distance() search, teaching an LLM to generate Cypher with few-shot prompts, the GraphCypherQAChain for end-to-end natural-language querying, schema-hallucination guards, and the four retrieval paradigms (vector, full-text, traversal, geospatial) the finished graph unites.

On this page

The pattern behind the whole course: Extract-Enhance-Expand
A fourth paradigm: geospatial search
Teaching an LLM to write Cypher
Four retrieval paradigms, one engine
Summary — and the course in one arc

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Teaching an LLM to write Cypher; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

You filter on point.distance() for “within 10 km”. Predict the value you compare against (10, 10000, or 0.1) and the unit.
You ask an LLM to generate Cypher but give it no schema. Predict the most likely failure in its output.
Few-shot prompting: you add a third example, “What does company X do?”. Predict the new capability that one example unlocks.
“Which investors are within 50 miles of NetApp?” Predict how many of the four retrieval paradigms this single question needs.

Check your answers

10000, in metres — point.distance() returns metres, so 10 km is < 10000. Using 10 (km) is the classic off-by-1000 bug.
Schema hallucination — it invents plausible but non-existent relationship types or properties. Inject the real schema and instruct “do not hallucinate”.
Document navigation — full-text find the company, then traverse FILED → SECTION → Chunk to its filing text. One example teaches a whole new query shape.
Three — full-text (find NetApp), geospatial (point.distance within 80,467 m), and graph traversal (OWNS_STOCK_IN to confirm they’re investors). Complex questions compose paradigms.

The pattern behind the whole course: Extract-Enhance-Expand

Key concept

Extract-Enhance-Expand

KGR-7.1

Every module repeated one three-phase move:

Extract — pull source data into nodes (text → Chunks; CSV → Company / Manager / Address nodes).
Enhance — add indexes or computed properties (vector embeddings, full-text indexes, geospatial points).
Expand — connect the new nodes to the existing graph (NEXT, PART_OF, FILED, OWNS_STOCK_IN, LOCATED_AT).

It’s a repeatable framework for growing any graph incrementally. When it breaks: Expand assumes the new data links to existing nodes by a shared identifier — when identifiers are absent or ambiguous (name-only matching), Expand needs an entity-resolution step the pattern doesn’t itself provide. [V] Verified

| Module | Extract | Enhance | Expand |
| --- | --- | --- | --- |
| M4 Documents | Chunk nodes | vector embeddings | — |
| M5 Structure | Form nodes | section index | NEXT, PART_OF, SECTION |
| M6 Investments | Company, Manager | full-text index | FILED, OWNS_STOCK_IN |
| M7 Geography | Address nodes | geospatial points | LOCATED_AT |
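The three phases, and the failure mode when identifiers are missing, can be sketched in plain Python. This is a minimal illustration, not the course's loading code: the CSV columns, the identifiers, and the in-memory dictionaries standing in for graph nodes are all assumptions made for the example.

```python
import csv
import io

# Hypothetical CSV standing in for a second dataset; column names are assumptions.
CSV_TEXT = """manager_id,manager_name,company_id
m1,Alpha Capital,c1
m2,Beta Partners,c2
m3,Gamma Fund,
"""

# Existing graph: Company nodes keyed by a shared identifier (hypothetical).
companies = {"c1": {"name": "NetApp"}}


def extract(csv_text):
    """Extract: turn each source row into a node (a plain dict here)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def enhance(nodes):
    """Enhance: add a computed property (a lowercased name for lookup)."""
    for node in nodes:
        node["search_key"] = node["manager_name"].lower()
    return nodes


def expand(nodes, existing):
    """Expand: link new nodes to existing ones by shared identifier.

    Rows whose identifier is missing or unknown are returned separately,
    because they would need an entity-resolution step first.
    """
    edges, unresolved = [], []
    for node in nodes:
        target = node["company_id"]
        if target in existing:
            edges.append((node["manager_id"], "OWNS_STOCK_IN", target))
        else:
            unresolved.append(node["manager_id"])
    return edges, unresolved


managers = enhance(extract(CSV_TEXT))
edges, unresolved = expand(managers, companies)
print(edges)
print(unresolved)
```

Only the first row links cleanly; the other two land in `unresolved`, which is where name matching or another entity-resolution step would have to take over.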

Name the three phases of Extract-Enhance-Expand and give one example of each from the course. KGR-7.1

Extract — create nodes from source data (e.g. split filing text into Chunk nodes, or build Manager nodes from the Form 13 CSV). Enhance — add indexes or computed properties (vector embeddings on chunks, a full-text index on manager names, geospatial points on addresses). Expand — connect the new nodes into the existing graph with relationships (NEXT/PART_OF for chunks, FILED/OWNS_STOCK_IN for the investment data, LOCATED_AT for addresses).

A fourth paradigm: geospatial search

The graph is pre-expanded with Address nodes (via LOCATED_AT), each holding a geospatial point — latitude/longitude. That unlocks distance queries with point.distance():

MATCH (sc:Address {city: "Santa Clara"})
MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)
WHERE point.distance(sc.location, comAddr.location) < 10000
RETURN com.companyName
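Because the threshold must be in metres, it can help to convert at the call site rather than hard-code a number. The helper names below are hypothetical; the sketch only shows the arithmetic behind thresholds such as 10000 (10 km) and 80467 (50 miles).

```python
METRES_PER_KM = 1000.0
METRES_PER_MILE = 1609.344  # international mile, exact by definition


def km_to_m(km):
    """Convert kilometres to metres."""
    return km * METRES_PER_KM


def miles_to_m(miles):
    """Convert miles to metres."""
    return miles * METRES_PER_MILE


radius_10_km = km_to_m(10)
radius_50_mi = miles_to_m(50)
print(radius_10_km)         # threshold for "within 10 km"
print(round(radius_50_mi))  # threshold for "within 50 miles", rounded
```

Passing the converted value into the query as a parameter keeps the unit conversion in one place.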

Geospatial proximity query Worked example

Problem. A real-estate graph has Property and School nodes with geospatial locations. Find schools within 2 km of a property at (37.3861, -122.0839), closest first. What index makes it fast?

Reasoning. Build the property’s point, match schools, filter by point.distance(...) < 2000 (2 km = 2000 m — not 2), and sort ascending by distance:

WITH point({latitude: 37.3861, longitude: -122.0839}) AS prop
MATCH (s:School)
WHERE point.distance(prop, s.location) < 2000
RETURN s.name, point.distance(prop, s.location)/1000 AS distanceKm
ORDER BY distanceKm ASC

Performance needs a point index: CREATE POINT INDEX ... FOR (s:School) ON (s.location) — without it Neo4j scans every school and computes distance for each.

Answer. Threshold 2000 m (the metres bug is using 2), ORDER BY distance ASC for nearest-first, and a point index to avoid the O(n) scan.
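The same filter-and-sort logic can be mirrored outside the database to check the unit handling. This sketch uses the haversine great-circle formula with a mean Earth radius, which is an approximation chosen for illustration and not a claim about Neo4j's internal arithmetic; the school names and coordinates are invented.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def schools_within(prop, schools, radius_m):
    """Return (name, distance_km) for schools inside radius_m, nearest first."""
    hits = []
    for name, lat, lon in schools:
        d = haversine_m(prop[0], prop[1], lat, lon)
        if d < radius_m:
            hits.append((name, d / 1000))
    return sorted(hits, key=lambda hit: hit[1])


PROPERTY = (37.3861, -122.0839)
SCHOOLS = [  # hypothetical schools for illustration
    ("Far School", 37.4500, -122.0839),
    ("Near School", 37.3900, -122.0839),
    ("Mid School", 37.3861, -122.1000),
]

nearby = schools_within(PROPERTY, SCHOOLS, 2000)   # 2 km expressed in metres
too_strict = schools_within(PROPERTY, SCHOOLS, 2)  # the metres bug: 2 means 2 m
print([name for name, _ in nearby])
print(too_strict)
```

With the threshold mistakenly set to 2, nothing is returned, which is how the metres bug usually shows up: an empty result rather than an error.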

Write a query to find companies within 20 km of San Jose, and say what unit point.distance returns. KGR-7.2

MATCH (sj:Address {city: "San Jose"})
MATCH (com:Company)-[:LOCATED_AT]->(a:Address)
WHERE point.distance(sj.location, a.location) < 20000
RETURN com.companyName

point.distance() returns metres, so 20 km is < 20000 (not 20). A point index on the location property keeps the range query fast.

Teaching an LLM to write Cypher

The capstone capability: instead of hand-writing Cypher, give the LLM the schema plus a few examples and let it translate plain English into queries — a few-shot prompt turns the model into a “Cypher compiler”.

Task: Generate Cypher to query a graph database.
- Use ONLY the provided relationship types and properties.
- Do not hallucinate relationship types or properties.
Schema: {schema}

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)
WHERE a.city = "San Francisco" RETURN mgr.managerName

Question: {question}

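That prompt drives a GraphCypherQAChain, which runs the full loop end to end: it generates Cypher from the question, executes the query against the graph, and turns the results into a natural-language answer.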
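The shape of that loop can be sketched without any library. The code below is not the GraphCypherQAChain API: `answer_question` and the three `fake_*` callables are hypothetical stand-ins for the LLM call, the database call, and the answer-writing step, so the control flow can be read and run on its own.

```python
CYPHER_PROMPT = """Task: Generate Cypher to query a graph database.
- Use ONLY the provided relationship types and properties.
- Do not hallucinate relationship types or properties.
Schema: {schema}

# What investment firms are in San Francisco?
MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)
WHERE a.city = "San Francisco" RETURN mgr.managerName

Question: {question}"""


def answer_question(question, schema, generate_cypher, run_query, summarise):
    """Generate Cypher, execute it, then phrase the rows as an answer."""
    prompt = CYPHER_PROMPT.format(schema=schema, question=question)
    cypher = generate_cypher(prompt)   # step 1: LLM writes the query
    rows = run_query(cypher)           # step 2: the graph executes it
    return summarise(question, rows)   # step 3: rows become prose


# Hypothetical stand-ins so the sketch runs without an LLM or a database.
def fake_llm(prompt):
    return ('MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address) '
            'WHERE a.city = "San Jose" RETURN mgr.managerName')


def fake_db(cypher):
    return [{"mgr.managerName": "Example Capital"}]


def fake_summary(question, rows):
    names = ", ".join(row["mgr.managerName"] for row in rows)
    return f"Found {len(rows)} result(s): {names}"


answer = answer_question(
    "What investment firms are in San Jose?",
    "(:Manager)-[:LOCATED_AT]->(:Address)",  # simplified schema string
    fake_llm, fake_db, fake_summary,
)
print(answer)
```

Keeping the three steps as separate callables makes it easy to insert a validation step between generation and execution.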

Key concept

Progressive few-shot: diversity beats quantity

KGR-7.5

Each progressively added example unlocks a new query shape — city filter, then point.distance geospatial, then full-text + SECTION document navigation — and the LLM generalises each to unseen questions. Two or three diverse examples (filter, aggregate, traverse) cover far more than ten similar ones; past that, returns diminish fast. Choose for pattern coverage, not count. [V] Verified
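One way to keep an eye on pattern coverage is to tag each example with the query shape it teaches. The structure below is an assumption made for illustration; the two examples reuse queries shown earlier in this chapter, and a third, document-navigation example would be appended the same way.

```python
FEW_SHOT = [
    {
        "shape": "city filter",
        "question": "What investment firms are in San Francisco?",
        "cypher": (
            "MATCH (mgr:Manager)-[:LOCATED_AT]->(a:Address)\n"
            'WHERE a.city = "San Francisco" RETURN mgr.managerName'
        ),
    },
    {
        "shape": "geospatial",
        "question": "What companies are within 10 km of Santa Clara?",
        "cypher": (
            'MATCH (sc:Address {city: "Santa Clara"})\n'
            "MATCH (com:Company)-[:LOCATED_AT]->(comAddr:Address)\n"
            "WHERE point.distance(sc.location, comAddr.location) < 10000\n"
            "RETURN com.companyName"
        ),
    },
]


def build_examples(examples):
    """Render examples in the prompt's '# question' + query layout."""
    return "\n\n".join(f"# {ex['question']}\n{ex['cypher']}" for ex in examples)


def shapes_covered(examples):
    """List the distinct query shapes, in the order they were added."""
    seen = []
    for ex in examples:
        if ex["shape"] not in seen:
            seen.append(ex["shape"])
    return seen


examples_block = build_examples(FEW_SHOT)
print(shapes_covered(FEW_SHOT))
print(examples_block)
```

If the rendered block is later placed inside a format-style template, literal braces such as `{city: ...}` need escaping so they are not mistaken for placeholders.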

Why does adding one 'What does company X do?' example unlock document navigation for many other companies? KGR-7.5

The example demonstrates a query shape — full-text find the company, then traverse FILED → SECTION → Chunk to its business section — not a single hard-coded answer. The LLM generalises the pattern, so it can apply the same full-text-then-traverse structure to any company name. That’s progressive few-shot: each diverse example teaches one new capability the model reuses.

What is schema hallucination in LLM-generated Cypher, and how do you prevent it? KGR-7.6

Schema hallucination is when the LLM generates Cypher referencing relationship types or properties that don’t exist in the graph — plausible-looking but invalid. Prevent it by injecting the real schema into the prompt, instructing the model to use only those types and “do not hallucinate”, supplying few-shot examples that use the correct relationships, and validating the generated query before running it in production.
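The "validate before running" step can start as a simple allow-list check on relationship types. This is a regex-based sketch, not a Cypher parser: it would be confused by brackets inside string literals, and it checks only relationship types, not properties. The allowed set is taken from the relationships named in this chapter; the function names are hypothetical.

```python
import re

# Relationship types named in this chapter's graph.
ALLOWED_RELATIONSHIPS = {
    "NEXT", "PART_OF", "SECTION", "FILED", "OWNS_STOCK_IN", "LOCATED_AT",
}


def relationship_types(cypher):
    """Collect relationship type names from [...] patterns in a query."""
    types = set()
    for inner in re.findall(r"\[([^\]]*)\]", cypher):
        if ":" not in inner:
            continue  # e.g. a list literal such as [1, 2]
        spec = inner.split(":", 1)[1].strip()
        # Drop variable-length markers (*1..2) and property maps ({...}).
        spec = re.split(r"[*{\s]", spec, maxsplit=1)[0]
        for name in spec.split("|"):
            name = name.strip(": ")
            if re.fullmatch(r"[A-Za-z_]\w*", name):
                types.add(name)
    return types


def unknown_relationships(cypher, allowed=ALLOWED_RELATIONSHIPS):
    """Return relationship types the query uses that the schema lacks."""
    return sorted(relationship_types(cypher) - allowed)


good = "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(c:Company) RETURN m.managerName"
bad = "MATCH (m:Manager)-[:INVESTS_IN]->(c:Company) RETURN m.managerName"
print(unknown_relationships(good))
print(unknown_relationships(bad))
```

A non-empty result is a signal to reject the query or ask the model to regenerate it instead of sending it to the database.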

Four retrieval paradigms, one engine

Key concept

The finished graph unites four paradigms

KGR-7.7

No single retrieval method answers every question — the graph’s power is holding all four in one query engine:

Vector — semantic similarity for conceptual questions (“about cloud storage”).
Full-text — keyword/string lookup for entities (“find Royal Bank of Canada”).
Graph traversal — relationship following for structure (“who invests in NetApp?”).
Geospatial — distance for location (“companies within 10 km of San Jose”).

(These are retrieval methods. Module 6’s “three search paradigms” counted index types: vector, full-text, and property. Exact property lookup is folded in here rather than listed separately, and traversal and geospatial are the new methods.) Complex questions compose them: “investors within 50 miles of NetApp” needs full-text + geospatial + traversal at once. [V] Verified

| Paradigm | Method | Example question |
| --- | --- | --- |
| Vector | db.index.vector.queryNodes | “Tell me about cloud storage” |
| Full-text | db.index.fulltext.queryNodes | “Find Palo Alto Networks” |
| Traversal | MATCH path patterns | “Who invests in NetApp?” |
| Geospatial | point.distance() | “Companies within 10 km of San Jose” |

Decompose a question into paradigms Worked example

Problem. Which paradigm(s) answer each: (a) “risk factors for cloud-computing companies?” (b) “find Goldman Sachs” (c) “investors within 50 miles of NetApp?”

Reasoning.
(a) Vector + traversal: semantic search for “risk factors”, then PART_OF → Form → Company to confirm they’re cloud companies.
(b) Full-text: an entity lookup; vector would wrongly surface similar names like Morgan Stanley.
(c) Full-text + geospatial + traversal: full-text find NetApp, LOCATED_AT to its address, point.distance < 80467 (50 mi) to nearby managers, OWNS_STOCK_IN to confirm they invest.

Answer. (a) vector + traversal, (b) full-text, (c) full-text + geospatial + traversal. The skill is decomposing a question into a sequence of paradigm-matched steps.
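Case (c) can be written down as an explicit plan plus one composed query. Everything here is a sketch under assumptions: the full-text index name `fullTextCompanyNames` is hypothetical, the relationship directions are assumed, and the query is held as a string and not run against a database.

```python
METRES_PER_MILE = 1609.344

# Each step is tagged with the paradigm that serves it.
STEPS = [
    ("full-text", "find the Company node named NetApp"),
    ("traversal", "follow LOCATED_AT to the company's Address"),
    ("geospatial", "keep Manager addresses with point.distance < radius"),
    ("traversal", "follow OWNS_STOCK_IN to confirm the manager invests"),
]


def paradigms_used(steps):
    """Distinct paradigms in first-use order."""
    ordered = []
    for paradigm, _description in steps:
        if paradigm not in ordered:
            ordered.append(paradigm)
    return ordered


# Composed query; index name and relationship directions are assumptions.
QUERY = """
CALL db.index.fulltext.queryNodes($index, $name) YIELD node, score
WITH node AS com
MATCH (com)-[:LOCATED_AT]->(comAddr:Address),
      (mgr:Manager)-[:LOCATED_AT]->(mgrAddr:Address),
      (mgr)-[:OWNS_STOCK_IN]->(com)
WHERE point.distance(comAddr.location, mgrAddr.location) < $radius_m
RETURN mgr.managerName
"""

params = {
    "index": "fullTextCompanyNames",   # hypothetical index name
    "name": "NetApp",
    "radius_m": 50 * METRES_PER_MILE,  # 50 miles expressed in metres
}

print(paradigms_used(STEPS))
print(round(params["radius_m"]))
```

Writing the plan first makes it clear which clause of the final query belongs to which paradigm.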

Name the four retrieval paradigms and give one question type suited to each. KGR-7.7

Vector — conceptual/semantic questions (“what is NetApp’s business?”). Full-text — entity lookup by name (“find Royal Bank of Canada”). Graph traversal — relationship/structural questions (“who are NetApp’s investors?”). Geospatial — location/distance questions (“companies within 10 km of San Jose”). Most real questions combine two or more.

Summary — and the course in one arc

This module added geospatial search (point.distance(), in metres, over Address nodes) and LLM-generated Cypher: a few-shot prompt plus the GraphCypherQAChain turn a plain-English question into a generated query, an execution, and a natural-language answer — guarded against schema hallucination by injecting the real schema. The finished graph unites four retrieval paradigms — vector, full-text, traversal, geospatial — a toolkit far richer than similarity search alone, all grown by the same Extract-Enhance-Expand move.

Across seven chapters the guide built one system: graph fundamentals (M1) → Cypher (M2) → vector search in the graph (M3) → constructing a graph from documents (M4) → relationships and chunk windows (M5) → expanding with a second dataset (M6) → conversational, multi-paradigm querying (M7). The throughline: retrieval by connection, not just similarity.

This is the retrieval guide in a series of three: see also Fine-tuning & RL (training a model) and Evaluating AI Agents (proving a system works).