Adding Relationships to the Knowledge Graph
Turning the flat chunk collection into a navigable knowledge graph: a Form node per filing, a NEXT linked list of chunks (section-filtered), PART_OF and SECTION connections, variable-length paths (*0..1) for boundary handling, and chunk-window retrieval that expands RAG context by combining vector search with graph traversal.
The Form node
First, a Form node represents the filing
itself, carrying the metadata the chunks share — formId, source, cik, and
cusip6:
CREATE (f:Form)
SET f.formId = $formInfoParam.formId,
f.source = $formInfoParam.source,
f.cik = $formInfoParam.cik,
f.cusip6 = $formInfoParam.cusip6
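The page does not show how the $formInfoParam map is assembled. Since every chunk of a filing shares these four fields, one plausible approach is to copy them off any single chunk. A minimal Python sketch, assuming chunks are plain dicts carrying those keys; the helper name form_info_from_chunk and all sample values are hypothetical placeholders:

```python
FORM_KEYS = ("formId", "source", "cik", "cusip6")


def form_info_from_chunk(chunk: dict) -> dict:
    """Copy the filing-level fields from one chunk into a parameter map."""
    missing = [key for key in FORM_KEYS if key not in chunk]
    if missing:
        raise KeyError(f"chunk is missing form fields: {missing}")
    return {key: chunk[key] for key in FORM_KEYS}


# Hypothetical chunk record; the values are placeholders, not real filing data.
sample_chunk = {
    "chunkId": "form-001-item1-chunk0000",
    "formId": "form-001",
    "source": "https://example.com/filing",
    "cik": "0000000",
    "cusip6": "000000",
    "item": "item1",
    "chunkSeqId": 0,
    "text": "...",
}

# This dict is what would be bound to $formInfoParam in the query above.
form_info_param = form_info_from_chunk(sample_chunk)
```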
The chunk linked list (NEXT)
Chunks are wired into a NEXT linked
list that preserves reading order — but only within a section. Two filters are
mandatory: same formId and same item.
MATCH (c1:Chunk), (c2:Chunk)
WHERE c1.formId = $formIdParam AND c1.item = $itemParam
AND c2.formId = $formIdParam AND c2.item = $itemParam
AND c2.chunkSeqId = c1.chunkSeqId + 1
MERGE (c1)-[:NEXT]->(c2)
MERGE (not CREATE) keeps the step idempotent — re-running a section won’t
duplicate NEXT edges.
Why must NEXT-relationship creation filter by section, and what specifically goes wrong without it? KGR-5.7
Chunk sequence ids are assigned within a section (restarting at 0), and NEXT links chunks whose ids differ by one. Without the item filter, the chunkSeqId + 1 join matches by number across the whole form — so a chunk links to the next-numbered chunk in other sections too, forking the linked list across boundaries. A chunk window built on it then pulls in irrelevant context from an adjacent section and degrades answer quality. Filtering by both formId and item keeps NEXT inside one section.
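To see the fork concretely, the join can be imitated in plain Python without a database. This is only a sketch of the matching logic, not the Cypher engine: the four chunks are made-up data, and next_edges and forked_nodes are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical chunks: two sections of one form, each numbered from 0.
chunks = [
    {"id": "item1-0", "formId": "F1", "item": "item1", "chunkSeqId": 0},
    {"id": "item1-1", "formId": "F1", "item": "item1", "chunkSeqId": 1},
    {"id": "item7-0", "formId": "F1", "item": "item7", "chunkSeqId": 0},
    {"id": "item7-1", "formId": "F1", "item": "item7", "chunkSeqId": 1},
]


def next_edges(chunks, filter_by_item):
    """Mimic the chunkSeqId + 1 join, with or without the item filter."""
    edges = set()  # a set, so repeated runs add nothing (like MERGE)
    for c1 in chunks:
        for c2 in chunks:
            same_form = c1["formId"] == c2["formId"]
            same_item = c1["item"] == c2["item"]
            consecutive = c2["chunkSeqId"] == c1["chunkSeqId"] + 1
            if same_form and consecutive and (same_item or not filter_by_item):
                edges.add((c1["id"], c2["id"]))
    return edges


def forked_nodes(edges):
    """Return chunks that have more than one outgoing NEXT edge."""
    outgoing = defaultdict(list)
    for src, dst in edges:
        outgoing[src].append(dst)
    return sorted(src for src, dsts in outgoing.items() if len(dsts) > 1)


filtered = next_edges(chunks, filter_by_item=True)
unfiltered = next_edges(chunks, filter_by_item=False)
```

With the filter, each section keeps a single chain. Without it, both seq-0 chunks gain a second NEXT edge into the other section, so forked_nodes(unfiltered) reports them while forked_nodes(filtered) is empty.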
Connecting chunks to the form (PART_OF, SECTION)
Two more relationship types complete the schema. Every chunk links to its parent form with PART_OF; the form links to the first chunk of each section with SECTION, tagged by section name:
// every chunk to its form
MATCH (c:Chunk), (f:Form) WHERE c.formId = f.formId
MERGE (c)-[:PART_OF]->(f)
// form to the first chunk (seq 0) of each section
MATCH (c:Chunk), (f:Form) WHERE c.formId = f.formId AND c.chunkSeqId = 0
MERGE (f)-[s:SECTION {f10kItem: c.item}]->(c)
Name the three relationship types added in this module, their directions, and what each enables. KGR-5.1
NEXT (Chunk → Chunk) — a linked list preserving document order, enabling sequential traversal and chunk windows. PART_OF (Chunk → Form) — document membership, enabling traversal from any chunk back to its filing. SECTION (Form → Chunk, carrying f10kItem) — an entry point to the first chunk of each section, enabling a direct jump to where a section begins.
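The way SECTION and NEXT combine can be sketched in plain Python: jump to a section's first chunk, then follow the linked list until it ends. This models the graph with two dicts rather than querying Neo4j; read_section and the sample ids are hypothetical.

```python
def read_section(section_start: dict, next_of: dict, item: str) -> list:
    """Follow SECTION to the first chunk, then NEXT until the list ends."""
    current = section_start.get(item)  # the (Form)-[:SECTION]->(Chunk) jump
    ordered = []
    seen = set()  # guard against an accidental cycle in the data
    while current is not None and current not in seen:
        ordered.append(current)
        seen.add(current)
        current = next_of.get(current)  # one (Chunk)-[:NEXT]->(Chunk) hop
    return ordered


# Hypothetical graph: SECTION edges keyed by f10kItem, NEXT edges by chunk id.
section_start = {"item1": "item1-0", "item7": "item7-0"}
next_of = {"item1-0": "item1-1", "item1-1": "item1-2", "item7-0": "item7-1"}

item1_chunks = read_section(section_start, next_of, "item1")
```

Because NEXT never leaves a section, the walk stops at the section's last chunk instead of running on into the next one.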
Chunk windows and variable-length paths
A chunk window retrieves a target chunk plus its neighbours so the LLM sees continuous context. A window of 2n + 1 chunks uses n hops in each direction; the default, n = 1, gives a 3-chunk window. The naive fixed pattern, though, breaks at the ends of the list:
// fixed 3-node window: FAILS at boundaries
MATCH (before:Chunk)-[:NEXT]->(middle:Chunk)-[:NEXT]->(after:Chunk)
WHERE middle.chunkId = $chunkIdParam
RETURN before, middle, after
Variable-length paths degrade gracefully at boundaries
KGR-5.3 The first chunk has no predecessor and the last has no successor, so a fixed
two-hop window returns zero results there. A
variable-length path *0..1 matches
zero or one NEXT hops, so the pattern still matches at the ends: at the first
chunk before collapses onto middle (0 hops back); at the last, after
collapses onto middle. Order by path length descending and LIMIT 1 to keep
the longest available window. [V] Verified
MATCH window = (before:Chunk)-[:NEXT*0..1]->(middle:Chunk)-[:NEXT*0..1]->(after:Chunk)
WHERE middle.chunkId = $chunkIdParam
WITH window, length(window) AS windowLength
ORDER BY windowLength DESC
LIMIT 1
RETURN window
What does the *0..1 notation match, and why does it fix the fixed-window failure at the first and last chunks? KGR-5.3
*0..1 matches zero or one NEXT hops (a variable-length path). At a boundary a fixed one-hop side has nothing to match and the whole pattern returns nothing; with *0..1 the missing side simply matches zero hops (the endpoint collapses onto the target), so the query still returns the available neighbours. Ordering by length(window) DESC LIMIT 1 then keeps the longest window that exists at that position.
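The boundary behaviour can be checked with a small Python model of the pattern. It enumerates every zero-or-one-hop choice on each side and keeps the longest result, which is what ORDER BY length(window) DESC LIMIT 1 does. The function name chunk_window and the three-chunk list are illustrative assumptions, and the sketch counts nodes where Cypher's length() counts relationships (the ordering is the same).

```python
def chunk_window(next_of: dict, target: str) -> list:
    """Return the longest [before?, target, after?] window around target.

    next_of maps a chunk id to its NEXT successor. Each side may take
    zero or one hop, mirroring the *0..1 pattern.
    """
    prev_of = {dst: src for src, dst in next_of.items()}

    befores = [target]  # zero hops back: before collapses onto target
    if target in prev_of:
        befores.append(prev_of[target])  # one hop back
    afters = [target]  # zero hops forward: after collapses onto target
    if target in next_of:
        afters.append(next_of[target])  # one hop forward

    candidates = []
    for before in befores:
        for after in afters:
            path = [target]
            if before != target:
                path.insert(0, before)
            if after != target:
                path.append(after)
            candidates.append(path)

    # Equivalent of ORDER BY windowLength DESC LIMIT 1.
    return max(candidates, key=len)


# Hypothetical three-chunk section: c0 -> c1 -> c2.
next_of = {"c0": "c1", "c1": "c2"}
first = chunk_window(next_of, "c0")
middle = chunk_window(next_of, "c1")
last = chunk_window(next_of, "c2")
```

Here the middle chunk gets the full three-chunk window, while the first and last chunks each return a two-chunk window rather than nothing.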
Enhanced RAG: vector search meets graph traversal
The payoff is a custom retrieval query:
Neo4jVector finds the best-matching chunk by vector similarity, then a Cypher
tail expands it into a window before the chunks reach the LLM. You pass it via
the retrieval_query argument; it receives the matched node and its score:
WITH node, score
MATCH window = (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window ORDER BY length(window) DESC LIMIT 1
UNWIND nodes(window) AS chunk
RETURN chunk.text AS text, score, {source: chunk.source} AS metadata
The combination beats either technique alone
KGR-5.6 Vector search alone finds a relevant point; graph traversal alone has no notion of relevance. Together, vector search locates the entry chunk and the window adds the surrounding context the LLM needs for a complete answer. Concretely: asked for NetApp’s primary business, the single-chunk answer says “enterprise storage and data management”; the windowed answer adds the Keystone product detail that lived in the adjacent chunk. When it breaks: if NEXT crosses sections (missing filter) the window injects irrelevant context, and very large windows can dilute the signal past the LLM’s useful attention. [V] Verified
How does a custom retrieval query extend Neo4jVector's default behaviour, and what two values does it receive from the vector search? KGR-5.5
The default retriever returns the matched chunk’s text as-is. A custom retrieval_query runs after the vector search and receives the matched node and its similarity score; from there it can traverse the graph — here following NEXT to build a chunk window — so the LLM gets expanded context, not a lone chunk. It returns text, score, and a metadata map.
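On the Python side the retrieval query is just a string, so it can be built by a small helper. The sketch below reproduces the query shown above and, as an assumption that goes beyond the page, lets the hop count vary for wider windows; window_retrieval_query is a hypothetical name.

```python
def window_retrieval_query(hops: int = 1) -> str:
    """Build the Cypher tail that expands a matched node into a chunk window.

    hops=1 reproduces the 3-chunk window query from this page; larger values
    are an untested extension and widen the window on both sides.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return (
        "WITH node, score\n"
        f"MATCH window = (:Chunk)-[:NEXT*0..{hops}]->(node)"
        f"-[:NEXT*0..{hops}]->(:Chunk)\n"
        "WITH node, score, window ORDER BY length(window) DESC LIMIT 1\n"
        "UNWIND nodes(window) AS chunk\n"
        "RETURN chunk.text AS text, score, {source: chunk.source} AS metadata"
    )


RETRIEVAL_QUERY = window_retrieval_query()
```

The resulting string is what would be handed to Neo4jVector through its retrieval_query argument. Connection details and the index name are left out because this page does not specify them.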
When is graph augmentation worth it?
A team gets 78 percent accuracy from pure vector RAG and 89 percent from graph-augmented RAG at higher cost and latency. Describe a hybrid that captures most of the gain more cheaply. KGR-5.6
Run vector search first (the cheap path), then conditionally expand via graph traversal only when a cheap signal says it’s needed — the top similarity score is below a threshold, or the question references multiple entities. The expensive graph window then fires on only the ~30–40% of queries that benefit, capturing most of the accuracy/completeness gain while keeping average cost and latency near the pure-vector baseline.
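The score-gated half of that hybrid might look like the following Python sketch. The two callables stand in for the real vector search and graph expansion, the 0.8 threshold is an arbitrary placeholder to be tuned on real queries, and the multi-entity signal is omitted for brevity.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (chunk id, similarity score)


def hybrid_retrieve(
    question: str,
    vector_search: Callable[[str], List[Hit]],
    expand_window: Callable[[str], List[str]],
    score_threshold: float = 0.8,  # hypothetical value, not from the source
) -> List[str]:
    """Return chunk ids, paying for graph expansion only on weak matches."""
    hits = vector_search(question)  # cheap path, always runs
    if not hits:
        return []
    top_id, top_score = max(hits, key=lambda hit: hit[1])
    if top_score >= score_threshold:
        return [top_id]  # confident match: skip the traversal
    return expand_window(top_id)  # weak match: expand into a chunk window


# Stub back ends for illustration only.
def fake_search(question: str) -> List[Hit]:
    return [("c1", 0.9)] if "easy" in question else [("c1", 0.5)]


def fake_expand(chunk_id: str) -> List[str]:
    return ["c0", chunk_id, "c2"]


confident = hybrid_retrieve("easy question", fake_search, fake_expand)
expanded = hybrid_retrieve("hard question", fake_search, fake_expand)
```

Logging how often the expansion branch fires is a simple way to check whether the gate really limits traversal to the minority of queries that need it.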
Summary
Adding NEXT, PART_OF, and SECTION relationships turns a flat vector store into a
navigable knowledge graph. NEXT linked lists preserve document order — built
per section so they never bleed across boundaries. PART_OF gives every chunk a
path back to its form; SECTION gives the form indexed entry points into each
section. Variable-length paths (*0..1) make chunk windows degrade gracefully at
the list ends, and a custom retrieval query fuses vector search (find the entry
chunk) with graph traversal (expand the window) in a single step — the core move
of knowledge-graph-powered RAG. Whether to pay for that traversal is a
cost/benefit call: worth it when answers span boundaries, and a score-gated
hybrid captures most of the gain cheaply.
- Chapter 6 — expanding the graph: Company and investor nodes (Form 13 data) join the forms via the CUSIP bridge.
- Chapter 7 — answering questions by having an LLM generate Cypher across the connected graph.