Part 1 Chapter 5 Last verified 2026-06-19

Adding Relationships to the Knowledge Graph

Turning the flat chunk collection into a navigable knowledge graph: a Form node per filing, a NEXT linked list of chunks (section-filtered), PART_OF and SECTION connections, variable-length paths (*0..1) for boundary handling, and chunk-window retrieval that expands RAG context by combining vector search with graph traversal.

On this page

The Form node
The chunk linked list (NEXT)
Connecting chunks to the form (PART_OF, SECTION)
Chunk windows and variable-length paths
Enhanced RAG: vector search meets graph traversal
When is graph augmentation worth it?
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Connecting chunks to the form; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

You create NEXT relationships across all chunks ordered by chunkSeqId, forgetting to filter by section. Predict the structural bug this introduces.
A fixed 3-chunk window before → middle → after is run on the first chunk of a section. Predict what it returns.
Adding a chunk window lifted answer completeness from 62% to 84% but added latency and cost. Predict the one-line rule for when that trade is worth paying.
You have a chunk and want the first chunk of its section. Predict which two relationship types you traverse.

Check your answers

Cross-section NEXT links: with only the formId filter, c2.chunkSeqId = c1.chunkSeqId + 1 also matches chunks in other sections that carry the next sequence number — so each chunk links to the next-numbered chunk in every section, and the linked list forks and bleeds across section boundaries, after which windows pull in irrelevant context.
Zero results — the first chunk has no predecessor, so the fixed before hop fails to match and the whole pattern returns nothing. Variable-length *0..1 fixes this.
Pay for graph augmentation when questions actually span chunk/document boundaries and answer quality drives value; for self-contained FAQ lookups, pure vector is enough.
PART_OF (chunk → its Form) then SECTION (Form → the first chunk of that section), following the SECTION edge whose f10kItem matches the chunk’s item: a two-hop traversal up to the form and back down to the section entry point.

The Form node

First, a Form node represents the filing itself, carrying the metadata the chunks share — formId, source, cik, and cusip6:

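// MERGE on formId keeps this step idempotent: re-running it will not duplicate the Form node
MERGE (f:Form {formId: $formInfoParam.formId})
ON CREATE SET
    f.source = $formInfoParam.source,
    f.cik    = $formInfoParam.cik,
    f.cusip6 = $formInfoParam.cusip6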

The chunk linked list (NEXT)

Chunks are wired into a NEXT linked list that preserves reading order — but only within a section. Two filters are mandatory: same formId and same item.

MATCH (c1:Chunk), (c2:Chunk)
WHERE c1.formId = $formIdParam AND c1.item = $itemParam
  AND c2.formId = $formIdParam AND c2.item = $itemParam
  AND c2.chunkSeqId = c1.chunkSeqId + 1
MERGE (c1)-[:NEXT]->(c2)

MERGE (not CREATE) keeps the step idempotent — re-running a section won’t duplicate NEXT edges.
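The difference between the two write modes can be illustrated without a database. The sketch below is a plain-Python analogy, not Neo4j code: a set stands in for MERGE (an identical edge is stored once) and a list stands in for CREATE (every run appends another copy). The helper names merge_edge and create_edge are hypothetical.

```python
def merge_edge(edges, src, rel_type, dst):
    """MERGE-like: the edge is stored once, however often the step runs."""
    edges.add((src, rel_type, dst))


def create_edge(edge_list, src, rel_type, dst):
    """CREATE-like: every call appends a new edge, duplicates included."""
    edge_list.append((src, rel_type, dst))


merged, created = set(), []
for _ in range(2):  # simulate running the same load step twice
    merge_edge(merged, "c0", "NEXT", "c1")
    create_edge(created, "c0", "NEXT", "c1")

print(len(merged), len(created))  # 1 2
```

After two runs the MERGE-like store still holds one NEXT edge, while the CREATE-like store holds two.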

Build a NEXT linked list and count the edges Worked example

Problem. A paper has two sections — Introduction (chunks seq 0–4) and Methods (chunks seq 0–4), sequence ids restarting per section as in the SEC pipeline. You build NEXT with the section filter. How many edges result, and what does omitting the filter do?

Reasoning. With the item filter, NEXT links consecutive chunks within each section: Introduction 0→1→2→3→4 (4 edges) and Methods 0→1→2→3→4 (4 edges), 8 in total. Omitting the filter, the chunkSeqId + 1 join matches by number across the whole form, so Introduction’s seq-i chunk also links to Methods’ seq-(i+1) chunk (and vice versa). Each of the 8 chunks with seq 0–3 then gets two successors, one per section, giving 16 edges, of which 8 are spurious cross-section links that fork the list.

Answer. 8 NEXT edges with the filter. Without it there are 16: every chunk that has a successor also links to the next-numbered chunk in the other section, corrupting the list. That is why the item filter is a correctness requirement, not optional tuning.
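The edge counts in the worked example can be checked with a small in-memory sketch. It mirrors the chunkSeqId + 1 join in plain Python rather than running Cypher; the function build_next_edges, the chunk ids, and the form id "paper-1" are hypothetical, and the filter here compares the two chunks' item values directly instead of taking an $itemParam per section.

```python
from itertools import product


def build_next_edges(chunks, filter_by_item=True):
    """Return (source_id, target_id) pairs where target.seq == source.seq + 1."""
    edges = set()
    for c1, c2 in product(chunks, repeat=2):
        if c1["formId"] != c2["formId"]:
            continue  # the formId filter is always applied
        if filter_by_item and c1["item"] != c2["item"]:
            continue  # the section filter under test
        if c2["chunkSeqId"] == c1["chunkSeqId"] + 1:
            edges.add((c1["chunkId"], c2["chunkId"]))
    return edges


# Two sections of five chunks each, sequence ids restarting at 0 per section
chunks = [
    {"chunkId": f"{item}-{seq}", "formId": "paper-1", "item": item, "chunkSeqId": seq}
    for item in ("intro", "methods")
    for seq in range(5)
]

filtered = build_next_edges(chunks, filter_by_item=True)
unfiltered = build_next_edges(chunks, filter_by_item=False)
spurious = unfiltered - filtered

print(len(filtered), len(unfiltered), len(spurious))  # 8 16 8
```

Every edge in spurious joins a chunk in one section to a chunk in the other, which is the fork described in the worked example.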

Why must NEXT-relationship creation filter by section, and what specifically goes wrong without it? KGR-5.7

Chunk sequence ids are assigned within a section (restarting at 0), and NEXT links chunks whose ids differ by one. Without the item filter, the chunkSeqId + 1 join matches by number across the whole form — so a chunk links to the next-numbered chunk in other sections too, forking the linked list across boundaries. A chunk window built on it then pulls in irrelevant context from an adjacent section and degrades answer quality. Filtering by both formId and item keeps NEXT inside one section.

Connecting chunks to the form (PART_OF, SECTION)

Two more relationship types complete the schema. Every chunk links to its parent form with PART_OF; the form links to the first chunk of each section with SECTION, tagged by section name:

// every chunk to its form
MATCH (c:Chunk), (f:Form) WHERE c.formId = f.formId
MERGE (c)-[:PART_OF]->(f)

// form to the first chunk (seq 0) of each section
MATCH (c:Chunk), (f:Form) WHERE c.formId = f.formId AND c.chunkSeqId = 0
MERGE (f)-[s:SECTION {f10kItem: c.item}]->(c)

Name the three relationship types added in this module, their directions, and what each enables. KGR-5.1

NEXT (Chunk → Chunk) — a linked list preserving document order, enabling sequential traversal and chunk windows. PART_OF (Chunk → Form) — document membership, enabling traversal from any chunk back to its filing. SECTION (Form → Chunk, carrying f10kItem) — an entry point to the first chunk of each section, enabling a direct jump to where a section begins.
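To make the two-hop jump from a chunk to the start of its section concrete, here is a minimal in-memory sketch. Dictionaries stand in for the PART_OF and SECTION relationships; the chunk ids, the form id "form-A", and the function first_chunk_of_section are hypothetical and only illustrate the traversal order.

```python
# Hypothetical chunks for one filing; field names mirror the chapter's properties
chunks = [
    {"chunkId": "A-item1-0", "formId": "form-A", "item": "item1", "chunkSeqId": 0},
    {"chunkId": "A-item1-1", "formId": "form-A", "item": "item1", "chunkSeqId": 1},
    {"chunkId": "A-item7-0", "formId": "form-A", "item": "item7", "chunkSeqId": 0},
    {"chunkId": "A-item7-1", "formId": "form-A", "item": "item7", "chunkSeqId": 1},
    {"chunkId": "A-item7-2", "formId": "form-A", "item": "item7", "chunkSeqId": 2},
]

# PART_OF: chunk -> form
part_of = {c["chunkId"]: c["formId"] for c in chunks}
# SECTION: (form, f10kItem) -> first chunk (seq 0) of that section
section = {
    (c["formId"], c["item"]): c["chunkId"] for c in chunks if c["chunkSeqId"] == 0
}
item_of = {c["chunkId"]: c["item"] for c in chunks}


def first_chunk_of_section(chunk_id):
    form_id = part_of[chunk_id]                    # hop 1: up to the form
    return section[(form_id, item_of[chunk_id])]   # hop 2: down to the section entry


print(first_chunk_of_section("A-item7-2"))  # A-item7-0
```

The second lookup is keyed by the section name, which plays the role of the f10kItem property on the SECTION relationship.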

Chunk windows and variable-length paths

A chunk window retrieves a target chunk plus its neighbours so the LLM sees continuous context. A window of $w = 2k+1$ chunks uses $k$ hops in each direction; the default $k=1$ gives a 3-chunk window. The naive fixed pattern, though, breaks at the ends of the list:

// fixed 3-node window: FAILS at boundaries
MATCH (before:Chunk)-[:NEXT]->(middle:Chunk)-[:NEXT]->(after:Chunk)
WHERE middle.chunkId = $chunkIdParam
RETURN before, middle, after

Key concept

Variable-length paths degrade gracefully at boundaries

KGR-5.3

The first chunk has no predecessor and the last has no successor, so a fixed two-hop window returns zero results there. A variable-length path *0..1 matches zero or one NEXT hops, so the pattern still matches at the ends: at the first chunk before collapses onto middle (0 hops back); at the last, after collapses onto middle. Order by path length descending and LIMIT 1 to keep the longest available window. [V] Verified

MATCH window = (before:Chunk)-[:NEXT*0..1]->(middle:Chunk)-[:NEXT*0..1]->(after:Chunk)
WHERE middle.chunkId = $chunkIdParam
WITH window, length(window) AS windowLength
ORDER BY windowLength DESC
LIMIT 1
RETURN window

What does the *0..1 notation match, and why does it fix the fixed-window failure at the first and last chunks? KGR-5.3

*0..1 matches zero or one NEXT hops (a variable-length path). At a boundary a fixed one-hop side has nothing to match and the whole pattern returns nothing; with *0..1 the missing side simply matches zero hops (the endpoint collapses onto the target), so the query still returns the available neighbours. Ordering by length(window) DESC LIMIT 1 then keeps the longest window that exists at that position.
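The boundary behaviour can be reproduced on a plain linked list. The sketch below is an in-memory analogy, not Cypher: fixed_window mimics the fixed before → middle → after pattern, and variable_window mimics *0..k on each side by walking as far as the list allows. Both function names, the chunk ids, and the generalisation from 1 to k hops are assumptions for illustration. On a linear list, walking greedily in each direction yields the same result as keeping the longest matching path.

```python
def fixed_window(next_of, prev_of, chunk_id):
    """Fixed 3-node pattern: no match unless both neighbours exist."""
    before, after = prev_of.get(chunk_id), next_of.get(chunk_id)
    if before is None or after is None:
        return []
    return [before, chunk_id, after]


def variable_window(next_of, prev_of, chunk_id, k=1):
    """Up to k NEXT hops on each side, stopping early at the list ends."""
    window = [chunk_id]
    cur = chunk_id
    for _ in range(k):  # walk backwards
        cur = prev_of.get(cur)
        if cur is None:
            break
        window.insert(0, cur)
    cur = chunk_id
    for _ in range(k):  # walk forwards
        cur = next_of.get(cur)
        if cur is None:
            break
        window.append(cur)
    return window


# One section of four chunks: c0 -> c1 -> c2 -> c3
next_of = {"c0": "c1", "c1": "c2", "c2": "c3"}
prev_of = {dst: src for src, dst in next_of.items()}

print(fixed_window(next_of, prev_of, "c0"))     # []
print(variable_window(next_of, prev_of, "c0"))  # ['c0', 'c1']
print(variable_window(next_of, prev_of, "c1"))  # ['c0', 'c1', 'c2']
print(variable_window(next_of, prev_of, "c3"))  # ['c2', 'c3']
```

In the interior of the list the window has the full 2k+1 chunks; at either end it shrinks instead of disappearing.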

Enhanced RAG: vector search meets graph traversal

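The payoff is a custom retrieval query: Neo4jVector finds the best-matching chunk by vector similarity, then a Cypher tail expands it into a window before the chunks reach the LLM. You pass it via the retrieval_query argument; it receives the matched node and its score. Note that the ORDER BY length(window) DESC LIMIT 1 step applies to the whole result stream rather than per matched node, so the tail below keeps a single window in total, which fits the one-best-chunk case described here: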

WITH node, score
MATCH window = (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window ORDER BY length(window) DESC LIMIT 1
UNWIND nodes(window) AS chunk
RETURN chunk.text AS text, score, {source: chunk.source} AS metadata

Key concept

The combination beats either technique alone

KGR-5.6

Vector search alone finds a relevant point; graph traversal alone has no notion of relevance. Together, vector search locates the entry chunk and the window adds the surrounding context the LLM needs for a complete answer. Concretely: asked for NetApp’s primary business, the single-chunk answer says “enterprise storage and data management”; the windowed answer adds the Keystone product detail that lived in the adjacent chunk. When it breaks: if NEXT crosses sections (missing filter) the window injects irrelevant context, and very large windows ($k > 3$) can dilute the signal past the LLM’s useful attention. [V] Verified

How does a custom retrieval query extend Neo4jVector's default behaviour, and what two values does it receive from the vector search? KGR-5.5

The default retriever returns the matched chunk’s text as-is. A custom retrieval_query runs after the vector search and receives the matched node and its similarity score; from there it can traverse the graph — here following NEXT to build a chunk window — so the LLM gets expanded context, not a lone chunk. It returns text, score, and a metadata map.
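The two-stage flow (vector search picks the entry chunk, a graph step widens it) can be sketched end to end in memory. This is an analogy under stated assumptions, not the Neo4jVector implementation: the 2-d embeddings, chunk texts, and source label are toy placeholders, list positions stand in for NEXT links within one section, and retrieve_with_window is a hypothetical name. Like the Cypher tail, it returns one row per chunk in the window, each carrying the entry chunk's score.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy chunks in reading order within a single section
chunks = [
    {"text": "chunk 0 text", "source": "doc-A", "embedding": [1.0, 0.0]},
    {"text": "chunk 1 text", "source": "doc-A", "embedding": [0.0, 1.0]},
    {"text": "chunk 2 text", "source": "doc-A", "embedding": [0.6, 0.8]},
]


def retrieve_with_window(query_embedding, chunks):
    # Stage 1: vector search, keeping the single best-matching chunk and its score
    score, idx = max(
        (cosine(query_embedding, c["embedding"]), i) for i, c in enumerate(chunks)
    )
    # Stage 2: zero or one hop on each side, clipped at the ends of the list
    first, last = max(idx - 1, 0), min(idx + 1, len(chunks) - 1)
    return [
        {"text": c["text"], "score": score, "metadata": {"source": c["source"]}}
        for c in chunks[first : last + 1]
    ]


rows = retrieve_with_window([0.0, 1.0], chunks)
print([r["text"] for r in rows])  # ['chunk 0 text', 'chunk 1 text', 'chunk 2 text']
```

A query that matches the first chunk best, such as [1.0, 0.0], yields a two-row window because there is no predecessor to include.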

When is graph augmentation worth it?

Pattern

Where practitioners disagree — now that the graph exists, is the chunk-window / graph hop worth its latency and cost? The published trade for one support-RAG build: pure vector retrieval ran ~78% accuracy / 62% completeness at ~$200/month and ~120 ms; graph-augmented retrieval ran ~89% / 84% at ~$450/month and ~180 ms — a 22-point completeness jump for roughly +$250/month and +60 ms. The honest rule is the same as Modules 3–4: pay for traversal when questions actually span chunk or document boundaries and quality drives value; skip it for self-contained FAQ lookups. And there’s a third option — a hybrid: run vector search first and trigger the graph expansion only when the top score is low or the question is multi-entity, capturing most of the accuracy gain at closer to the pure-vector cost.

A team gets 78 percent accuracy from pure vector RAG and 89 percent from graph-augmented RAG at higher cost and latency. Describe a hybrid that captures most of the gain more cheaply. KGR-5.6

Run vector search first (the cheap path), then conditionally expand via graph traversal only when a cheap signal says it’s needed — the top similarity score is below a threshold, or the question references multiple entities. The expensive graph window then fires on only the ~30–40% of queries that benefit, capturing most of the accuracy/completeness gain while keeping average cost and latency near the pure-vector baseline.
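A score-gated hybrid along these lines might look like the sketch below. Everything here is illustrative: hybrid_retrieve, the 0.8 threshold, and the three stub functions are hypothetical stand-ins for a real vector search, window expansion, and multi-entity check. The blended helper assumes average latency and cost scale linearly with the fraction of queries that take the graph path, which is a rough simplification.

```python
def hybrid_retrieve(question, vector_search, expand_window, is_multi_entity,
                    score_threshold=0.8):
    """Run the cheap path first; expand via the graph only when a signal asks for it."""
    hit = vector_search(question)
    if hit["score"] < score_threshold or is_multi_entity(question):
        return expand_window(hit), True   # graph path taken
    return [hit["text"]], False           # pure-vector result


def blended(pure_vector, graph_augmented, fraction_expanded):
    """Linear interpolation between the two published figures (an assumption)."""
    return pure_vector + fraction_expanded * (graph_augmented - pure_vector)


# Stubs standing in for real components
def fake_vector_search(question):
    score = 0.9 if "refund" in question else 0.6
    return {"text": "entry chunk", "score": score}


def fake_expand_window(hit):
    return ["previous chunk", hit["text"], "next chunk"]


def fake_is_multi_entity(question):
    return " and " in question


for q in ("refund policy", "compare plans", "refund and returns"):
    context, used_graph = hybrid_retrieve(
        q, fake_vector_search, fake_expand_window, fake_is_multi_entity
    )
    print(q, len(context), used_graph)

# Estimated averages if 35% of queries take the graph path
print(blended(120, 180, 0.35), blended(200, 450, 0.35))
```

Only the first stub query stays on the pure-vector path; the other two trigger expansion, one through the low score and one through the multi-entity check.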

Summary

Adding NEXT, PART_OF, and SECTION relationships turns a flat vector store into a navigable knowledge graph. NEXT linked lists preserve document order — built per section so they never bleed across boundaries. PART_OF gives every chunk a path back to its form; SECTION gives the form indexed entry points into each section. Variable-length paths (*0..1) make chunk windows degrade gracefully at the list ends, and a custom retrieval query fuses vector search (find the entry chunk) with graph traversal (expand the window) in a single step — the core move of knowledge-graph-powered RAG. Whether to pay for that traversal is a cost/benefit call: worth it when answers span boundaries, and a score-gated hybrid captures most of the gain cheaply.

Chapter 6 — expanding the graph: Company and investor nodes (Form 13 data) join the forms via the CUSIP bridge.
Chapter 7 — answering questions by having an LLM generate Cypher across the connected graph.