Part 1 Chapter 6 Last verified 2026-06-19

Expanding the Knowledge Graph

Widening the graph's scope with a second dataset — SEC Form 13 investment holdings: Company and Manager nodes, the CUSIP bridge that links them to existing forms, FILED and OWNS_STOCK_IN relationships, full-text indexes alongside vector search, multi-hop traversal from chunk to manager, graph-to-text context, and when investment-enhanced RAG actually helps.

On this page

A second dataset: SEC Form 13
Linking the datasets with CUSIP
Full-text search alongside vectors
Investment relationships and the complete schema
Multi-hop traversal and graph-to-text
When does investment context actually help?
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Linking the datasets with CUSIP; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

Form 13 (investment) data was created completely separately from the Form 10-K graph. Predict the single property that lets you connect a new Company node to the right existing Form without manual matching.
You want to find managers named like “Royal Bank” by string match, not by meaning. Predict which index type Neo4j uses — vector or full-text.
You add investor context to every RAG answer. Asked “Tell me about NetApp’s business,” predict whether the investor data improves the answer.
Starting from a text chunk, predict the hop sequence that reaches the investment managers who own the filing company.

Check your answers

CUSIP 6 — both datasets carry it, so MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6 links them automatically. Shared identifiers are the cross-dataset bridge.
Full-text index — string/keyword matching (partial, fuzzy, relevance-scored). Vector search matches meaning, not spelling.
No — the question doesn’t ask about investors, so the extra context is ignored as noise. Graph context helps only when the question needs it.
Chunk → Form (PART_OF), Form ← Company (FILED), Company ← Manager (OWNS_STOCK_IN): three hops across four node types, from chunk to manager.

A second dataset: SEC Form 13

Form 13 is filed quarterly by institutional investment firms to report their public-company holdings. Each row carries a manager (name, CIK, address), a company (name, cusip6), and the position (shares, value, quarter). The course dataset has 561 rows — all investments in NetApp.

MERGE (com:Company {cusip6: $p.cusip6})
ON CREATE SET com.companyName = $p.companyName, com.cusip = $p.cusip

MERGE (mgr:Manager {managerCik: $p.managerCik})
ON CREATE SET mgr.managerName = $p.managerName, mgr.managerAddress = $p.managerAddress
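A minimal sketch of how CSV rows might become the `$p` parameter maps used above. It assumes column names that mirror those parameters; the sample rows are made up and are not from the course dataset. Plain dictionaries stand in for the database to show the `MERGE ... ON CREATE SET` behaviour: the first row for a key creates the node, and later rows with the same key leave it unchanged.

```python
import csv
import io

# Made-up sample; column names are assumed to mirror the $p parameters.
SAMPLE_CSV = """managerCik,managerName,managerAddress,cusip6,cusip,companyName
1001,Example Capital LLC,1 Main St,123456,123456789,EXAMPLE CORP
1002,Sample Advisors,2 High St,123456,123456789,EXAMPLE CORP
1001,Example Capital LLC,1 Main St,123456,123456789,EXAMPLE CORP
"""


def merge_nodes(rows):
    """Simulate the two MERGE statements with dictionaries keyed on the merge property."""
    companies, managers = {}, {}
    for p in rows:
        # setdefault only writes when the key is new, like ON CREATE SET.
        companies.setdefault(
            p["cusip6"], {"companyName": p["companyName"], "cusip": p["cusip"]}
        )
        managers.setdefault(
            p["managerCik"],
            {"managerName": p["managerName"], "managerAddress": p["managerAddress"]},
        )
    return companies, managers


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
companies, managers = merge_nodes(rows)
print(len(companies), len(managers))  # 1 2
```

Three rows produce one Company and two Manager nodes because the merge keys (`cusip6`, `managerCik`) repeat.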

Linking the datasets with CUSIP

The Form 10-K graph (Module 4) and this Form 13 data were built independently — yet they connect with no manual matching, because both carry a CUSIP. Match on the shared cusip6 and create a FILED edge:

MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6
MERGE (com)-[:FILED]->(f)
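The same join can be sketched outside the database, which is handy for checking how many companies would be left unlinked before running the Cypher. This is an illustration only: the `formId` field and the sample records are assumptions, not taken from the course data.

```python
from collections import defaultdict


def link_by_cusip6(companies, forms):
    """Pair each company with every form that shares its cusip6.

    Returns (filed_pairs, unmatched_companies).
    """
    forms_by_key = defaultdict(list)
    for form in forms:
        forms_by_key[form["cusip6"]].append(form)

    filed, unmatched = [], []
    for com in companies:
        matches = forms_by_key.get(com["cusip6"], [])
        if not matches:
            unmatched.append(com["companyName"])
        for form in matches:
            # Equivalent of MERGE (com)-[:FILED]->(f)
            filed.append((com["companyName"], form["formId"]))
    return filed, unmatched


# Made-up records for illustration.
companies = [
    {"companyName": "EXAMPLE CORP", "cusip6": "123456"},
    {"companyName": "OTHER CORP", "cusip6": "999999"},
]
forms = [{"formId": "form-a", "cusip6": "123456"}]

filed, unmatched = link_by_cusip6(companies, forms)
print(filed)      # [('EXAMPLE CORP', 'form-a')]
print(unmatched)  # ['OTHER CORP']
```

A non-empty `unmatched` list is the early signal that the shared key does not cover every record.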

Key concept

Shared identifiers are the cross-dataset bridge

KGR-6.7

CUSIP 6 is a universal identifier present in both datasets, so MATCH pairs each Company with its Form automatically and MERGE creates the FILED edge between them. This is the standard pattern for expanding a knowledge graph with new sources. When it breaks: if the two datasets use different identifier schemes (CUSIP here, ISIN or ticker there), automatic linking fails and you need an entity-resolution layer to reconcile identifiers before the match can succeed. The hard part of integration is rarely the join; it’s agreeing on the key. [V] Verified

When the linking key isn't universal Worked example

Problem. You integrate patent filings (assignee = company name) with an employee database (employer = company name). Why is company name a poor linking key, and what’s better?

Reasoning. Names aren’t standardised: “Google LLC”, “Alphabet Inc.”, and “Google” may be one entity, producing missed matches (false negatives) and wrong matches (false positives). CUSIP worked in this chapter precisely because it’s a canonical id. Without one, build entity resolution: a canonical id where it exists (DUNS, SEC CIK), else fuzzy-match names, verify with secondary attributes (address, industry), and store a canonical node with an aliases property.

Answer. Raw name strings are unreliable keys. Prefer a canonical identifier; where none exists, an entity-resolution pipeline — fuzzy match plus attribute verification — is the (often hardest) integration step.
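The pipeline described in the reasoning can be sketched as follows. This is a simplified illustration, not a production resolver: the suffix list, the 0.85 threshold, the `industry` attribute, and the node layout are all assumptions chosen for the example, and the standard library's `difflib` stands in for a dedicated fuzzy-matching library.

```python
from difflib import SequenceMatcher

# Assumed, non-exhaustive list of legal suffixes to strip before comparing.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "ltd", "co"}


def normalize(name):
    """Lowercase, drop punctuation and legal suffixes."""
    words = name.lower().replace(".", "").replace(",", "").split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def resolve(record, canonical_nodes, threshold=0.85):
    """Return the id of the canonical node a record refers to, or None."""
    name = normalize(record["name"])
    for node in canonical_nodes:
        candidates = [node["name"], *node["aliases"]]
        best = max(
            SequenceMatcher(None, name, normalize(c)).ratio() for c in candidates
        )
        # Fuzzy name match must be confirmed by a secondary attribute.
        if best >= threshold and record["industry"] == node["industry"]:
            return node["id"]
    return None


# One canonical node with an aliases property, as described above.
nodes = [
    {
        "id": "C1",
        "name": "Alphabet Inc.",
        "aliases": ["Google LLC", "Google"],
        "industry": "technology",
    }
]

print(resolve({"name": "Google, LLC", "industry": "technology"}, nodes))  # C1
print(resolve({"name": "Google", "industry": "retail"}, nodes))           # None
```

The second call returns `None` even though the name matches exactly, because the secondary attribute disagrees; that check is what guards against false positives.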

Why can Company and Form nodes be linked automatically here, and what is needed when two datasets use different identifier schemes? KGR-6.7

Both datasets carry the CUSIP 6 identifier, so MATCH ... WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f) connects them with no manual matching: a shared universal key is the bridge. When datasets use different schemes (CUSIP vs ISIN vs ticker), there’s no common key to match on, so you need an entity-resolution layer that maps identifiers (or fuzzy-matches names and verifies with secondary attributes) before linking.

Full-text search alongside vectors

Vector search matches meaning; to match strings — find a manager named like “Royal Bank” — you need a full-text index:

CREATE FULLTEXT INDEX fullTextManagerNames IF NOT EXISTS
FOR (m:Manager) ON EACH [m.managerName]

CALL db.index.fulltext.queryNodes('fullTextManagerNames', 'Royal Bank')
  YIELD node, score
RETURN node.managerName, score
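The query string passed to `db.index.fulltext.queryNodes` is interpreted with Lucene query syntax, so the same index can serve loose, strict, phrase, and fuzzy lookups. The helpers below are a hypothetical convenience layer (none of these function names come from Neo4j) that builds such strings and escapes Lucene's special characters; treat it as a sketch under that assumption.

```python
import re

# Characters that carry special meaning in Lucene's classic query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape(term):
    """Backslash-escape Lucene special characters in a single term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def any_term(text):
    """Match nodes containing any of the words (Lucene's default OR)."""
    return " ".join(escape(t) for t in text.split())


def all_terms(text):
    """Match nodes containing every word."""
    return " AND ".join(escape(t) for t in text.split())


def phrase(text):
    """Match the words as an exact phrase."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def fuzzy(text):
    """Tolerate small spelling differences in each word."""
    return " ".join(escape(t) + "~" for t in text.split())


print(any_term("Royal Bank"))   # Royal Bank
print(all_terms("Royal Bank"))  # Royal AND Bank
print(phrase("Royal Bank"))     # "Royal Bank"
print(fuzzy("Royl Bank"))       # Royl~ Bank~
```

Any of these strings could be supplied as the second argument of the `queryNodes` call shown above; the plain `'Royal Bank'` used there corresponds to `any_term`.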

Neo4j now offers three search paradigms in one engine — pick by what you’re matching:

| Search | Index | Matches on |
| --- | --- | --- |
| Semantic similarity | vector | meaning (conceptually similar text) |
| Keyword / string | full-text | spelling (exact, partial, fuzzy) |
| Exact lookup | property | a known identifier |

How does a full-text index differ from a vector index, and when would you reach for each? KGR-6.3

A vector index finds semantically similar content (close embeddings) — good for “find chunks about this topic”. A full-text index does string matching — exact, partial, and fuzzy keyword search with relevance scoring — good for “find the manager named like Royal Bank”, where spelling, not meaning, is the query. They’re complementary; a third option, a property index, handles exact lookup by a known identifier.

Investment relationships and the complete schema

Each CSV row becomes an OWNS_STOCK_IN edge from manager to company, carrying shares, value, and the reporting quarter — 561 of them, one per holding:

MATCH (mgr:Manager {managerCik: $p.managerCik}), (com:Company {cusip6: $p.cusip6})
MERGE (mgr)-[owns:OWNS_STOCK_IN {reportCalendarOrQuarter: $p.quarter}]->(com)
ON CREATE SET owns.value = $p.value, owns.shares = $p.shares
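Because the reporting quarter sits inside the MERGE pattern, it is part of the edge's identity. The sketch below mimics that with a dictionary keyed on manager, company, and quarter; the rows are made up for illustration.

```python
def merge_holdings(rows):
    """Simulate MERGE on OWNS_STOCK_IN with the quarter as part of the key."""
    edges = {}
    for p in rows:
        key = (p["managerCik"], p["cusip6"], p["quarter"])
        # Only the first row for a key sets shares and value (ON CREATE SET).
        edges.setdefault(key, {"shares": p["shares"], "value": p["value"]})
    return edges


# Made-up rows: one manager reporting in two quarters, plus a duplicate row.
rows = [
    {"managerCik": "1001", "cusip6": "123456", "quarter": "Q1", "shares": 100, "value": 5000},
    {"managerCik": "1001", "cusip6": "123456", "quarter": "Q2", "shares": 150, "value": 8000},
    {"managerCik": "1001", "cusip6": "123456", "quarter": "Q1", "shares": 100, "value": 5000},
]

edges = merge_holdings(rows)
print(len(edges))  # 2
```

The same manager and company yield two edges, one per quarter, while the repeated Q1 row is absorbed, so re-running the load does not duplicate holdings.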

Multi-hop traversal and graph-to-text

Now a single query crosses both datasets — from a text chunk all the way to the investors in its company:
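A query of that shape, reconstructed from the relationships defined in this chapter, might look like the string below. It is a sketch rather than the course's exact query: the `chunkId` property and parameter name, the ordering, and the limit are assumptions, while the labels, relationship types, and returned property names follow the statements shown earlier.

```python
# Hypothetical reconstruction; `chunkId` is an assumed property on Chunk nodes.
CHUNK_TO_MANAGERS = """
MATCH (c:Chunk {chunkId: $chunkId})-[:PART_OF]->(f:Form)
      <-[:FILED]-(com:Company)
      <-[owns:OWNS_STOCK_IN]-(mgr:Manager)
RETURN mgr.managerName AS managerName,
       com.companyName AS companyName,
       owns.shares     AS shares
ORDER BY owns.shares DESC
LIMIT 10
"""

print(CHUNK_TO_MANAGERS.strip())
```

With the official `neo4j` Python driver this string could be passed to `session.run(CHUNK_TO_MANAGERS, chunkId=...)`; each returned row then carries the `managerName`, `companyName`, and `shares` keys.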

To feed traversal results to an LLM, convert each result row into a sentence — graph-to-text context generation:

sentence = f"{r['managerName']} owns {r['shares']} shares of {r['companyName']}"
context_sentences.append(sentence)
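Wrapped in a function, the two lines above become a reusable step. The sketch assumes each result row is a mapping with `managerName`, `shares`, and `companyName` keys; the sample rows are invented for illustration.

```python
def rows_to_context(rows):
    """Turn traversal result rows into newline-separated sentences for an LLM prompt."""
    context_sentences = []
    for r in rows:
        sentence = f"{r['managerName']} owns {r['shares']} shares of {r['companyName']}"
        context_sentences.append(sentence)
    return "\n".join(context_sentences)


# Invented rows standing in for query results.
rows = [
    {"managerName": "Example Capital LLC", "shares": 1200, "companyName": "EXAMPLE CORP"},
    {"managerName": "Sample Advisors", "shares": 300, "companyName": "EXAMPLE CORP"},
]

print(rows_to_context(rows))
# Example Capital LLC owns 1200 shares of EXAMPLE CORP
# Sample Advisors owns 300 shares of EXAMPLE CORP
```

The returned string can be appended to the retrieved chunk text before it is placed in the prompt.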

Trace the three hops (across four nodes) from a text chunk to an investment manager, naming each relationship and its direction. KGR-6.4

From a chunk: follow PART_OF forward to its Form; follow FILED backward (Form <-[:FILED]- Company) to the Company; follow OWNS_STOCK_IN backward (Company <-[:OWNS_STOCK_IN]- Manager) to each Manager. So (:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager) — chunk → form → company → manager, which counted NetApp’s 561 investors.

When does investment context actually help?

Two retrieval chains — plain vs investment-enhanced — answer differently depending on the question:

| Question | Plain chain | Investment chain |
| --- | --- | --- |
| “Tell me about NetApp” | general business description | ~the same (investor data not used, because it was not asked for) |
| “Who are NetApp’s investors?” | vague (“diversified customer base”) | specific investor names and share counts |

Key concept

Question–context alignment: targeted beats broad

KGR-6.6

Graph-augmented context helps only when the question matches the added data. Asking about investors with investment context produces specific, accurate answers; asking a general question with that same context adds noise the LLM ignores — paying traversal cost (latency, tokens) for no gain. The design rule: expand context to match the expected question distribution, not to add everything you can reach. [V] Verified
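One way to act on this rule is to gate the traversal on the question itself. The sketch below uses a crude keyword check as a stand-in for whatever intent classifier a real system would use; the term list and function names are assumptions made for illustration.

```python
import re

# Assumed trigger words; a real system might use a classifier instead.
_INVESTMENT_PATTERN = re.compile(
    r"\b(investors?|shareholders?|holdings?|stakes?)\b", re.IGNORECASE
)


def needs_investment_context(question):
    """Return True when the question appears to ask about ownership."""
    return bool(_INVESTMENT_PATTERN.search(question))


def build_context(question, chunk_text, investment_sentences):
    """Add investor sentences only when the question calls for them."""
    parts = [chunk_text]
    if needs_investment_context(question):
        parts.extend(investment_sentences)
    return "\n".join(parts)


chunk = "Chunk text about the company's business."
investors = ["Example Capital LLC owns 1200 shares of EXAMPLE CORP"]

print(build_context("Tell me about NetApp's business", chunk, investors))
print(build_context("Who are NetApp's investors?", chunk, investors))
```

The first call returns only the chunk text, so the general question pays no extra tokens; the second appends the investor sentence.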

Quantifying the payoff Worked example

Problem. An FAQ system sits at 72% satisfaction (target 85%). Of its 28% failures, 40% are product-mismatch and 35% are missing-comparison. If a graph schema cuts each of those two modes by 60%, does it hit target?

Reasoning. Mode 1 is $28\% \times 40\% = 11.2\%$ of all answers; Mode 2 is $28\% \times 35\% = 9.8\%$. A 60% cut removes $0.6 \times 11.2\% \approx 6.7\%$ and $0.6 \times 9.8\% \approx 5.9\%$. New failure rate: $28\% - 6.7\% - 5.9\% = 15.4\%$.

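Answer. Satisfaction rises to ≈ 84.6% ($100\% - 15.4\%$), which is 0.4 points short of the 85% target, so the schema alone gets close but does not quite hit it. Note the ceiling: only the 75% of failures that come from these two modes ($40\% + 35\%$) is addressable; the remaining 25% needs different fixes.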
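The same arithmetic as a small function, useful for trying other reduction rates. The function name and signature are illustrative.

```python
def projected_satisfaction(satisfaction, mode_shares, reduction):
    """Satisfaction after cutting each listed failure mode by `reduction`.

    mode_shares are fractions of the failures, not of all answers.
    """
    failure = 1.0 - satisfaction
    removed = sum(failure * share * reduction for share in mode_shares)
    return satisfaction + removed


result = projected_satisfaction(0.72, [0.40, 0.35], 0.60)
print(f"{result:.1%}")  # 84.6%

# Upper bound: eliminating both modes entirely.
ceiling = projected_satisfaction(0.72, [0.40, 0.35], 1.0)
print(f"{ceiling:.1%}")  # 93.0%
```

Even a complete fix of both modes tops out at 93%, because the remaining quarter of failures lies outside what the graph schema addresses.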

Summary

Form 13 data widens the graph from documents to the entities behind them. CUSIP 6 is the linking key that joins independently built Company and Form nodes — the cross-dataset pattern, with entity resolution as its fallback when no universal id exists. Full-text indexes add keyword search beside vector (semantic) and property (exact) search — three paradigms in one engine. OWNS_STOCK_IN and FILED edges make the graph multi-hop: one query runs chunk → form → company → manager, and graph-to-text turns the result into LLM context. But the payoff is conditional — graph context lifts answers only when the question aligns with the data you added.

Chapter 7 — the capstone: an LLM generates its own Cypher from a natural-language question, closing the loop from question to query to graph to answer.