Part 1 Chapter 6 Last verified 2026-06-19

Expanding the Knowledge Graph

Widening the graph's scope with a second dataset — SEC Form 13 investment holdings: Company and Manager nodes, the CUSIP bridge that links them to existing forms, FILED and OWNS_STOCK_IN relationships, full-text indexes alongside vector search, multi-hop traversal from chunk to manager, graph-to-text context, and when investment-enhanced RAG actually helps.

On this page
  1. A second dataset: SEC Form 13
  2. Linking the datasets with CUSIP
  3. Full-text search alongside vectors
  4. Investment relationships and the complete schema
  5. Multi-hop traversal and graph-to-text
  6. When does investment context actually help?
  7. Summary

A second dataset: SEC Form 13

Form 13 is filed quarterly by institutional investment firms to report their public-company holdings. Each row carries a manager (name, CIK, address), a company (name, cusip6), and the position (shares, value, quarter). The course dataset has 561 rows — all investments in NetApp.

MERGE (com:Company {cusip6: $p.cusip6})
ON CREATE SET com.companyName = $p.companyName, com.cusip = $p.cusip

MERGE (mgr:Manager {managerCik: $p.managerCik})
ON CREATE SET mgr.managerName = $p.managerName, mgr.managerAddress = $p.managerAddress
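
The two MERGE statements above each take one `$p` parameter map per CSV row. As a minimal sketch of how such maps might be built in Python: the column names below are assumed to mirror the property names in the Cypher, and the two sample rows are made-up placeholders, not real filings.

```python
import csv
import io

# Hypothetical sample in the assumed CSV layout; every value is a placeholder.
SAMPLE_CSV = """managerCik,managerName,managerAddress,companyName,cusip6,cusip,shares,value,reportCalendarOrQuarter
0000000001,Example Capital LLC,"1 Example St, Springfield",Example Corp,12345A,12345A101,1000,50000.0,2023-06-30
0000000002,Sample Advisors Inc,"2 Sample Ave, Springfield",Example Corp,12345A,12345A101,250,12500.0,2023-06-30
"""


def row_to_params(row: dict) -> dict:
    """Turn one CSV row (all strings) into the map passed to Cypher as $p."""
    return {
        "managerCik": row["managerCik"],  # kept as a string to preserve leading zeros
        "managerName": row["managerName"],
        "managerAddress": row["managerAddress"],
        "companyName": row["companyName"],
        "cusip6": row["cusip6"],
        "cusip": row["cusip"],
        "shares": int(row["shares"]),
        "value": float(row["value"]),
        "quarter": row["reportCalendarOrQuarter"],
    }


def load_params(text: str) -> list:
    """Parse CSV text into one parameter map per row."""
    return [row_to_params(r) for r in csv.DictReader(io.StringIO(text))]


params = load_params(SAMPLE_CSV)

# MERGE creates a node only when its key is new, so the number of
# distinct keys predicts how many nodes the load will produce.
company_keys = {p["cusip6"] for p in params}
manager_keys = {p["managerCik"] for p in params}
print(len(params), "rows,", len(company_keys), "company,", len(manager_keys), "managers")
```

Counting distinct keys before loading is a cheap sanity check: with a dataset where every row is a holding in the same company, the Company MERGE should create exactly one node however many rows there are.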

Linking the datasets with CUSIP

The Form 10-K graph (Module 4) and this Form 13 data were built independently — yet they connect with no manual matching, because both carry a CUSIP. Match on the shared cusip6 and create a FILED edge:

MATCH (com:Company), (f:Form) WHERE com.cusip6 = f.cusip6
MERGE (com)-[:FILED]->(f)

Key concept

Shared identifiers are the cross-dataset bridge

KGR-6.7

CUSIP 6, the six-character issuer prefix of a CUSIP, appears in both datasets, so a MATCH on cusip6 pairs each Company with its Form and MERGE creates the FILED edge automatically. This is the standard pattern for expanding a knowledge graph with new sources. When it breaks: if the two datasets use different identifier schemes (CUSIP here, ISIN or ticker there), automatic linking fails and you need an entity-resolution layer to reconcile identifiers before the match can succeed. The hard part of integration is rarely the join; it’s agreeing on the key. [V] Verified

Why can Company and Form nodes be linked automatically here, and what is needed when two datasets use different identifier schemes? KGR-6.7

Both datasets carry the CUSIP 6 identifier, so MATCH ... WHERE com.cusip6 = f.cusip6 MERGE (com)-[:FILED]->(f) connects them with no manual matching — a shared universal key is the bridge. When datasets use different schemes (CUSIP vs ISIN vs ticker), there’s no common key to MERGE on, so you need an entity-resolution layer that maps identifiers (or fuzzy-matches names + verifies with secondary attributes) before linking.
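
A very small entity-resolution step can sometimes be written as a normalizer that maps each incoming identifier to the shared key before the join. The sketch below relies on one fact: ISINs for US and Canadian securities embed the nine-character CUSIP after the two-letter country code. The ticker table is hypothetical, all identifiers are placeholders, and check digits are not validated.

```python
import re
from typing import Optional

# Hypothetical ticker crosswalk; a real one would come from a maintained reference dataset.
TICKER_TO_CUSIP6 = {"EXMP": "12345A"}

US_CA_ISIN = re.compile(r"^(US|CA)[0-9A-Z]{9}[0-9]$")  # country code + CUSIP-9 + check digit
CUSIP9 = re.compile(r"^[0-9A-Z]{9}$")
CUSIP6 = re.compile(r"^[0-9A-Z]{6}$")


def to_cusip6(identifier: str) -> Optional[str]:
    """Map an ISIN, CUSIP-9, CUSIP-6, or known ticker to a CUSIP-6, else None."""
    ident = identifier.strip().upper()
    if US_CA_ISIN.match(ident):
        return ident[2:8]  # skip the country code, keep the issuer prefix
    if CUSIP9.match(ident):
        return ident[:6]
    if CUSIP6.match(ident):
        return ident
    return TICKER_TO_CUSIP6.get(ident)  # None when the identifier is unknown


for raw in ["US12345A1019", "12345a101", "12345A", "exmp", "UNKNOWN"]:
    print(raw, "->", to_cusip6(raw))
```

Identifiers that come back as None are the ones that need the heavier fallback described above: fuzzy name matching verified against secondary attributes.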

Full-text search alongside vectors

Vector search matches meaning; to match strings — find a manager named like “Royal Bank” — you need a full-text index:

// Statement 1: create the index once
CREATE FULLTEXT INDEX fullTextManagerNames IF NOT EXISTS
FOR (m:Manager) ON EACH [m.managerName]

// Statement 2 (run separately): query the index by name
CALL db.index.fulltext.queryNodes('fullTextManagerNames', 'Royal Bank')
  YIELD node, score
RETURN node.managerName, score

Neo4j now offers three search paradigms in one engine — pick by what you’re matching:

| Search | Index | Matches on |
| --- | --- | --- |
| Semantic similarity | vector | meaning (conceptually similar text) |
| Keyword / string | full-text | spelling (exact, partial, fuzzy) |
| Exact lookup | property | a known identifier |

How does a full-text index differ from a vector index, and when would you reach for each? KGR-6.3

A vector index finds semantically similar content (close embeddings) — good for “find chunks about this topic”. A full-text index does string matching — exact, partial, and fuzzy keyword search with relevance scoring — good for “find the manager named like Royal Bank”, where spelling, not meaning, is the query. They’re complementary; a third option, a property index, handles exact lookup by a known identifier.
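
Neo4j full-text indexes accept Lucene query syntax, so the query string itself decides whether a search is exact-term or fuzzy. The helper below is a hedged sketch: `build_fulltext_query` and `escape_lucene` are illustrative names, not part of any Neo4j API, and requiring every term with AND is a design choice made for this example.

```python
import re

# Characters and operators that carry special meaning in Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_lucene(term: str) -> str:
    """Backslash-escape Lucene special characters so user text is matched literally."""
    return _LUCENE_SPECIAL.sub(r"\\\1", term)


def build_fulltext_query(text: str, fuzzy: bool = False) -> str:
    """Build a query string that requires every whitespace-separated term.

    With fuzzy=True each term gets Lucene's '~' suffix, which tolerates
    small spelling differences.
    """
    terms = [escape_lucene(t) for t in text.split()]
    if fuzzy:
        terms = [t + "~" for t in terms]
    return " AND ".join(terms)


print(build_fulltext_query("Royal Bank"))
print(build_fulltext_query("Royal Bank", fuzzy=True))
print(build_fulltext_query("Bank (UK)"))
```

The resulting string would be passed as the second argument to db.index.fulltext.queryNodes. Input words that are themselves uppercase operators (AND, OR, NOT) are not handled by this sketch.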

Investment relationships and the complete schema

Each CSV row becomes an OWNS_STOCK_IN edge from manager to company, carrying shares, value, and the reporting quarter — 561 of them, one per holding:

MATCH (mgr:Manager {managerCik: $p.managerCik}), (com:Company {cusip6: $p.cusip6})
MERGE (mgr)-[owns:OWNS_STOCK_IN {reportCalendarOrQuarter: $p.quarter}]->(com)
ON CREATE SET owns.value = $p.value, owns.shares = $p.shares
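
Because the MERGE above matches on both endpoint nodes plus the reportCalendarOrQuarter property, loading the same file twice should not duplicate edges. The following in-memory sketch mimics that behavior with a dictionary; `merge_owns` is a hypothetical stand-in for the Cypher statement, and the rows are placeholders.

```python
def merge_owns(edges: dict, p: dict) -> bool:
    """Mimic MERGE ... ON CREATE SET; return True only when an edge is created."""
    key = (p["managerCik"], p["cusip6"], p["quarter"])
    if key in edges:
        return False  # edge matched: ON CREATE SET does not run, properties stay as they were
    edges[key] = {"shares": p["shares"], "value": p["value"]}
    return True


# Placeholder rows, one per holding.
rows = [
    {"managerCik": "0000000001", "cusip6": "12345A", "quarter": "2023-06-30",
     "shares": 1000, "value": 50000.0},
    {"managerCik": "0000000002", "cusip6": "12345A", "quarter": "2023-06-30",
     "shares": 250, "value": 12500.0},
]

edges = {}
created_first = sum(merge_owns(edges, p) for p in rows)
created_again = sum(merge_owns(edges, p) for p in rows)  # simulated re-run of the load
print(created_first, created_again, len(edges))
```

One consequence worth noting: with ON CREATE SET alone, a corrected share count in a re-loaded file is not written to an existing edge. Updating matched edges would need an ON MATCH SET clause as well.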

Multi-hop traversal and graph-to-text

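Now a single query crosses both datasets, from a text chunk all the way to the investors in its company, by matching the path (:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager).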

To feed traversal results to an LLM, convert each result row into a sentence — graph-to-text context generation:

context_sentences = []
for r in results:  # results: the rows returned by the traversal query
    sentence = f"{r['managerName']} owns {r['shares']} shares of {r['companyName']}"
    context_sentences.append(sentence)

Trace the three hops from a text chunk to an investment manager, naming each relationship and its direction. KGR-6.4

From a chunk: follow PART_OF forward to its Form; follow FILED backward (Form <-[:FILED]- Company) to the Company; follow OWNS_STOCK_IN backward (Company <-[:OWNS_STOCK_IN]- Manager) to each Manager. So (:Chunk)-[:PART_OF]->(:Form)<-[:FILED]-(:Company)<-[:OWNS_STOCK_IN]-(:Manager): chunk → form → company → manager, three relationships across four node types, and the path used to count NetApp’s 561 investors.
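
To make the direction of each hop concrete, here is a hedged in-memory sketch of the same path followed by the graph-to-text step. The dictionaries are a toy stand-in for the graph with placeholder ids and values; a real implementation would run the Cypher pattern instead.

```python
# Toy stand-in for the graph; all ids, names, and share counts are placeholders.
part_of = {"chunk-1": "form-1"}        # (:Chunk)-[:PART_OF]->(:Form)
filed = {"12345A": "form-1"}           # (:Company)-[:FILED]->(:Form)
companies = {"12345A": "Example Corp"}
managers = {"mgr-1": "Example Capital LLC", "mgr-2": "Sample Advisors Inc"}
owns_stock_in = [                      # (:Manager)-[:OWNS_STOCK_IN]->(:Company)
    {"manager": "mgr-1", "company": "12345A", "shares": 1000},
    {"manager": "mgr-2", "company": "12345A", "shares": 250},
]


def investors_for_chunk(chunk_id: str) -> list:
    """Walk chunk -> form -> company -> manager and return one row per holding."""
    form_id = part_of[chunk_id]                                  # hop 1: PART_OF, forward
    company_ids = [c for c, f in filed.items() if f == form_id]  # hop 2: FILED, backward
    rows = []
    for edge in owns_stock_in:                                   # hop 3: OWNS_STOCK_IN, backward
        if edge["company"] in company_ids:
            rows.append({
                "managerName": managers[edge["manager"]],
                "shares": edge["shares"],
                "companyName": companies[edge["company"]],
            })
    return rows


rows = investors_for_chunk("chunk-1")
# Graph-to-text: one sentence per result row, ready to join into LLM context.
context = [f"{r['managerName']} owns {r['shares']} shares of {r['companyName']}" for r in rows]
print("\n".join(context))
```

The two backward hops are the ones that have to scan for incoming edges here, which is the work a graph database does natively with its stored relationships.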

When does investment context actually help?

Two retrieval chains — plain vs investment-enhanced — answer differently depending on the question:

| Question | Plain chain | Investment chain |
| --- | --- | --- |
| “Tell me about NetApp” | general business description | ~the same (investor data not used — not asked) |
| “Who are NetApp’s investors?” | vague (“diversified customer base”) | specific investor names and share counts |

Key concept

Question–context alignment: targeted beats broad

KGR-6.6

Graph-augmented context helps only when the question matches the added data. Asking about investors with investment context produces specific, accurate answers; asking a general question with that same context adds noise the LLM ignores, so you pay the traversal cost (latency, tokens) for no gain. The design rule: expand context to match the expected question distribution, not to add everything you can reach. [V] Verified
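
One simple way to act on this rule is to route each question to a chain before retrieval. The sketch below uses a keyword heuristic purely for illustration: the term list and the chain names "plain" and "investment" are assumptions, and a production router might use a classifier or an LLM call instead.

```python
import re

# Hypothetical vocabulary suggesting that a question is about ownership or investors.
_INVESTMENT_PATTERN = re.compile(
    r"\b(invest\w*|shareholder\w*|holding\w*|stake\w*|owns?|owner\w*)\b",
    re.IGNORECASE,
)


def needs_investment_context(question: str) -> bool:
    """Return True when the question appears to ask about investors or holdings."""
    return _INVESTMENT_PATTERN.search(question) is not None


def choose_chain(question: str) -> str:
    """Pick the retrieval chain so traversal cost is paid only when it can help."""
    return "investment" if needs_investment_context(question) else "plain"


for q in ["Tell me about NetApp", "Who are NetApp's investors?"]:
    print(q, "->", choose_chain(q))
```

Word-boundary matching keeps incidental substrings (for example "owns" inside "downsides") from triggering the more expensive chain.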

Summary

Form 13 data widens the graph from documents to the entities behind them. CUSIP 6 is the linking key that joins independently built Company and Form nodes — the cross-dataset pattern, with entity resolution as its fallback when no universal id exists. Full-text indexes add keyword search beside vector (semantic) and property (exact) search — three paradigms in one engine. OWNS_STOCK_IN and FILED edges make the graph multi-hop: one query runs chunk → form → company → manager, and graph-to-text turns the result into LLM context. But the payoff is conditional — graph context lifts answers only when the question aligns with the data you added.

  • Chapter 7 — the capstone: an LLM generates its own Cypher from a natural-language question, closing the loop from question to query to graph to answer.