Part 1 Chapter 4 Last verified 2026-06-19

Constructing a Knowledge Graph from Text

Building a knowledge graph from SEC Form 10-K filings: extracting sections (Item 1/1A/7/7A), chunking text with LangChain's CharacterTextSplitter, designing chunk metadata, idempotent node creation with uniqueness constraints and MERGE, wiring a Neo4jVector + retrieval-QA RAG chain, and preventing hallucination — ending with why a pure vector store still needs the relationships added in Module 5.

SEC Form 10-K: the source data

Pre-processing extracts the useful fields from the raw XML into JSON. Two identifiers and four text sections do most of the work:

| Field | Type | What it holds |
| --- | --- | --- |
| CIK | identifier | Central Index Key, the SEC’s company id |
| CUSIP | identifier | security id (first 6 digits link a company across datasets) |
| Item 1 | text | business description |
| Item 1A | text | risk factors |
| Item 7 | text | management discussion & analysis |
| Item 7A | text | market-risk disclosures |
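A minimal sketch of that extraction step, assuming the pre-processed JSON uses the key names `item1`, `item1a`, `item7`, `item7a`, `cik`, and `cusip` (the key names, the helper `extract_fields`, and the sample values are all illustrative assumptions, not the actual filing schema):

```python
import json

# Assumed JSON key names for the four text sections.
SECTION_KEYS = ("item1", "item1a", "item7", "item7a")


def extract_fields(filing: dict) -> dict:
    """Pull the two identifiers and the non-empty text sections from one filing."""
    cusip = filing.get("cusip", "")
    return {
        "cik": filing["cik"],
        # cusip6 is the issuer prefix: the first 6 characters of the CUSIP.
        "cusip6": filing.get("cusip6") or cusip[:6],
        # Skip sections that are missing or empty in this filing.
        "sections": {key: filing[key] for key in SECTION_KEYS if filing.get(key)},
    }


# Placeholder filing; none of these values come from a real 10-K.
sample = json.loads("""
{
  "cik": "0000000000",
  "cusip": "123456789",
  "item1": "Business description text.",
  "item1a": "Risk factor text.",
  "item7": "",
  "item7a": "Market risk text."
}
""")

fields = extract_fields(sample)
print(fields["cusip6"], list(fields["sections"]))
```

Dropping empty sections at this stage keeps the later chunking loop from producing empty chunks.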

Name four key sections of a Form 10-K and the two identifiers used to link companies across datasets. KGR-4.1

The four text sections: Item 1 (business description), Item 1A (risk factors), Item 7 (management discussion & analysis), Item 7A (market-risk disclosures). The two linking identifiers: CIK (the SEC’s Central Index Key) and CUSIP (the security identifier; the first 6 digits, cusip6, identify the issuer).

Chunking the text

Each section is far too long to embed whole, so it is split into overlapping chunks with LangChain’s CharacterTextSplitter:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_text(item1_text)   # Item 1 alone -> 254 chunks

| Parameter | Value | Why |
| --- | --- | --- |
| chunk_size | 2000 chars | balances context richness against embedding specificity |
| chunk_overlap | 200 chars | ~10% overlap so no context is lost at a boundary |

The number of chunks follows from the effective advance — each new chunk moves forward by chunk_size − overlap, not the full size:

$$\text{chunks} \approx \left\lceil \frac{L - O}{S - O} \right\rceil$$

where L is the text length, S is the chunk size, and O is the overlap.
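To make the formula concrete, the sketch below compares the estimate with a plain fixed-width sliding window. It is an illustration only: `estimate_chunks` and `sliding_chunks` are hypothetical helpers, and the window ignores separators, whereas CharacterTextSplitter splits on a separator first, so real chunk counts will usually differ somewhat.

```python
import math


def estimate_chunks(length: int, size: int = 2000, overlap: int = 200) -> int:
    """Estimate the chunk count as ceil((L - O) / (S - O)), with a floor of one chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if length <= 0:
        return 0
    return max(1, math.ceil((length - overlap) / (size - overlap)))


def sliding_chunks(text: str, size: int = 2000, overlap: int = 200) -> list:
    """Cut text into fixed-width windows that each advance by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


text = "x" * 10_000
windows = sliding_chunks(text)
print(len(windows), estimate_chunks(len(text)))  # prints "6 6"
```

With 10,000 characters the windows start at 0, 1800, 3600, 5400, 7200, and 9000, which matches the estimate of six chunks.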

Why does chunk overlap matter, and what is the effective per-chunk advance at size 2000 with overlap 200? KGR-4.2

Overlap carries context across boundaries, so a sentence split between two chunks still appears whole in at least one of them — without it, a fact straddling a boundary can be lost from both embeddings. The effective advance is chunk_size − overlap = 2000 − 200 = 1800 chars per chunk, which is what drives the chunk count.

Chunk metadata

A chunk is stored as a dictionary — text plus metadata that preserves where it came from and where it sits in the document:

chunk_record = {
    "text": "...",          # the chunk content
    "item": "item1",        # which section of the form
    "chunkSeqId": 0,        # position within the section
    "formId": "...",        # the filing this chunk belongs to
    "chunkId": "...",       # unique id for this chunk
    "source": "...",        # link back to the SEC filing
    "cik": "...",           # company CIK
    "cusip6": "..."         # company CUSIP (first 6 digits)
}

List the metadata fields stored with each chunk and explain why chunkSeqId matters for later modules. KGR-4.3

Fields: text, item (section), chunkSeqId (position in section), formId, chunkId (unique key), source, cik, cusip6. chunkSeqId preserves document order within a section, which Module 5 needs to build NEXT relationships linking each chunk to the next — enabling adjacent-context retrieval that a flat vector store can’t do.
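One way to assemble these records from the splitter output is sketched below. The helper name `build_chunk_records`, the sample arguments, and in particular the `chunkId` format are assumptions made for illustration; the only requirement taken from the text is that `chunkId` is unique and `chunkSeqId` records position within the section.

```python
def build_chunk_records(chunks, form_id, item, source, cik, cusip6):
    """Attach provenance metadata to each chunk of one section."""
    records = []
    for seq_id, text in enumerate(chunks):
        records.append({
            "text": text,
            "item": item,
            "chunkSeqId": seq_id,  # position within the section, starting at 0
            "formId": form_id,
            # Hypothetical id scheme: deterministic, so a rerun yields the same key.
            "chunkId": f"{form_id}-{item}-chunk{seq_id:04d}",
            "source": source,
            "cik": cik,
            "cusip6": cusip6,
        })
    return records


# Placeholder inputs; these are not values from a real filing.
records = build_chunk_records(
    ["first chunk", "second chunk"],
    form_id="form-0001",
    item="item1",
    source="https://example.com/filing",
    cik="0000000000",
    cusip6="123456",
)
print([r["chunkId"] for r in records])
```

Deriving the id from the form, the section, and the sequence number keeps it stable across reruns, which matters once the records are loaded with MERGE.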

Creating graph nodes idempotently

Before bulk loading, declare a uniqueness constraint. It prevents duplicate chunks and creates an implicit index that makes the next step’s existence checks fast:

CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE

Then load each chunk with MERGE on its stable chunkId, writing properties only on first creation:

MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
  c.text       = $chunkParam.text,
  c.item       = $chunkParam.item,
  c.chunkSeqId = $chunkParam.chunkSeqId,
  c.formId     = $chunkParam.formId

Key concept

Idempotent construction = exactly-once graph state

KGR-4.4

Production pipelines fail and retry, so ingestion is “at-least-once” by nature. An idempotent pipeline turns that into exactly-once graph state: a uniqueness constraint blocks duplicate nodes, MERGE on a stable id matches instead of inserting on re-run, and IF NOT EXISTS makes index/constraint creation safe to repeat. When it breaks: MERGE on a volatile property (a timestamp, a counter) creates a new node on every run, so MERGE only on stable identifiers.

Why create a uniqueness constraint before bulk-loading chunks with MERGE, and what does ON CREATE SET guarantee? KGR-4.4

The constraint guarantees no duplicate chunkId and creates an implicit index, so each MERGE does an index lookup instead of scanning all Chunk nodes (linear vs quadratic loading). ON CREATE SET writes the listed properties only when the node is first created — on a retry the node already exists, MERGE matches it, and the SET is skipped, so re-running the pipeline is safe.
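The retry behaviour can be sketched from Python as follows. `load_chunks` passes each record as the `chunkParam` query parameter through a session-like object's `run` method, which is how the official `neo4j` driver's sessions accept parameters. `FakeSession` is a hypothetical in-memory stand-in that only imitates MERGE with ON CREATE SET so the example runs without a database; it is not part of any library.

```python
MERGE_CHUNK = """
MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
  c.text       = $chunkParam.text,
  c.item       = $chunkParam.item,
  c.chunkSeqId = $chunkParam.chunkSeqId,
  c.formId     = $chunkParam.formId
"""


def load_chunks(session, records):
    """Run the MERGE once per record; calling it again after a failure is safe."""
    for record in records:
        session.run(MERGE_CHUNK, chunkParam=record)


class FakeSession:
    """Hypothetical stand-in for a database session, keyed on chunkId."""

    def __init__(self):
        self.nodes = {}

    def run(self, query, **params):
        chunk = params["chunkParam"]
        # setdefault stores the node only if the key is absent,
        # mirroring MERGE matching an existing node and skipping ON CREATE SET.
        self.nodes.setdefault(chunk["chunkId"], dict(chunk))


session = FakeSession()
batch = [
    {"chunkId": "c-0", "text": "a", "item": "item1", "chunkSeqId": 0, "formId": "f"},
    {"chunkId": "c-1", "text": "b", "item": "item1", "chunkSeqId": 1, "formId": "f"},
]
load_chunks(session, batch)
load_chunks(session, batch)  # simulated retry of the same batch
print(len(session.nodes))  # prints 2
```

Against a real database the same `load_chunks` call would take a driver session instead of `FakeSession`; that path is not exercised here.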

Adding vector search to the chunks

The embedding step is exactly the Module 3 pattern, now over chunk text instead of movie taglines: create a vector index on Chunk.textEmbedding, then populate it with genai.vector.encode and db.create.setNodeVectorProperty. The index declares 1536 dimensions (OpenAI’s text-embedding-3-small) and cosine similarity, and must reach ONLINE before you query it.

CREATE VECTOR INDEX form10kChunks IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
  `vector.dimensions`: 1536,
  `vector.similarity_function`: 'cosine'
}}
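For intuition about what the index ranks by, here is a brute-force cosine search over toy vectors. This is a sketch only: the vectors are 3-dimensional placeholders rather than 1536-dimensional embeddings, the helper names are hypothetical, and a vector index normally answers the same question with an approximate nearest-neighbour search instead of scoring every node.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Score every chunk against the query and return the k best (id, score) pairs."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec))
              for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings standing in for Chunk.textEmbedding values.
toy_embeddings = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.0, 1.0, 0.0],
    "chunk-c": [1.0, 1.0, 0.0],
}
nearest = top_k([1.0, 0.1, 0.0], toy_embeddings, k=2)
print([chunk_id for chunk_id, _ in nearest])  # prints ['chunk-a', 'chunk-c']
```

Cosine similarity depends only on direction, not vector length, which is why it suits comparing embeddings of chunks with different amounts of text.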

Building the RAG chain

With chunks embedded, LangChain wraps the graph as a vector store and chains retrieval to an LLM.

Neo4jVector.from_existing_graph connects to the index you already built — you tell it the label, the text property, and the embedding property:

from langchain_neo4j import Neo4jVector          # older releases: langchain_community.vectorstores
from langchain_openai import OpenAIEmbeddings

neo4j_vector = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
    index_name="form10kChunks",
    node_label="Chunk",
    text_node_properties=["text"],
    embedding_node_property="textEmbedding",
)
retriever = neo4j_vector.as_retriever()

The retriever feeds a retrieval-QA chain that stuffs the retrieved chunks into the prompt and asks an LLM (temperature 0) to answer. Ask “What is NetApp’s primary business?” and it returns “enterprise storage and data management” — grounded in the filing.
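The "stuff" step itself is simple enough to sketch without LangChain. In the example below, `retrieve` and `generate` are hypothetical callables standing in for the retriever and the LLM, the prompt wording is an assumption, and `k=4` is an arbitrary choice; the point is only to show retrieved chunks being concatenated into one prompt.

```python
from typing import Callable, List

# Assumed prompt layout; a real chain may word this differently.
PROMPT = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(question: str,
           retrieve: Callable[[str, int], List[str]],
           generate: Callable[[str], str],
           k: int = 4) -> str:
    """Retrieve k chunks, stuff them into a single prompt, and ask the model."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    return generate(PROMPT.format(context=context, question=question))


def fake_retrieve(question: str, k: int) -> List[str]:
    """Stand-in retriever that returns fixed placeholder chunks."""
    return ["chunk one text", "chunk two text", "chunk three text"][:k]


def echo_generate(prompt: str) -> str:
    """Stand-in LLM that returns the prompt, so the stuffing is visible."""
    return prompt


result = answer("What is the company's primary business?", fake_retrieve, echo_generate, k=2)
print(result)
```

Because every retrieved chunk lands in one prompt, chunk size and `k` together determine how much of the model's context window retrieval consumes.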

What three things does Neo4jVector.from_existing_graph need to connect to an existing index, and what are the RAG chain's three stages? KGR-4.7

It needs the index name, the node label (Chunk), and which properties hold the text (text_node_properties) and the embedding (embedding_node_property) — plus the connection URL/credentials and an embeddings object. The three stages are connect (wrap Neo4j as a vector store), retrieve (encode the question, get top-K nearest chunks), and generate (LLM answers using those chunks as context).

When RAG hallucinates

Ask the same system “Tell me about Apple” — a company not in the data — and it answers confidently about Apple anyway, describing it with NetApp’s business. That is hallucination: the retriever returned the nearest available chunks (NetApp’s), and the prompt never told the model it could decline, so it filled the gap with plausible-but-wrong text.

The fix is layered — prompt engineering plus a retrieval check:

  1. Instruct refusal: “If the context doesn’t contain the answer, say you don’t know.”
  2. Scope the context: state that it comes from specific filings only.
  3. Gate on retrieval score: low similarity scores signal an out-of-scope question; drop the chunks before they reach the LLM.

template = """Use ONLY the following context to answer.
The context comes from SEC filings for specific companies.
If the context does not contain information about the company
or topic in the question, respond: "I don't have information
about that company in my data." Do NOT use general knowledge.

Context: {context}
Question: {question}
Answer:"""

A NetApp-only RAG system answers a question about Apple using NetApp's data. Name the failure mode and two mitigations. KGR-4.6

This is entity mis-attribution (applying retrieved entity-A context to entity B) — the retriever returned NetApp’s nearest chunks and the LLM answered anyway. Mitigations (any two): a prompt that says “use ONLY this context” and “say you don’t know” for missing entities; scoping the context to named filings; and retrieval-score gating — drop low-similarity chunks so out-of-scope questions get no context to misuse.
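Retrieval-score gating can be sketched as a filter in front of the LLM call. The threshold of 0.9 is an arbitrary placeholder that would need calibrating against real queries, and the helper names are hypothetical. The `(text, score)` pairs are assumed to come from a scored retrieval call such as a LangChain vector store's `similarity_search_with_score`, which returns documents paired with scores.

```python
from typing import Callable, List, Tuple

REFUSAL = "I don't have information about that company in my data."


def gate_by_score(scored_chunks: List[Tuple[str, float]], min_score: float) -> List[str]:
    """Keep only chunks whose similarity score reaches the threshold."""
    return [text for text, score in scored_chunks if score >= min_score]


def answer_with_gate(question: str,
                     scored_chunks: List[Tuple[str, float]],
                     generate: Callable[[str], str],
                     min_score: float = 0.9) -> str:
    """Refuse without calling the LLM when no chunk is similar enough."""
    kept = gate_by_score(scored_chunks, min_score)
    if not kept:
        return REFUSAL
    context = "\n\n".join(kept)
    return generate(f"Context: {context}\nQuestion: {question}\nAnswer:")


def echo_generate(prompt: str) -> str:
    """Stand-in LLM that returns the prompt it was given."""
    return prompt


# Placeholder scores: one in-scope retrieval, one out-of-scope retrieval.
in_scope = [("relevant chunk", 0.94), ("weak chunk", 0.71)]
out_of_scope = [("unrelated chunk", 0.62), ("unrelated chunk 2", 0.58)]

print(answer_with_gate("in-scope question", in_scope, echo_generate))
print(answer_with_gate("out-of-scope question", out_of_scope, echo_generate))
```

Gating complements the refusal prompt: the prompt depends on the model following instructions, while the gate removes the misleading context before the model ever sees it.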

The limitation: a pure vector store

Key concept

Disconnected chunks are not yet a knowledge graph

KGR-4.8

At this point Neo4j holds embedded Chunk nodes and nothing else: there are no relationships. Functionally it is a vector store that happens to live in a graph database. That means you cannot follow a chunk back to its source filing, retrieve the adjacent chunk for expanded context, or traverse from a chunk to the company that filed it. Module 5 adds NEXT, PART_OF, and SECTION relationships to turn this flat collection into a true knowledge graph.

Name three things a pure vector store of chunks cannot do that adding relationships would enable. KGR-4.8

Without relationships you cannot: (1) follow a chunk back to its source document/filing; (2) retrieve the adjacent chunk (the NEXT one) for expanded context; (3) traverse from a chunk to related entities — the company that filed it, other filings, investors. All three require the NEXT/PART_OF/SECTION relationships added in Module 5.

Summary

SEC Form 10-K filings are rich, standardised documents — ideal source text. The construction pipeline is: extract sections (Item 1/1A/7/7A) → chunk with overlap so no context is lost → store each chunk with provenance metadata → MERGE into the graph behind a uniqueness constraint for idempotent, exactly-once loading → embed and index for vector search → wrap with Neo4jVector and a retrieval-QA chain. Prompt engineering (“use only this context; say you don’t know”) plus retrieval-score gating is what keeps the system from hallucinating on out-of-scope questions. But at this stage Neo4j is only a vector store: the chunks are disconnected.

  • Chapter 5: adding relationships (NEXT, PART_OF, and SECTION) to turn the chunk collection into a real knowledge graph.
  • Chapter 6: expanding the SEC graph with companies and investors (Form 13 data).
  • Chapter 7: answering questions by having an LLM generate Cypher against the connected graph.