Constructing a Knowledge Graph from Text
Building a knowledge graph from SEC Form 10-K filings: extracting sections (Item 1/1A/7/7A), chunking text with LangChain's CharacterTextSplitter, designing chunk metadata, idempotent node creation with uniqueness constraints and MERGE, wiring a Neo4jVector + retrieval-QA RAG chain, and preventing hallucination — ending with why a pure vector store still needs the relationships added in Module 5.
SEC Form 10-K: the source data
Pre-processing extracts the useful fields from the raw XML into JSON. Two identifiers and four text sections do most of the work:
| Field | Type | What it holds |
| --- | --- | --- |
| CIK | identifier | Central Index Key, the SEC’s company id |
| CUSIP | identifier | security id (first 6 digits link a company across datasets) |
| Item 1 | text | business description |
| Item 1A | text | risk factors |
| Item 7 | text | management discussion & analysis |
| Item 7A | text | market-risk disclosures |
Name four key sections of a Form 10-K and the two identifiers used to link companies across datasets. KGR-4.1
The four text sections: Item 1 (business description), Item 1A (risk factors), Item 7 (management discussion & analysis), Item 7A (market-risk disclosures). The two linking identifiers: CIK (the SEC’s Central Index Key) and CUSIP (the security identifier; the first 6 digits, cusip6, identify the issuer).
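As a rough illustration of the pre-processing output, the sketch below pulls the two identifiers and the four text sections out of one filing's JSON. The key names (`cik`, `cusip`, `item1`, `item1a`, `item7`, `item7a`) and the sample values are assumptions made for this example; the real JSON layout is not shown in this section.

```python
import json

# Hypothetical key names; adjust them to match the real pre-processed JSON.
SECTION_KEYS = ("item1", "item1a", "item7", "item7a")


def extract_fields(filing: dict) -> dict:
    """Return the identifiers and the non-empty text sections of one filing."""
    cusip = filing.get("cusip", "")
    return {
        "cik": filing.get("cik"),
        "cusip": cusip,
        # The first 6 characters of a CUSIP identify the issuer.
        "cusip6": cusip[:6],
        # Keep only sections that are present and non-empty.
        "sections": {k: filing[k] for k in SECTION_KEYS if filing.get(k)},
    }


# Made-up sample data, for illustration only.
sample = json.loads(
    '{"cik": "0000000000", "cusip": "123456789", '
    '"item1": "Business...", "item1a": "Risks...", '
    '"item7": "", "item7a": "Market risk..."}'
)
fields = extract_fields(sample)
print(fields["cusip6"], sorted(fields["sections"]))
```

Dropping empty sections early keeps the later chunking step from producing empty chunks.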
Chunking the text
Each section is far too long to embed whole, so it is split into overlapping
chunks with LangChain’s
CharacterTextSplitter:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
chunks = text_splitter.split_text(item1_text) # Item 1 alone -> 254 chunks
| Parameter | Value | Why |
| --- | --- | --- |
| chunk_size | 2000 chars | balances context richness against embedding specificity |
| chunk_overlap | 200 chars | ~10% overlap so no context is lost at a boundary |
The number of chunks follows from the effective advance: each new chunk moves
forward by chunk_size − overlap, not the full size:

$$N \approx \left\lceil \frac{L - o}{c - o} \right\rceil$$

where $L$ is text length, $c$ is chunk size, and $o$ is overlap.
Why does chunk overlap matter, and what is the effective per-chunk advance at size 2000 with overlap 200? KGR-4.2
Overlap carries context across boundaries, so a sentence split between two chunks still appears whole in at least one of them — without it, a fact straddling a boundary can be lost from both embeddings. The effective advance is chunk_size − overlap = 2000 − 200 = 1800 chars per chunk, which is what drives the chunk count.
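The arithmetic can be checked with a small sketch. It models a naive fixed-stride window, which is a simplification: CharacterTextSplitter splits on a separator and then merges pieces, so its real chunk counts can differ somewhat. The function names and the 10,000-character placeholder text are assumptions for illustration.

```python
import math


def estimate_chunk_count(length: int, size: int, overlap: int) -> int:
    """Approximate count for a fixed-stride window: ceil((L - o) / (c - o))."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    return max(1, math.ceil((length - overlap) / (size - overlap)))


def sliding_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Naive splitter: each chunk starts (size - overlap) after the previous one."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this chunk reached the end of the text
            break
        start += step
    return chunks


text = "x" * 10_000  # placeholder text, not real filing content
chunks = sliding_chunks(text, size=2000, overlap=200)
print(len(chunks), estimate_chunk_count(len(text), 2000, 200))
```

With size 2000 and overlap 200 the window starts at 0, 1800, 3600 and so on, so both the splitter and the estimate give 6 chunks for 10,000 characters.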
Chunk metadata
A chunk is stored as a dictionary — text plus metadata that preserves where it came from and where it sits in the document:
chunk_record = {
"text": "...", # the chunk content
"item": "item1", # which section of the form
"chunkSeqId": 0, # position within the section
"formId": "...", # the filing this chunk belongs to
"chunkId": "...", # unique id for this chunk
"source": "...", # link back to the SEC filing
"cik": "...", # company CIK
"cusip6": "..." # company CUSIP (first 6 digits)
}
List the metadata fields stored with each chunk and explain why chunkSeqId matters for later modules. KGR-4.3
Fields: text, item (section), chunkSeqId (position in section), formId, chunkId (unique key), source, cik, cusip6. chunkSeqId preserves document order within a section, which Module 5 needs to build NEXT relationships linking each chunk to the next — enabling adjacent-context retrieval that a flat vector store can’t do.
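A minimal sketch of how such records might be assembled from a list of chunk strings. The `chunkId` format used here (form id, section, zero-padded position) is an assumption chosen so the id stays stable across runs; the helper name and the sample values are hypothetical as well.

```python
def build_chunk_records(chunks, item, form_id, source, cik, cusip6):
    """Attach provenance metadata to each chunk of one section."""
    records = []
    for seq_id, text in enumerate(chunks):
        records.append({
            "text": text,
            "item": item,
            "chunkSeqId": seq_id,  # position within the section
            "formId": form_id,
            # Hypothetical id scheme: depends only on filing, section, and
            # position, so re-running the pipeline yields the same id.
            "chunkId": f"{form_id}-{item}-chunk{seq_id:04d}",
            "source": source,
            "cik": cik,
            "cusip6": cusip6,
        })
    return records


# Made-up values, for illustration only.
records = build_chunk_records(
    ["first chunk", "second chunk"],
    item="item1",
    form_id="example-form",
    source="https://example.com/filing",
    cik="0000000000",
    cusip6="123456",
)
print([r["chunkId"] for r in records])
```

Deriving the id from stable inputs, rather than from a timestamp or a random value, is what lets a later MERGE on `chunkId` match existing nodes on a retry.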
Creating graph nodes idempotently
Before bulk loading, declare a uniqueness constraint. It prevents duplicate chunks and creates an implicit index that makes the next step’s existence checks fast:
CREATE CONSTRAINT unique_chunk IF NOT EXISTS
FOR (c:Chunk) REQUIRE c.chunkId IS UNIQUE
Then load each chunk with MERGE on its stable
chunkId, writing properties only on first creation:
MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
c.text = $chunkParam.text,
c.item = $chunkParam.item,
c.chunkSeqId = $chunkParam.chunkSeqId,
c.formId = $chunkParam.formId
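One way to drive that statement from Python is sketched below. To keep the example runnable without a database, the loader takes a `run_query` callable (a hypothetical injection point, not part of any library); with the official Neo4j Python driver, `session.run` accepts a query string and a parameter dictionary and could be passed in that role.

```python
MERGE_CHUNK_QUERY = """
MERGE (c:Chunk {chunkId: $chunkParam.chunkId})
ON CREATE SET
    c.text = $chunkParam.text,
    c.item = $chunkParam.item,
    c.chunkSeqId = $chunkParam.chunkSeqId,
    c.formId = $chunkParam.formId
"""


def load_chunks(run_query, records):
    """Send one MERGE per chunk record; return how many statements were sent."""
    sent = 0
    for record in records:
        # The whole record is passed as a single map parameter.
        run_query(MERGE_CHUNK_QUERY, {"chunkParam": record})
        sent += 1
    return sent


# Stand-in for a real session so the sketch runs offline.
sent_calls = []
count = load_chunks(
    lambda query, params: sent_calls.append((query, params)),
    [{"chunkId": "a", "text": "...", "item": "item1", "chunkSeqId": 0, "formId": "f"}],
)
print(count, sent_calls[0][1]["chunkParam"]["chunkId"])
```

Passing the record as one `$chunkParam` map keeps the Cypher text constant, so the database can reuse the same query plan for every chunk.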
Idempotent construction = exactly-once graph state
KGR-4.4 Production pipelines fail and retry, so ingestion is “at-least-once” by nature.
An idempotent pipeline turns that
into exactly-once graph state: a uniqueness constraint blocks duplicate
nodes, MERGE on a stable id matches instead of inserting on re-run, and
IF NOT EXISTS makes index/constraint creation safe to repeat. When it
breaks: MERGE on a volatile property (a timestamp, a counter) creates a new
node every run — always MERGE on stable identifiers only. [V] Verified
Why create a uniqueness constraint before bulk-loading chunks with MERGE, and what does ON CREATE SET guarantee? KGR-4.4
The constraint guarantees no duplicate chunkId and creates a backing index, so each MERGE does an index lookup instead of scanning all Chunk nodes (total load time grows roughly linearly with the number of chunks rather than quadratically). ON CREATE SET writes the listed properties only when the node is first created: on a retry the node already exists, MERGE matches it, and the SET is skipped, so re-running the pipeline is safe.
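The difference between merging on a stable key and on a volatile one can be seen in a toy model. This is plain Python standing in for the database, not Neo4j itself: a dictionary plays the graph, and `merge_node` is a hypothetical helper that mimics MERGE with ON CREATE SET.

```python
def merge_node(store: dict, key, props: dict) -> bool:
    """Toy stand-in for MERGE ... ON CREATE SET; True means a node was created."""
    if key in store:
        return False          # matched an existing node: the SET is skipped
    store[key] = dict(props)  # first creation: properties are written once
    return True


records = [{"chunkId": f"c{i}", "text": f"text {i}"} for i in range(3)]

# Stable key: the whole batch is retried, yet the node count stays at 3.
stable = {}
for attempt in range(2):
    for r in records:
        merge_node(stable, r["chunkId"], r)

# Anti-pattern: the key includes a value that changes on every run.
volatile = {}
for attempt in range(2):
    for r in records:
        merge_node(volatile, (r["chunkId"], attempt), r)

print(len(stable), len(volatile))  # 3 6
```

The second loop never matches an existing entry, which is the same reason a MERGE on a timestamp or counter creates a fresh node on every retry.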
Adding vector search to the chunks
The embedding step is exactly the Module 3 pattern,
now over chunk text instead of movie taglines: create a vector index on
Chunk.textEmbedding, then populate it with genai.vector.encode and
db.create.setNodeVectorProperty. The index declares 1536 dimensions
(OpenAI’s text-embedding-3-small) and cosine similarity, and must reach
ONLINE before you query it.
CREATE VECTOR INDEX form10kChunks IF NOT EXISTS
FOR (c:Chunk) ON (c.textEmbedding)
OPTIONS { indexConfig: {
`vector.dimensions`: 1536,
`vector.similarity_function`: 'cosine'
}}
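A hedged sketch of the populate step and an ONLINE check follows. The Cypher uses `genai.vector.encode` and `db.create.setNodeVectorProperty` as named above; the `model` configuration key and the `SHOW INDEXES YIELD name, state` query are written from memory of the Neo4j documentation and should be verified against the installed version. `run_query` is again a hypothetical callable standing in for a real session, and the API key and row data are made up.

```python
POPULATE_EMBEDDINGS_QUERY = """
MATCH (c:Chunk) WHERE c.textEmbedding IS NULL
WITH c, genai.vector.encode(
    c.text, "OpenAI", {token: $apiKey, model: $model}
) AS vector
CALL db.create.setNodeVectorProperty(c, "textEmbedding", vector)
"""

INDEX_STATE_QUERY = "SHOW INDEXES YIELD name, state"


def populate_embeddings(run_query, api_key, model="text-embedding-3-small"):
    """Embed every chunk that has no vector yet, so re-runs skip finished chunks."""
    run_query(POPULATE_EMBEDDINGS_QUERY, {"apiKey": api_key, "model": model})


def index_is_online(rows, index_name="form10kChunks"):
    """rows: dicts with 'name' and 'state', e.g. the result of INDEX_STATE_QUERY."""
    return any(r["name"] == index_name and r["state"] == "ONLINE" for r in rows)


# Offline stand-ins: a recording callable and made-up index rows.
issued = []
populate_embeddings(lambda q, p: issued.append((q, p)), api_key="sk-placeholder")
rows = [{"name": "form10kChunks", "state": "POPULATING"}]
print(issued[0][1]["model"], index_is_online(rows))
```

Whatever model fills the index must also be the one used to embed questions at query time; vectors from different models are not comparable even when their dimensions match.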
Building the RAG chain
With chunks embedded, LangChain wraps the graph as a vector store and chains retrieval to an LLM:
Neo4jVector.from_existing_graph connects to the
index you already built — you tell it the label, the text property, and the
embedding property:
from langchain_neo4j import Neo4jVector # see currency note below
from langchain_openai import OpenAIEmbeddings
neo4j_vector = Neo4jVector.from_existing_graph(
OpenAIEmbeddings(),
url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD,
index_name="form10kChunks",
node_label="Chunk",
text_node_properties=["text"],
embedding_node_property="textEmbedding",
)
retriever = neo4j_vector.as_retriever()
The retriever feeds a retrieval-QA chain that stuffs the retrieved chunks into the prompt and asks an LLM (temperature 0) to answer. Ask “What is NetApp’s primary business?” and it returns “enterprise storage and data management” — grounded in the filing.
What three things does Neo4jVector.from_existing_graph need to connect to an existing index, and what are the RAG chain's three stages? KGR-4.7
It needs the index name, the node label (Chunk), and which properties hold the text (text_node_properties) and the embedding (embedding_node_property) — plus the connection URL/credentials and an embeddings object. The three stages are connect (wrap Neo4j as a vector store), retrieve (encode the question, get top-K nearest chunks), and generate (LLM answers using those chunks as context).
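The retrieve and generate stages can be sketched without any library. This is a dependency-free illustration of what a "stuff"-style chain does, not LangChain's implementation; `answer_question`, the prompt text, and the two stand-in callables are assumptions. In the real chain those two roles are played by the Neo4jVector retriever and the LLM.

```python
from typing import Callable, Sequence

PROMPT = "Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer_question(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve chunks, stuff them all into one prompt, and ask the model."""
    chunks = retrieve(question)        # stage 2: nearest chunks for the question
    context = "\n\n".join(chunks)      # "stuff": concatenate every chunk
    prompt = PROMPT.format(context=context, question=question)
    return generate(prompt)            # stage 3: the model answers from context


# Stand-ins so the sketch runs offline.
def fake_retrieve(question: str) -> list[str]:
    return ["placeholder chunk A", "placeholder chunk B"]


def fake_generate(prompt: str) -> str:
    return prompt  # echo the prompt so it can be inspected


result = answer_question("What is the primary business?", fake_retrieve, fake_generate)
print(result)
```

Because every retrieved chunk is concatenated into a single prompt, the number of chunks returned by the retriever directly bounds the prompt size.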
When RAG hallucinates
Ask the same system “Tell me about Apple” — a company not in the data — and it answers confidently about Apple anyway, describing it with NetApp’s business. That is hallucination: the retriever returned the nearest available chunks (NetApp’s), and the prompt never told the model it could decline, so it filled the gap with plausible-but-wrong text.
The fix is layered — prompt engineering plus a retrieval check:
1. Instruct refusal: “If the context doesn’t contain the answer, say you don’t know.”
2. Scope the context: state that it comes from specific filings only.
3. Gate on retrieval score: low similarity scores signal an out-of-scope question; drop the chunks before they reach the LLM.
template = """Use ONLY the following context to answer.
The context comes from SEC filings for specific companies.
If the context does not contain information about the company
or topic in the question, respond: "I don't have information
about that company in my data." Do NOT use general knowledge.
Context: {context}
Question: {question}
Answer:"""
A NetApp-only RAG system answers a question about Apple using NetApp's data. Name the failure mode and two mitigations. KGR-4.6
This is entity mis-attribution (applying retrieved entity-A context to entity B) — the retriever returned NetApp’s nearest chunks and the LLM answered anyway. Mitigations (any two): a prompt that says “use ONLY this context” and “say you don’t know” for missing entities; scoping the context to named filings; and retrieval-score gating — drop low-similarity chunks so out-of-scope questions get no context to misuse.
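Retrieval-score gating can be sketched as a filter in front of the LLM call. The helper names, the scores, and the 0.9 threshold are all made up for illustration; a real threshold has to be tuned on in-scope and out-of-scope questions. In LangChain, (document, score) pairs of this shape can typically be obtained from a vector store's `similarity_search_with_score` method.

```python
REFUSAL = "I don't have information about that company in my data."


def gate_by_score(scored_chunks, min_score):
    """Keep only the chunk texts whose similarity score reaches min_score."""
    return [text for text, score in scored_chunks if score >= min_score]


def answer_or_refuse(question, scored_chunks, generate, min_score=0.9):
    """Refuse without calling the model when no chunk passes the gate."""
    kept = gate_by_score(scored_chunks, min_score)
    if not kept:
        return REFUSAL
    context = "\n\n".join(kept)
    return generate(f"Context: {context}\nQuestion: {question}\nAnswer:")


# Made-up scores, for illustration only.
in_scope = [("chunk A", 0.94), ("chunk B", 0.91), ("chunk C", 0.72)]
out_of_scope = [("chunk A", 0.81), ("chunk D", 0.78)]


def echo(prompt):
    return prompt  # stand-in for an LLM call


print(gate_by_score(in_scope, 0.9))
print(answer_or_refuse("Tell me about Apple", out_of_scope, echo))
```

Gating complements the refusal prompt: the prompt relies on the model following instructions, while the gate removes misleading context before the model ever sees it.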
The limitation: a pure vector store
Disconnected chunks are not yet a knowledge graph
KGR-4.8 At this point Neo4j holds embedded Chunk nodes and nothing else: no
relationships. Functionally it is a vector store that happens to live in a graph
database. That means you cannot: follow a chunk back to its source filing,
retrieve the adjacent chunk for expanded context, or traverse from a chunk to
the company that filed it. Module 5 adds NEXT, PART_OF, and SECTION
relationships to turn this flat collection into a true knowledge graph. [V] Verified
Name three things a pure vector store of chunks cannot do that adding relationships would enable. KGR-4.8
Without relationships you cannot: (1) follow a chunk back to its source document/filing; (2) retrieve the adjacent chunk (the NEXT one) for expanded context; (3) traverse from a chunk to related entities — the company that filed it, other filings, investors. All three require the NEXT/PART_OF/SECTION relationships added in Module 5.
Summary
SEC Form 10-K filings are rich, standardised documents — ideal source text. The
construction pipeline is: extract sections (Item 1/1A/7/7A) → chunk with overlap
so no context is lost → store each chunk with provenance metadata → MERGE into
the graph behind a uniqueness constraint for idempotent, exactly-once loading →
embed and index for vector search → wrap with Neo4jVector and a retrieval-QA
chain. Prompt engineering (“use only this context; say you don’t know”) plus
retrieval-score gating is what keeps the system from hallucinating on
out-of-scope questions. But at this stage Neo4j is only a vector store: the
chunks are disconnected.
- Chapter 5: adding relationships. NEXT, PART_OF, and SECTION turn the chunk collection into a real knowledge graph.
- Chapter 6: expanding the SEC graph with companies and investors (Form 13 data).
- Chapter 7: answering questions by having an LLM generate Cypher against the connected graph.