Part 1 Chapter 1 Last verified 2026-06-19

Knowledge Graph Fundamentals

The graph data model — nodes, relationships, properties, labels — and how it maps to graph theory; why graph traversal complements vector similarity search for RAG; "identity through relationships"; knowledge graphs vs tabular storage; and a first taste of Cypher text notation.

On this page

Why knowledge graphs matter for RAG
The graph data model
Graph-theory equivalences
Knowledge graphs vs tabular storage
A first taste of Cypher
Where knowledge graphs are used
The course roadmap
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Knowledge graphs vs tabular storage; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

A user asks “who invested in NetApp?” Predict why plain vector (similarity) search struggles, even with perfect embeddings.
In a graph, a Person can be an actor, a director, or a writer. Guess whether you’d model those as three node labels or three relationships.
Compare a 3-hop “friends of friends of friends” query in SQL vs a graph. Predict which stays cheap as hops grow.
You have 10M support tickets, each with exactly one agent and one customer, and you want “find similar past tickets.” Predict whether a knowledge graph earns its keep here.

Check your answers

Similarity finds text like your question, but “who invested” requires following connections (NetApp → filings → investors) — a structural traversal embeddings can’t perform.
Relationships — a person is an actor because they have an ACTED_IN relationship. Role comes from connections, not labels (“identity through relationships”).
The graph: a $d$-hop traversal touches roughly $O(k^d)$ nodes for local fan-out $k$, regardless of total graph size, while SQL needs $d$ self-JOINs (roughly $O(n^d)$ in the worst case, for a table of $n$ rows) that blow up with depth.
No — the relationships are simple (one agent, one customer) and the real need is similarity; a relational DB + vector index is simpler. A KG earns its keep on connected, multi-hop data.

Why knowledge graphs matter for RAG

In a basic RAG system documents become chunks, chunks become vectors, and retrieval is nearest-neighbour by cosine similarity. That finds semantically similar text well — but it cannot discover structural connections between entities. Store those chunks in a knowledge graph and a new option opens: retrieve one chunk, then traverse relationships to reach connected chunks and entities that similarity search would never surface. [V] Verified

Name two retrieval strategies a knowledge graph enables beyond vector similarity search, and the kind of question each answers. KGR-1.2

Graph traversal (follow typed relationships from a starting node to connected entities — answers “what is connected to this?”) and multi-hop expansion (chain several relationships to reach indirectly-related context — answers “what is connected to the things connected to this?”). Vector search only answers “what text is similar?”, so the graph adds connection-based retrieval that embedding distance can’t.
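The retrieve-then-traverse idea can be sketched in plain Python. This is a minimal illustration rather than how any particular graph database implements it: the chunk texts, the 2-D "embeddings", the node ids, and the FILED_BY relationship type are all made up for the example.

```python
import math

# Hypothetical toy data: three chunks with made-up 2-D "embeddings".
CHUNKS = {
    "c1": {"text": "NetApp designs data storage systems.", "vec": [0.9, 0.1]},
    "c2": {"text": "Acme Capital manages institutional funds.", "vec": [0.1, 0.9]},
    "c3": {"text": "NetApp reported annual revenue growth.", "vec": [0.8, 0.3]},
}

# Typed, directed relationships as (start, type, end) triples.
# FILED_BY and all node ids are assumptions made for this sketch.
RELATIONSHIPS = [
    ("c1", "PART_OF", "form_netapp"),
    ("c3", "PART_OF", "form_netapp"),
    ("form_netapp", "FILED_BY", "company_netapp"),
    ("manager_acme", "OWNS_STOCK_IN", "company_netapp"),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def nearest_chunk(query_vec):
    """Vector search: the chunk id whose embedding is most similar to the query."""
    return max(CHUNKS, key=lambda cid: cosine(query_vec, CHUNKS[cid]["vec"]))


def neighbours(node):
    """Nodes exactly one relationship away, ignoring direction."""
    found = set()
    for start, _rel_type, end in RELATIONSHIPS:
        if start == node:
            found.add(end)
        elif end == node:
            found.add(start)
    return found


def expand(start, hops):
    """Multi-hop expansion: every node within `hops` relationships of `start`."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        frontier = {n for f in frontier for n in neighbours(f)} - seen
        seen |= frontier
    return seen - {start}


hit = nearest_chunk([1.0, 0.0])   # similarity finds a NetApp chunk
one_hop = expand(hit, 1)          # graph traversal: the form the chunk belongs to
three_hops = expand(hit, 3)       # multi-hop: reaches the investing manager
```

In this toy graph the investor node carries no text resembling the query, so similarity alone would not surface it; it is reached only by following relationships from the retrieved chunk.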

The graph data model

Nodes are data records representing entities. In Cypher text notation they’re written in parentheses: (Person). Each node carries one or more labels (e.g. Person, Company, Movie) that group it with similar entities, plus zero or more properties.
Relationships connect two nodes and are themselves rich records: a start and end node, a type (ACTED_IN, KNOWS, OWNS_STOCK_IN), a direction, and optional properties. They are written in square brackets, with arrows showing the direction.

// Person named Andreas KNOWS Person named Andrew, since 2024
(Person {name: "Andreas"})-[:KNOWS {since: 2024}]->(Person {name: "Andrew"})

// Patterns chain to express richer scenarios: Andreas KNOWS Andrew and also TEACHES a course
(Person {name: "Andrew"})<-[:KNOWS]-(Person {name: "Andreas"})-[:TEACHES]->(Course {title: "KG for RAG"})
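To make the pieces of the model concrete, here is a small Python sketch of nodes and relationships as plain records. The class names, fields, and the `to_pattern` helper are hypothetical, chosen only for illustration; a real graph database stores these structures internally.

```python
from dataclasses import dataclass, field


def _props(properties):
    """Render a property map in the {key: value} form used in the patterns."""
    if not properties:
        return ""
    parts = [
        f'{key}: "{value}"' if isinstance(value, str) else f"{key}: {value}"
        for key, value in properties.items()
    ]
    return " {" + ", ".join(parts) + "}"


@dataclass
class Node:
    labels: tuple                                    # one or more labels
    properties: dict = field(default_factory=dict)   # zero or more properties

    def to_pattern(self):
        return "(" + ":".join(self.labels) + _props(self.properties) + ")"


@dataclass
class Relationship:
    start: Node                                      # direction runs start -> end
    type: str                                        # e.g. KNOWS, ACTED_IN
    end: Node
    properties: dict = field(default_factory=dict)   # optional properties

    def to_pattern(self):
        rel = f"-[:{self.type}{_props(self.properties)}]->"
        return self.start.to_pattern() + rel + self.end.to_pattern()


andreas = Node(("Person",), {"name": "Andreas"})
andrew = Node(("Person",), {"name": "Andrew"})
knows = Relationship(andreas, "KNOWS", andrew, {"since": 2024})
print(knows.to_pattern())
```

The printed string is the same KNOWS pattern shown in the notation above, which is the point: the text notation is just a compact rendering of these records.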

Key concept

Identity through relationships

KGR-1.3

A node’s role comes from its relationships, not from extra labels. A Person is an actor because it has an ACTED_IN relationship, a director because it has DIRECTED, a writer because it has WROTE — no separate Actor/Director/Writer labels needed. Direction and type are the semantics. When it breaks: when a role needs role-specific properties (an actor’s per-film salary vs a director’s budget), relationship-only identity pushes those onto the relationship, which can get unwieldy at scale. [V] Verified

Explain 'identity through relationships'. Why does a Person node not need separate Actor/Director labels? KGR-1.3

A node’s role is encoded by the typed relationships it participates in, not by a label or type field. A Person with an ACTED_IN relationship is an actor; the same node with a DIRECTED relationship is a director — so adding Actor/Director labels would be redundant with the relationships and risk a “label explosion.” Direction and relationship type carry the semantic meaning.
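A minimal sketch of the same idea in Python, assuming relationships are stored as (start, type, end) triples. The people, the movie, and the role mapping are placeholder data invented for the example.

```python
# Relationship type -> the role it implies for the start node.
ROLE_BY_RELATIONSHIP = {
    "ACTED_IN": "actor",
    "DIRECTED": "director",
    "WROTE": "writer",
}

# Hypothetical data: every start node is a plain Person, with no role labels.
RELATIONSHIPS = [
    ("Person A", "ACTED_IN", "Movie X"),
    ("Person A", "DIRECTED", "Movie X"),
    ("Person B", "WROTE", "Movie X"),
]


def roles_of(person):
    """Derive a person's roles from the relationships they start."""
    return {
        ROLE_BY_RELATIONSHIP[rel_type]
        for start, rel_type, _end in RELATIONSHIPS
        if start == person and rel_type in ROLE_BY_RELATIONSHIP
    }


print(roles_of("Person A"))  # both roles come from relationships alone
```

Giving a person a new role means adding a relationship, not relabelling the node, so one Person can hold several roles at once without any extra labels.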

Graph-theory equivalences

The vocabulary maps directly onto classic graph theory — interviewers use both, so know each: [V] Verified

| Graph theory | KG term | Why the KG term is preferred |
| --- | --- | --- |
| Vertex | Node | More intuitive for data modelling |
| Edge | Relationship | Conveys richness (type, direction, properties) |
| Graph | Knowledge graph | Emphasises semantic meaning and structure |
| Adjacency | Traversal | Describes following relationships through data |

Knowledge graphs vs tabular storage

The core trade-off is relationship complexity, not data size. A relational database represents relationships with foreign keys and reconstructs them with JOINs; a graph makes relationships first-class, so multi-hop questions become traversals instead of nested JOINs.

| Dimension | Relational DB | Knowledge graph |
| --- | --- | --- |
| Schema | Fixed (DDL required) | Flexible (schema-optional) |
| Relationships | Foreign keys + JOINs | First-class records |
| Multi-hop queries | Expensive (nested JOINs) | Natural (traversal) |
| Vector search | Bolt-on (e.g. pgvector) | Native (vector indexes) |
| Query language | SQL | Cypher |
| Best for | Transactional, tabular | Connected, semantic data |

The cost intuition: a $d$-hop traversal touches roughly $O(k^d)$ nodes, where $k$ is the average fan-out per node. It walks only the local neighbourhood of the starting node, whereas the SQL equivalent needs $d$ self-JOINs over the whole table.
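The fan-out bound can be checked on a synthetic graph. The sketch below assumes a made-up graph in which node $i$ points to the $k$ nodes $(i \cdot k + j) \bmod n$ for $j = 1..k$; it is an illustration of the counting argument, not a benchmark of any database.

```python
def out_neighbours(node, k, n):
    """Hypothetical synthetic graph: each node points to exactly k others."""
    return [(node * k + j) % n for j in range(1, k + 1)]


def nodes_touched(start, hops, k, n):
    """Breadth-first traversal: count distinct nodes reached within `hops`."""
    seen = {start}
    frontier = [start]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for nb in out_neighbours(node, k, n):
                if nb not in seen:
                    seen.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return len(seen) - 1  # exclude the start node itself


def fanout_bound(k, d):
    """Upper bound k + k^2 + ... + k^d, which is O(k^d) for k >= 2."""
    return sum(k ** i for i in range(1, d + 1))


small = nodes_touched(0, hops=3, k=3, n=1_000)
large = nodes_touched(0, hops=3, k=3, n=10_000_000)
print(small, large, fanout_bound(3, 3))
```

Both calls touch the same number of nodes even though the second graph is ten thousand times larger: the work depends on $k$ and $d$, not on $n$.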

Knowledge graph or relational database? Worked example

Problem. A customer-support system has 10M tickets, each assigned to exactly one agent and one customer, with a status field; you need “find similar past tickets.” KG or relational? Justify on schema flexibility, relationship complexity, and query pattern.

Reasoning.

Schema flexibility: the ticket schema is stable (ticket/agent/customer/status); new fields are a cheap ALTER TABLE. No KG advantage.
Relationship complexity: each ticket has one agent and one customer — simple foreign keys, no many-to-many or multi-hop. No KG advantage.
Query pattern: “find similar tickets” is semantic similarity, best served by a vector index, not traversal.

Answer. Use a relational database (e.g. PostgreSQL + pgvector). The data is tabular with simple relationships and the real need is similarity, not connection. A knowledge graph would add complexity without benefit. The lesson: choose the KG for relationship complexity, not for novelty — when relationships are simple look-ups, relational + vector is simpler and battle-tested.

On which single axis does the knowledge-graph-vs-relational decision most turn, and which way does each side point? KGR-1.4

Relationship complexity. If the data is highly interconnected with many-to-many or multi-hop relationships (and you query across them), the graph wins — traversal stays cheap where SQL JOINs explode. If relationships are simple foreign keys and the workload is tabular or pure similarity, a relational database (plus a vector index) is simpler and sufficient. Schema flexibility and native vector+graph queries are secondary tilts toward the graph.

A first taste of Cypher

Cypher is Neo4j’s declarative, pattern-matching query language: you draw the pattern you want and it returns every matching subgraph. The notation mirrors the whiteboard — () for nodes, [] for relationships, -> for direction. [V] Verified

// Find all people who acted in a movie
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name, m.title

Read it as: find every Person connected to a Movie by an ACTED_IN relationship, and return the person’s name and the movie’s title. The pattern syntax is the graph structure, which makes Cypher queries self-documenting (Module 2 is a full Cypher tutorial; Module 7 has an LLM generate Cypher automatically).
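To see what the MATCH pattern asks for, here is a rough Python emulation over an in-memory list of relationships. The node ids, names, and titles are placeholder data, and a real engine would use indexes and stored adjacency rather than this linear scan.

```python
# Hypothetical graph: nodes keyed by id, relationships as (start, type, end).
NODES = {
    1: {"label": "Person", "name": "Person A"},
    2: {"label": "Person", "name": "Person B"},
    3: {"label": "Movie", "title": "Movie X"},
    4: {"label": "Movie", "title": "Movie Y"},
}
RELATIONSHIPS = [
    (1, "ACTED_IN", 3),
    (1, "DIRECTED", 4),   # wrong type: must not match
    (2, "ACTED_IN", 4),
]


def match_acted_in():
    """Emulates: MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title"""
    rows = []
    for start, rel_type, end in RELATIONSHIPS:
        p, m = NODES[start], NODES[end]
        if rel_type == "ACTED_IN" and p["label"] == "Person" and m["label"] == "Movie":
            rows.append((p["name"], m["title"]))
    return rows


for name, title in match_acted_in():
    print(name, title)
```

Each returned row corresponds to one subgraph that fits the pattern: the labels, the relationship type, and the direction all have to match.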

Model a domain as a graph Worked example

Problem. Model a university catalogue: professors teach courses, students enrol in courses, courses belong to departments, some professors advise students. Give node labels, relationship types, and the Cypher for “Dr. Smith teaches Machine Learning 101 in the CS department.” Can a TA be modelled by relationships alone?

Reasoning. Entities → labels with properties; verbs → typed directed relationships.

// CREATE turns the pattern into a runnable statement that writes it to the graph
CREATE (smith:Professor {name: "Dr. Smith"})-[:TEACHES]->
(ml101:Course {title: "Machine Learning 101"})-[:BELONGS_TO]->
(cs:Department {name: "CS"})

Labels: Professor (name, email), Student (name, studentId), Course (title, code, credits), Department (name). Relationships: TEACHES, ENROLLED_IN, BELONGS_TO, ADVISES.

Answer. Yes: a TA is a Student with ENROLLED_IN relationships to the courses they take and an ASSISTS relationship to the course they help teach. ASSISTS is one more relationship type added to the list above; no separate TA label is needed. That’s “identity through relationships” again: model a role by the relationships present, not by adding labels, to avoid the label-explosion anti-pattern.

In Cypher text notation, what encloses a node, what encloses a relationship, and how is direction shown? Write the pattern for a Person who DIRECTED a Movie. KGR-1.6

Nodes are in parentheses (), relationships in square brackets [:TYPE], and direction with an arrow -> (or <-). Pattern: (p:Person)-[:DIRECTED]->(m:Movie). Labels follow a colon inside the node; properties go in {} on either a node or a relationship.

Where knowledge graphs are used

Give two real-world uses of knowledge graphs outside RAG, and what the graph encodes in each. KGR-1.5

Two of: web-search knowledge cards (entities like people/companies linked by typed relationships, aggregated into a summary panel); e-commerce recommendations (purchase/view relationships traversed for “also bought” suggestions); financial analysis (companies, investors, and filings as connected entities). In each, the graph encodes entities and the typed relationships between them, enabling look-ups and traversals that tabular storage handles awkwardly.
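The “also bought” case is a two-hop traversal: from a product to its buyers, then from those buyers to their other purchases. A minimal Python sketch, with customers and products invented for the example:

```python
from collections import Counter

# Hypothetical (customer, product) purchase relationships.
PURCHASED = [
    ("alice", "laptop"), ("alice", "mouse"),
    ("bob", "laptop"), ("bob", "dock"),
    ("carol", "laptop"), ("carol", "dock"),
    ("dave", "mouse"),
]


def also_bought(product):
    """Two hops: product <- buyers -> other products, ranked by co-purchase count."""
    buyers = {customer for customer, item in PURCHASED if item == product}
    counts = Counter(
        item
        for customer, item in PURCHASED
        if customer in buyers and item != product
    )
    return counts.most_common()


print(also_bought("laptop"))
```

The ranking falls out of the structure alone: no product descriptions or embeddings are involved, only who is connected to what.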

The course roadmap

The guide builds one capability per module onto the same graph (the course’s seven Modules map one-to-one to this guide’s Chapters — the terms are used interchangeably):

| Module | Topic | What gets added |
| --- | --- | --- |
| 1 | Fundamentals | Vocabulary: nodes, relationships, properties, labels |
| 2 | Querying | Cypher: MATCH, WHERE, CREATE, MERGE, DELETE |
| 3 | Preparing for RAG | Vector indexes, embeddings, similarity search |
| 4 | Graph construction | SEC 10-K chunks as nodes; vector search on the graph |
| 5 | Adding relationships | NEXT, PART_OF, SECTION; chunk-window retrieval |
| 6 | Expanding the graph | Form 13 data; Company/Manager nodes; full-text index |
| 7 | Chatting with the KG | LLM-generated Cypher; few-shot; combined retrieval |

Summary

A knowledge graph stores data as nodes and relationships, both carrying properties; labels group nodes, and relationship type + direction encode semantic meaning. That structure enables traversal-based retrieval that complements vector similarity — surfacing connections embeddings can’t. Cypher expresses queries as the very patterns you’d draw on a whiteboard. Reach for a graph when your questions are about connections across entities; for simple tabular look-ups, relational storage is simpler.

Chapter 2 — querying & Cypher: MATCH/MERGE/WHERE against a real dataset.
Chapters 3–6 — building a knowledge graph from SEC filings (vectors, then relationships, then expansion).
Chapter 7 — graph-RAG chat: an LLM writes the Cypher for you.