Part 1 Chapter 2 Last verified 2026-06-19

Querying Knowledge Graphs

A working Cypher tutorial — MATCH patterns with label/property filters and WHERE, one-hop and multi-hop relationship traversal (co-actors), modifying the graph with CREATE/MERGE/DELETE (and why MERGE is the idempotent default), connecting from Python via the LangChain Neo4j Graph class, and the parse → plan → execute → project execution model.

Connecting from Python

The notebook talks to Neo4j through LangChain’s Neo4jGraph class, which wraps the connection and exposes a query() method that sends Cypher and returns a list of Python dicts. It needs three connection parameters (URL, username, password), loaded from the environment and never hard-coded:

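import os
from langchain_community.graphs import Neo4jGraph

NEO4J_URI = os.environ["NEO4J_URI"]            # env var names assumed to match the constants
NEO4J_USERNAME = os.environ["NEO4J_USERNAME"]
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]
kg = Neo4jGraph(url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD)
result = kg.query("MATCH (n) RETURN count(n)")   # [{'count(n)': 171}]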
How does kg.query() connect a Python app to Neo4j, and what three parameters does the connection need? KGR-2.6

The LangChain Neo4jGraph class wraps a Neo4j driver: you construct it with a URL, a username, and a password (loaded from environment variables, not hard-coded), and it handles connection pooling and auth. Its query() method sends a Cypher string to the database and returns the results as a list of Python dictionaries — the bridge between application code and the graph.
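Because query() returns a list of dictionaries, each row is keyed by the text of the RETURN expression unless the query aliases it. The sketch below shows one way to read a scalar back with a stable key. It assumes only an object with a Neo4jGraph-style query(cypher) method; count_nodes and StubGraph are illustrative names, and the stub stands in for a live connection so the snippet runs without a database.

```python
from typing import Any, Dict, List, Optional


def count_nodes(graph: Any) -> int:
    """Return the total node count, read from an aliased result column."""
    # "AS total" gives the row a stable key instead of the raw text 'count(n)'.
    rows: List[Dict[str, Any]] = graph.query("MATCH (n) RETURN count(n) AS total")
    return rows[0]["total"]


class StubGraph:
    """Hypothetical stand-in for Neo4jGraph: same query() shape, canned answer."""

    def query(self, query: str, params: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        return [{"total": 171}]


node_total = count_nodes(StubGraph())
print(node_total)
```

With a real connection, the same call would be count_nodes(kg).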

The movie dataset

The training graph models Person and Movie nodes (171 nodes, 38 movies) joined by relationship types that make Chapter 1’s “identity through relationships” concrete — a person is an actor or a director purely by their relationships, not by a label:

| Relationship | Direction | Meaning |
| --- | --- | --- |
| ACTED_IN | Person → Movie | acted in |
| DIRECTED | Person → Movie | directed |
| WROTE | Person → Movie | wrote |
| REVIEWED | Person → Movie | reviewed |
| FOLLOWS | Person → Person | a reviewer follows another |

Person carries name, born; Movie carries title, tagline, released.

Reading the graph

A Cypher MATCH specifies a pattern; the engine binds variables to every matching element. Filter by label inside the parentheses, match a property with braces, or add conditions with WHERE:

MATCH (n) RETURN count(n) AS totalNodes              // all nodes (171)
MATCH (m:Movie) RETURN count(m) AS movies            // filter by label (38)
MATCH (tom:Person {name: "Tom Hanks"}) RETURN tom    // property match
MATCH (m:Movie) WHERE m.released > 1990 AND m.released < 2000 RETURN m.title   // range condition
Write a Cypher query that returns the titles of all movies released after 2000, and say which clause does the filtering. KGR-2.1

MATCH (m:Movie) WHERE m.released > 2000 RETURN m.title. The MATCH clause binds every Movie node; the WHERE clause filters them to those with released > 2000; RETURN m.title projects just the title. (A comparison like this needs WHERE: inline {...} property matching only tests equality.)
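When the cutoff year comes from application code, it can be passed as a query parameter rather than pasted into the Cypher string. This is a minimal sketch, assuming an object with a Neo4jGraph-style query(cypher, params=...) method; titles_released_after and StubGraph are hypothetical names, and the three sample rows exist only so the snippet runs without a database.

```python
from typing import Any, Dict, List, Optional

TITLES_AFTER = (
    "MATCH (m:Movie) "
    "WHERE m.released > $year "          # WHERE does the filtering
    "RETURN m.title AS title ORDER BY title"
)


def titles_released_after(graph: Any, year: int) -> List[str]:
    """Return movie titles released strictly after `year`."""
    rows = graph.query(TITLES_AFTER, params={"year": year})
    return [row["title"] for row in rows]


class StubGraph:
    """Hypothetical stand-in for Neo4jGraph that filters a tiny in-memory sample."""

    def __init__(self, released_by_title: Dict[str, int]) -> None:
        self.released_by_title = released_by_title

    def query(self, query: str, params: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        year = (params or {})["year"]
        return [
            {"title": title}
            for title, released in sorted(self.released_by_title.items())
            if released > year
        ]


sample = StubGraph({"The Matrix": 1999, "Cast Away": 2000, "Cloud Atlas": 2012})
recent_titles = titles_released_after(sample, 2000)
print(recent_titles)
```

Keeping the value in params keeps the query text constant across calls and keeps user input out of the Cypher string.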

Traversing relationships

Extend the pattern with relationship syntax to follow connections. The arrow gives direction — ACTED_IN goes Person → Movie:

// All movies Tom Hanks acted in
MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(movie:Movie)
RETURN tom.name, movie.title

Multi-hop traversal is where graphs earn their keep: chain relationships to reach indirect connections. To find Tom Hanks’s co-actors, go forward to a shared movie, then backward (the reverse arrow <-) from the co-actor:

// Co-actors: forward to the movie, backward from the co-actor
MATCH (tom:Person {name: "Tom Hanks"})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActor:Person)
RETURN coActor.name, m.title
Key concept

The shared node is the graph's JOIN

KGR-2.2

A two-hop pattern through a shared node, (a)-[:R]->(b)<-[:R]-(c), is the canonical way to discover indirect connections; the middle node b is the graph equivalent of a SQL JOIN, but written as a visual pattern instead of nested JOINs. When it breaks: traversals explode combinatorially on high-degree nodes. A movie with 50 actors yields on the order of 50^2 (about 2,500) co-actor pairs, and three hops can return millions of rows. Always bound results with LIMIT or a label/property filter.
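The pair count can be checked with a small in-memory model of the two-hop pattern. This is only a sketch in plain Python, not how Neo4j evaluates the pattern; co_actor_pairs and the 50-actor cast are made up for illustration. It assumes one ACTED_IN relationship per actor and movie, so the pattern never pairs an actor with themselves.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def co_actor_pairs(cast_by_movie: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Enumerate the (actor, co_actor, movie) rows a two-hop pattern would bind."""
    rows = []
    for movie, cast in cast_by_movie.items():
        # Every ordered pair of distinct actors sharing the movie is one match.
        for actor, co_actor in permutations(cast, 2):
            rows.append((actor, co_actor, movie))
    return rows


# Hypothetical high-degree node: one movie with 50 actors.
big_cast = {"Hypothetical Movie": [f"actor_{i}" for i in range(50)]}
pair_count = len(co_actor_pairs(big_cast))
print(pair_count)  # 50 * 49 = 2450 ordered pairs
```

Anchoring one end of the pattern, as the Tom Hanks query does, cuts this to the 49 rows for a single actor.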

Modifying the graph

MATCH + RETURN reads; CREATE, MERGE, and DELETE write. Reads dominate in production, but graph-construction pipelines (Modules 4–6) live on writes.

CREATE (andreas:Person {name: "Andreas"}) RETURN andreas       // unconditional insert

MATCH (a:Person {name: "Andreas"}), (e:Person {name: "Emil Eifrem"})
MERGE (a)-[r:KNOWS]->(e) RETURN r                              // idempotent: only if absent

MATCH (p:Person {name: "Emil Eifrem"})-[r:ACTED_IN]->(:Movie)
DELETE r                                                       // remove a relationship
Key concept

MERGE is the idempotent default

KGR-2.3

MERGE combines existence-check + create in a single clause, so re-processing the same data yields the same graph. The pipeline is idempotent, which is exactly what you need when data arrives from unreliable sources (API retries, replays, duplicate messages). Pair it with a uniqueness constraint, which also keeps concurrent MERGEs from creating duplicates, and use ON CREATE SET / ON MATCH SET to set initial vs updated properties. When it breaks: if you MERGE a pattern that includes a volatile property (a changing timestamp), MERGE treats each value as new and creates a fresh element every run. MERGE on the stable identifier only, then set the volatile fields in ON CREATE SET / ON MATCH SET.
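One way to put this into practice from Python is sketched below. It assumes an object with a Neo4jGraph-style query(cypher, params=...) method. The constraint name person_name_unique, the createdAt and updatedAt properties, and the upsert_person and RecordingGraph helpers are all hypothetical; the constraint statement uses Neo4j 5 syntax. The recording stub only captures what would be sent, so the snippet runs without a database.

```python
from typing import Any, Dict, List, Optional, Tuple

# Uniqueness constraint on the stable identifier (Neo4j 5 syntax, hypothetical name).
PERSON_CONSTRAINT = (
    "CREATE CONSTRAINT person_name_unique IF NOT EXISTS "
    "FOR (p:Person) REQUIRE p.name IS UNIQUE"
)

# MERGE on the stable key only; volatile fields are set afterwards.
UPSERT_PERSON = """
MERGE (p:Person {name: $name})
ON CREATE SET p.born = $born, p.createdAt = timestamp()
ON MATCH SET p.born = $born, p.updatedAt = timestamp()
RETURN p.name AS name
"""


def upsert_person(graph: Any, name: str, born: int) -> List[Dict[str, Any]]:
    """Create the Person if absent, otherwise update it; safe to retry."""
    return graph.query(UPSERT_PERSON, params={"name": name, "born": born})


class RecordingGraph:
    """Hypothetical stand-in for Neo4jGraph that records each call."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def query(self, query: str, params: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        self.calls.append((query, dict(params or {})))
        return [{"name": (params or {}).get("name")}]


recorder = RecordingGraph()
recorder.query(PERSON_CONSTRAINT)                    # run once at setup
upsert_rows = upsert_person(recorder, "Tom Hanks", 1956)
print(upsert_rows)
```

The timestamp never appears inside the MERGE pattern, so a changing value cannot cause a second node to be created.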

Compare CREATE and MERGE for a bulk-loading pipeline that may retry batches. Which is safe, and why? KGR-2.3

CREATE inserts unconditionally, so a retried batch duplicates nodes/relationships (or hits a constraint error). MERGE checks for the element first and creates only if absent, so re-processing the same data leaves the graph unchanged — it’s idempotent, the property a retry-prone pipeline needs. Use MERGE on a stable identifier and ON MATCH SET for volatile fields; reserve CREATE for guaranteed-new inserts where you want to skip the existence check.
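The difference under retries can be seen in a toy model that uses plain Python containers in place of the graph. This is an analogy only, not Neo4j's implementation: load_with_create behaves like an unconditional insert, load_with_merge like a keyed upsert, and both function names are invented for this sketch.

```python
from typing import Dict, List


def load_with_create(store: List[Dict[str, str]], batch: List[Dict[str, str]]) -> None:
    """CREATE-style load: every record becomes a new element, no questions asked."""
    for record in batch:
        store.append(dict(record))


def load_with_merge(store: Dict[str, Dict[str, str]], batch: List[Dict[str, str]],
                    key: str = "name") -> None:
    """MERGE-style load: look up by stable key, create only if absent, then update."""
    for record in batch:
        node = store.setdefault(record[key], {key: record[key]})
        node.update(record)


batch = [{"name": "Andreas"}, {"name": "Emil Eifrem"}]
created: List[Dict[str, str]] = []
merged: Dict[str, Dict[str, str]] = {}

for _attempt in range(2):  # original delivery plus one retry of the same batch
    load_with_create(created, batch)
    load_with_merge(merged, batch)

print(len(created), len(merged))  # 4 2
```

After the retry the CREATE-style store holds duplicates, while the MERGE-style store is unchanged from the first delivery.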

How a query executes

Understanding the pipeline helps you write fast queries. The engine handles every query in four stages: parse → plan → execute → project.

Prefix a query with EXPLAIN to see the plan without executing it, or with PROFILE to run it and see actual row counts per step.

Trace what the engine does for MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name, m.title LIMIT 5. KGR-2.7

Parse the string into a pattern (a Person linked by ACTED_IN to a Movie). Plan: use the Person/Movie labels (and any index) to choose where to start and how to expand. Execute: find Person nodes, traverse outgoing ACTED_IN relationships to Movie nodes, binding p and m — stopping early thanks to LIMIT 5. Project: return only p.name and m.title for those 5 matches. Pattern spec → match → projection.
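The early stop from LIMIT can be illustrated with lazy generators. The sketch below is a toy model under stated assumptions, not Neo4j's engine: the parse and plan stages are taken as already done (the pattern is fixed to Person-ACTED_IN->Movie), execute walks a small hand-written edge list, and execute, project, and run are hypothetical names.

```python
from itertools import islice
from typing import Dict, Iterator, List, Tuple

# Illustrative (person, movie) ACTED_IN edges standing in for the stored graph.
ACTED_IN: List[Tuple[str, str]] = [
    ("Tom Hanks", "Cast Away"),
    ("Tom Hanks", "Apollo 13"),
    ("Kevin Bacon", "Apollo 13"),
    ("Keanu Reeves", "The Matrix"),
    ("Carrie-Anne Moss", "The Matrix"),
    ("Meg Ryan", "Sleepless in Seattle"),
]


def execute(edges: List[Tuple[str, str]], visited: List[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Execute stage: bind (p, m) for each matching edge, one at a time."""
    for person, movie in edges:
        visited.append((person, movie))  # track how much work was actually done
        yield person, movie


def project(bindings: Iterator[Tuple[str, str]]) -> Iterator[Dict[str, str]]:
    """Project stage: keep only the returned columns."""
    for person, movie in bindings:
        yield {"p.name": person, "m.title": movie}


def run(edges: List[Tuple[str, str]], limit: int) -> Tuple[List[Dict[str, str]], int]:
    """Pull at most `limit` rows through the pipeline; report edges visited."""
    visited: List[Tuple[str, str]] = []
    rows = list(islice(project(execute(edges, visited)), limit))
    return rows, len(visited)


rows, edges_visited = run(ACTED_IN, limit=5)
print(len(rows), edges_visited)  # 5 5
```

Because each stage pulls rows on demand, the sixth edge is never visited once five rows have been produced.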

Cypher pattern cheat-sheet

Nine patterns cover most knowledge-graph query needs and recur through every later module:

| Need | Cypher |
| --- | --- |
| Count all nodes | MATCH (n) RETURN count(n) |
| Filter by label | MATCH (m:Movie) RETURN m |
| Property match | MATCH (p:Person {name: "Tom"}) RETURN p |
| Conditional | MATCH (m:Movie) WHERE m.released > 2000 RETURN m |
| One hop | MATCH (p)-[:ACTED_IN]->(m) RETURN p, m |
| Two hops | MATCH (a)-[:R]->(b)<-[:R]-(c) RETURN a, c |
| Create node | CREATE (n:Label {k: "v"}) RETURN n |
| Merge relationship | MATCH (a),(b) MERGE (a)-[:R]->(b) |
| Delete relationship | MATCH ()-[r:R]->() DELETE r |

Why should Cypher variable names be meaningful, given the engine runs the query the same regardless? KGR-2.5

Because the names are for humans, not the engine: (tom:Person)-[:ACTED_IN]->(m:Movie) reads as its own documentation, while (n1)-[:ACTED_IN]->(n2) forces the reader to reconstruct what each variable means. In multi-hop queries the difference is the gap between a pattern you can scan and one you have to decode — meaningful names are a maintainability investment with zero runtime cost.

Summary

Cypher mirrors the graph: () nodes, [] relationships, arrows for direction. MATCH binds patterns, labels and WHERE filter, dot notation projects. Multi-hop traversals through a shared node are the graph’s JOIN, natural where SQL needs nested self-JOINs. MERGE (not CREATE) is the idempotent default for writes. DELETE refuses to remove a node that still has relationships, and DETACH DELETE removes the node together with them, so no relationship is left dangling. These fundamentals power every later module, where you build, query, and extend a knowledge graph from SEC filings.

  • Chapter 3: preparing text for RAG, with vector indexes and embeddings on the graph.
  • Chapters 4–6: constructing and expanding the SEC knowledge graph.
  • Chapter 7: an LLM writes this Cypher for you.