Flashcards
Study the glossary as flashcards: shuffled, one term at a time — recall the definition, flip to check, and sort each card into "knew it" / "still learning".
Address Node
A graph node representing a physical location — city, state, and a geospatial location property
(a latitude/longitude point). Connected to Company and Manager nodes via LOCATED_AT
relationships, it adds the geography dimension to the graph and enables distance-based discovery
with point.distance(). Address nodes are themselves an instance of Extract-Enhance-Expand:
extract from CSV, enhance with a geospatial point, expand with LOCATED_AT.
Chunk Metadata
The structured fields stored alongside a chunk’s text that preserve its origin, position, and
identity: section (item), sequence position (chunkSeqId), the parent filing (formId), a
unique key (chunkId), a source link, and entity ids (cik, cusip6). Metadata is what lets
you rebuild document structure and link chunks to graph entities later — chunkSeqId in
particular preserves in-section order so Module 5 can wire chunks into a NEXT linked list. You
cannot reconstruct provenance you didn’t capture at ingest, so design the schema before loading.
Chunk Window
A retrieval pattern that returns a target chunk plus its neighbours in the NEXT linked list,
giving the LLM continuous context instead of one isolated chunk. A window of $2k+1$ chunks
uses $k$ hops in each direction (default $k = 1$ → a 3-chunk window). Built with variable-length
paths (*0..k) so it degrades gracefully at list boundaries. The payoff is concrete: expanded
context surfaces details an adjacent chunk held (e.g. a product name) that the single best-matching
chunk missed — improving completeness with no change to the embedding model.
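The boundary behaviour can be sketched in plain Python. This is an illustration only, assuming the NEXT linked list of one section has already been flattened into an ordered list; `chunk_window` and the sample chunk names are hypothetical, and the course itself does this in Cypher with a variable-length path.

```python
def chunk_window(chunks, index, k=1):
    """Return the target chunk plus up to k neighbours on each side.

    Mirrors the *0..k idea: at a list boundary the missing side
    contributes zero chunks instead of making the lookup fail.
    """
    if not 0 <= index < len(chunks):
        raise IndexError("index out of range")
    start = max(0, index - k)              # clamp at the head of the list
    end = min(len(chunks), index + k + 1)  # clamp at the tail of the list
    return chunks[start:end]


# Hypothetical ordered chunks of one section.
section = ["c0", "c1", "c2", "c3", "c4"]

middle = chunk_window(section, 2)   # full 3-chunk window
first = chunk_window(section, 0)    # no predecessor, so 2 chunks
print(middle)
print(first)
```

With k = 1 the window in the middle of the list holds three chunks, while the window at the first chunk simply shrinks to two.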
Co-Located Vector Store
Storing embeddings as node properties in the same database as the graph (Neo4j) rather than in a separate vector database. The win: one query can combine vector similarity with graph traversal — retrieve a chunk and follow its relationships — in a single round trip, with one system to operate. The trade-off: you depend on Neo4j’s embedding/ANN support, where a best-of-breed vector DB may offer more algorithms and easier model swaps for large standalone vector workloads.
Company Node
A graph node representing a public company, keyed by its cusip6 and created idempotently with
MERGE. It is the hub that joins the two datasets: a FILED edge links it to the Form 10-K it
filed (matched on shared CUSIP), and incoming OWNS_STOCK_IN edges link it to every Manager that
holds its stock. Building Company nodes on the CUSIP key is what lets the independently created
Form 13 and Form 10-K data connect with no manual matching.
Cosine Similarity
A similarity metric measuring the angle between two vectors, ignoring magnitude: $\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, ranging −1 to 1 (Neo4j’s vector-index score normalises it to 0–1). It is OpenAI’s recommended function, and for normalised vectors it ranks results identically to Euclidean distance, so prefer Euclidean only when vector magnitude carries meaning (e.g. raw TF-IDF). The standard choice for embedding search.
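A small sketch of the metric in plain Python, checking that cosine and Euclidean rankings agree once vectors are unit length. The vectors are made-up toy values, and the 0–1 mapping is assumed here to be (1 + cos) / 2; check the Neo4j documentation for the exact scoring your version uses.

```python
import math


def cosine(a, b):
    """Cosine of the angle between a and b (magnitude is ignored)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalised_score(a, b):
    """Map cosine from [-1, 1] to [0, 1]; assumed formula (1 + cos) / 2."""
    return (1 + cosine(a, b)) / 2


def unit(v):
    """Scale v to length 1."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, not real embeddings.
query = unit([1.0, 2.0, 0.5])
docs = {
    "a": unit([1.0, 2.1, 0.4]),
    "b": unit([-1.0, 0.3, 2.0]),
    "c": unit([0.2, 1.0, 1.0]),
}

by_cosine = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
by_euclid = sorted(docs, key=lambda k: euclidean(query, docs[k]))
print(by_cosine, by_euclid)
```

For unit vectors the squared Euclidean distance equals 2 − 2·cos θ, which is why the two orderings coincide.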
CREATE
The Cypher clause that unconditionally creates a node or relationship — even if an identical
one already exists. It is fast (no existence check), but dangerous for relationships in
retry-prone pipelines, where re-running the same CREATE silently inserts duplicates. Reserve
it for guaranteed-unique insertions; use MERGE everywhere a query might re-run.
CUSIP
Committee on Uniform Security Identification Procedures — a 9-character code uniquely identifying
a financial instrument; the first 6 characters (CUSIP 6) identify the issuing company. Because
both the Form 10-K and Form 13 datasets carry it, CUSIP 6 is the linking key that joins
independently created Company and Form nodes automatically (WHERE com.cusip6 = f.cusip6). It is
the worked example of cross-dataset linking via a shared universal identifier — the pattern that
fails (needing entity resolution) only when datasets use different identifier schemes.
Custom Retrieval Query
A Cypher query passed to Neo4jVector via the retrieval_query argument that extends the default
vector search. It runs after the index match, receiving the matched node and its score, and
can perform extra graph traversal before returning context to the LLM — for example following NEXT
to build a chunk window. This is the hook that fuses vector search (find the entry point) with
graph traversal (expand context) into a single retrieval step.
Cypher
Neo4j’s declarative, pattern-matching query language. It uses ASCII-art notation that
mirrors a whiteboard diagram — () for nodes, [] for relationships, -> for direction — so
you draw the pattern you want and Cypher returns every matching subgraph. Self-documenting:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie) reads as the graph structure it matches.
Cypher Execution Model
The four stages a Cypher query passes through: parse (string → abstract syntax tree),
plan (the optimiser chooses indexes, scan order, join strategy), execute (traverse the
graph hop by hop, binding variables), and project (the RETURN clause extracts only the
requested properties/aggregations). EXPLAIN shows the plan without running; PROFILE runs it
and reports actual per-step row counts — the tools for tuning a slow query.
db.create.setNodeVectorProperty
Neo4j’s procedure that writes a computed embedding onto a node as a vector property — the
storage step that follows genai.vector.encode. Persisting the vector on the node (rather
than in a separate store) is exactly what makes the data co-located, enabling a single query to
later combine similarity search with graph traversal.
db.index.vector.queryNodes
Neo4j’s procedure for top-K approximate nearest-neighbour search over a vector index. It
takes the index name, a topK count, and a query vector, and YIELDs the matched nodes with a
similarity score. It is the search half of graph-RAG retrieval: encode the question (with
genai.vector.encode), then queryNodes to find the closest stored embeddings.
DELETE
The Cypher clause that removes matched nodes or relationships (returns no data). It cannot delete a node that still has relationships — Neo4j protects referential integrity — so you must delete the relationships first, or use DETACH DELETE to remove the node and its relationships together.
DETACH DELETE
A DELETE variant that removes a node and all of its relationships in a single step,
satisfying referential integrity without manually deleting each connecting relationship first.
It is the safe, idiomatic way to drop a connected node — a plain DELETE on such a node fails.
EDGAR
Electronic Data Gathering, Analysis, and Retrieval — the SEC’s public database for corporate filings, including Form 10-K and Form 13. It is the open access point for the raw documents that seed a financial knowledge graph: every filing is keyed by the company’s CIK (Central Index Key), the identifier that links a company across all of its filings.
Embedding
A dense vector representation of text in high-dimensional space, where semantically similar texts have high cosine similarity. OpenAI’s default model outputs 1536-dimensional vectors. The embedding model is the ceiling of retrieval quality: if it encodes two related concepts as distant vectors, no index tuning recovers the match — which is why domain-specific corpora sometimes need fine-tuned or specialised models.
Embedding Dimensionality
The number of dimensions in an embedding vector (1536 for OpenAI’s default model). Higher dimensionality captures more semantic nuance but costs more storage and search time. The key operational constraint: a vector index’s declared dimension must match the embedding model’s output exactly — a mismatch (index 768 vs vectors 1536) makes a similarity query fail with a dimension-mismatch error.
Extract-Enhance-Expand
A repeatable three-phase pattern for growing a knowledge graph incrementally, applied at every stage of this course. Extract: pull source data into nodes (text → Chunks; CSV → Company / Manager / Address nodes). Enhance: add indexes or computed properties (vector embeddings, full-text indexes, geospatial points). Expand: connect the new nodes into the existing graph with relationships (NEXT, PART_OF, FILED, OWNS_STOCK_IN, LOCATED_AT). Its one assumption — that new data links to existing nodes by a shared identifier — is exactly where it needs an entity-resolution step when identifiers are absent or ambiguous.
Few-Shot Prompt
A prompt containing a handful of worked example pairs that teach an LLM a pattern by demonstration rather than instruction. For Cypher generation, each example pairs a natural-language question with its correct query, plus the injected schema and a “use only these types; do not hallucinate” instruction — the LLM then generalises to unseen questions. Two to three diverse examples (a filter, an aggregation, a traversal) cover far more ground than many similar ones; past that, returns diminish.
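One way to assemble such a prompt, sketched in plain Python. The schema summary, the three example pairs, and the property names (`name`, `city`) are assumptions for illustration, not the course's actual prompt.

```python
# Hypothetical schema summary; labels and relationship types follow the
# glossary, property names are assumed.
SCHEMA = (
    "(:Manager)-[:OWNS_STOCK_IN]->(:Company), "
    "(:Manager)-[:LOCATED_AT]->(:Address), "
    "(:Company)-[:LOCATED_AT]->(:Address)"
)

# Three deliberately different query shapes: filter, aggregation, traversal.
EXAMPLES = [
    ("Which managers are located in San Francisco?",
     "MATCH (m:Manager)-[:LOCATED_AT]->(a:Address) "
     "WHERE a.city = 'San Francisco' RETURN m.name"),
    ("How many managers own stock in each company?",
     "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(c:Company) "
     "RETURN c.name, count(m)"),
    ("Where are the investors in NetApp located?",
     "MATCH (c:Company {name: 'NetApp'})<-[:OWNS_STOCK_IN]-(m:Manager)"
     "-[:LOCATED_AT]->(a:Address) RETURN m.name, a.city"),
]


def build_prompt(question, schema=SCHEMA, examples=EXAMPLES):
    """Combine instruction, schema, worked examples, and the new question."""
    lines = [
        "Generate a Cypher query for the question below.",
        "Use only the node labels and relationship types in the schema; "
        "do not invent others.",
        f"Schema: {schema}",
        "",
    ]
    for example_question, cypher in examples:
        lines += [f"Question: {example_question}", f"Cypher: {cypher}", ""]
    lines += [f"Question: {question}", "Cypher:"]
    return "\n".join(lines)


prompt = build_prompt("Which companies are located in Santa Clara?")
print(prompt)
```

Ending the prompt on a bare `Cypher:` line invites the model to complete the pattern the examples established.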
FILED Relationship
A directed relationship from a Company to a Form (Company -[:FILED]-> Form) recording that the
company filed that SEC document. It is created by matching Company and Form nodes on their shared
cusip6 — the concrete realisation of cross-dataset linking. Traversed backward from a form
(Form <-[:FILED]- Company), it is the bridge hop that connects document content to the company
and, onward, to that company’s investors.
Form Node
A graph node representing an SEC filing document, carrying the metadata its chunks share —
formId, source (URL back to the filing), cik, and cusip6. It is the parent that chunks
connect up to via PART_OF, and the entry point that links down to the first chunk of each
section via SECTION. Its cusip6 property is the bridge to Module 6, where Company nodes join
forms through the same CUSIP identifier.
Full-Text Index
A Neo4j index optimised for string matching — exact, partial, and fuzzy keyword search with
relevance scoring — created with CREATE FULLTEXT INDEX ... FOR (n:Label) ON EACH [n.prop] and
queried via db.index.fulltext.queryNodes. It complements the other two paradigms: a vector
index matches meaning (semantic similarity), a property index does exact lookup by a
known id, and full-text matches spelling. Having all three in one engine means keyword and
semantic retrieval need no separate search infrastructure.
genai.vector.encode
Neo4j’s built-in function that calls an external embedding API (e.g. OpenAI) from inside
Cypher: it takes the text, a provider name ("OpenAI"), and a config map (with the API
token passed as a query parameter), and returns a float-array embedding. Paired with
db.create.setNodeVectorProperty to store the vector on the node — generating and persisting
embeddings without leaving the database.
Geospatial Point
A Neo4j data type representing a location on Earth as latitude/longitude, stored with point({ latitude: x, longitude: y }) and held on a node property (e.g. Address.location). It is the
input to point.distance() for proximity queries, and a point index on the property keeps
range searches performant. A common bug is reversing latitude and longitude — verify coordinate
order against the source data.
Graph Traversal
Retrieval by following typed relationships from a starting node to connected nodes — the graph analogue of adjacency. It answers “what is connected to this?”, surfacing connection-based context (a company’s investors, a document’s sections) that similarity search misses. Cost is roughly $O(d^k)$ in the local fan-out ($d$ = avg connections, $k$ = hops), versus self-JOINs over a whole table in SQL.
Graph-to-Text Context Generation
Converting structured graph-traversal results into natural-language sentences an LLM can read as context — turning a relationship triple (subject–predicate–object) into a human-readable statement like “Royal Bank of Canada owns 1.2M shares of NetApp”. It is how graph data enters a RAG prompt: the traversal supplies precise, connected facts, and the sentence form makes them digestible to the model. Module 7 inverts the direction — the LLM generates the query instead of consuming the result.
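A minimal sketch of the conversion, assuming the traversal has already returned rows as Python dictionaries. The row keys, the second manager, and the share counts are made up for illustration.

```python
def holding_to_sentence(row):
    """Turn one (manager, OWNS_STOCK_IN, company) row into a sentence."""
    return f"{row['manager']} owns {row['shares']:,} shares of {row['company']}."


# Hypothetical rows, shaped like the output of a traversal query.
rows = [
    {"manager": "Royal Bank of Canada", "shares": 1_200_000, "company": "NetApp"},
    {"manager": "Example Capital LLC", "shares": 45_000, "company": "NetApp"},
]

# One sentence per row, joined into a context block for the prompt.
context = "\n".join(holding_to_sentence(row) for row in rows)
print(context)
```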
GraphCypherQAChain
LangChain’s chain that turns a natural-language question into a graph answer end to end: the LLM generates Cypher (guided by the injected schema and few-shot examples), Neo4j executes it, and the LLM synthesises the rows into a natural-language reply — question → Cypher → execute → answer. It needs the schema and a good few-shot prompt for reliable generation, and its main risk is schema hallucination (invented relationship types), mitigated by schema injection and validating the generated query before running it.
Hallucination
When an LLM produces plausible but factually wrong output. In RAG it has two distinct failure modes: (1) invention — the model fabricates facts absent from the retrieved context; and (2) mis-attribution — it applies entity A’s retrieved context to entity B (e.g. answering a question about Apple using NetApp’s chunks, because those were the nearest vectors available). Mitigations layer: prompt instructions (“use ONLY this context”; “say you don’t know”), scoping the context to named sources, and retrieval-score gating so out-of-scope questions return no context. Prompt engineering helps but is not foolproof — pair it with a score threshold.
Hybrid Search
Combining vector similarity with another signal — full-text/keyword filtering, metadata pre-filters, or a cross-encoder re-rank — to raise precision without changing the embedding model. It is the standard fix for “semantic near-miss” false positives (results that are topically close but not what was asked). Smaller, more granular chunks help similarly by making each embedding more specific.
Idempotent Pipeline
A graph-construction pipeline that produces the same graph no matter how many times it
runs — the property retry-prone ingestion needs, since data arrives from unreliable sources
(API retries, replays, duplicate messages). Achieved with a uniqueness constraint + MERGE on
a stable identifier + ON CREATE SET/ON MATCH SET, so repeated deliveries have an exactly-once
effect on the graph. The production standard for all graph construction.
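The idea can be simulated without a database. In this sketch a dictionary keyed by a stable identifier stands in for the uniqueness constraint plus MERGE; `merge_node`, `ingest`, and the sample records are hypothetical, and a real pipeline would issue Cypher instead.

```python
def merge_node(store, key, on_create=None, on_match=None):
    """Create the node if the key is new, otherwise update it in place."""
    if key in store:
        store[key].update(on_match or {})             # like ON MATCH SET
    else:
        store[key] = {"id": key, **(on_create or {})}  # like ON CREATE SET
    return store[key]


def ingest(store, records):
    """Load records keyed by a stable identifier (here a made-up cusip6)."""
    for record in records:
        merge_node(
            store,
            record["cusip6"],
            on_create={"name": record["name"]},
            on_match={"name": record["name"]},
        )


# Made-up records, including a duplicate delivery of the first one.
records = [
    {"cusip6": "AAA111", "name": "Example Corp"},
    {"cusip6": "BBB222", "name": "Sample Inc"},
    {"cusip6": "AAA111", "name": "Example Corp"},
]

graph = {}
ingest(graph, records)
after_first_run = {key: dict(node) for key, node in graph.items()}
ingest(graph, records)  # a retry or replay of the same batch
print(len(graph), graph == after_first_run)
```

Keying on a volatile value instead (for example a load timestamp) would defeat this, because every run would produce a new key.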
Identity Through Relationships
The design pattern where a node’s role is encoded by the typed relationships it participates
in, not by extra labels: a Person is an actor because it has an ACTED_IN relationship,
a director because it has DIRECTED. It keeps the schema clean and avoids “label explosion.”
Where it breaks: when a role needs role-specific properties (an actor’s per-film salary),
which then have to live on the relationship and can get unwieldy at scale.
Knowledge Graph
Data stored as interconnected entities — nodes joined by typed, directed relationships, both carrying properties. The “diagram is the data.” Its value for RAG is that retrieval can traverse connections rather than rely on embedding distance alone, surfacing related entities (investors, org charts, filings) that vector similarity cannot reach.
Label
A tag applied to a node that groups it with similar entities — Person, Company,
Movie. A node can have several labels, and labels enable efficient filtering in queries
(MATCH (p:Person)). Distinct from a relationship type: labels classify nodes, types
classify connections. Over-using labels for roles is an anti-pattern — prefer
identity through relationships.
LOCATED_AT Relationship
A directed relationship from a Company or Manager to an Address node, connecting an entity to its
physical location. It is the hop that makes geographic questions answerable: traverse LOCATED_AT to
reach an entity’s coordinates, then compare with point.distance() to find what’s nearby. The
Expand step of the Module 7 Extract-Enhance-Expand cycle.
Manager Node
A graph node representing an institutional investment-management firm, keyed by its managerCik
and carrying name and address. It connects to Company nodes via OWNS_STOCK_IN relationships that
record share count, value, and reporting quarter. Manager names are also indexed with a full-text
index, so a user can find a firm by string (“Royal Bank”) even without its CIK.
MATCH Clause
Cypher’s primary read clause: it specifies a graph pattern in ASCII-art notation, and the
engine finds every matching subgraph and binds variables to the matched elements. Filter inline
by label ((m:Movie)) or property ({name: "Tom"}), or add richer conditions with a
WHERE clause. MATCH is the start of nearly every query.
MERGE
The Cypher clause that creates a node or relationship only if it does not already exist
(MATCH-then-create in one atomic step). It is the idempotent default for graph
construction: re-processing the same data leaves the graph unchanged. Pair it with a uniqueness
constraint and ON CREATE SET / ON MATCH SET for initial-vs-updated properties — and always
MERGE on a stable identifier, since a volatile property (a changing timestamp) makes MERGE
create a fresh element each run.
Multi-Hop Pattern
A Cypher pattern that chains several relationships to express an indirect connection — e.g.
(a)-[:X]->(b)-[:Y]->(c). Matching subgraphs across multiple hops is the traversal a graph does
cheaply and a relational database does expensively (one self-JOIN per hop). Multi-hop patterns
are how graph-RAG reaches context several relationships away from the initial match.
Multi-Hop Traversal
Following a chain of two or more relationships to reach indirect connections — co-actors,
co-directors, friends-of-friends. The shared middle node (a)-[:R]->(b)<-[:R]-(c) is the
graph’s equivalent of a SQL JOIN, written as a visual pattern. Cheap where SQL needs nested
self-JOINs, but it explodes combinatorially on high-degree nodes (a 50-actor movie yields
50 × 49 = 2,450 ordered co-actor pairs), so bound it with LIMIT or filters.
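The blow-up is easy to check with a quick count. The sketch below assumes one relationship per actor to the shared movie, so the shared-middle-node pattern binds every pair of distinct actors in both orders; the actor names are placeholders.

```python
from itertools import combinations, permutations

# Placeholder cast of a single 50-actor movie.
cast = [f"actor_{i}" for i in range(50)]

# (a)-[:R]->(b)<-[:R]-(c) matches each pair twice: (a, c) and (c, a).
ordered_pairs = sum(1 for _ in permutations(cast, 2))

# Deduplicated pairs, e.g. after filtering on an ordering of the two names.
unordered_pairs = sum(1 for _ in combinations(cast, 2))

print(ordered_pairs, unordered_pairs)  # 2450 1225
```

The count grows quadratically with the degree of the middle node, and each additional hop multiplies it again.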
Neo4j Graph (LangChain)
LangChain’s wrapper class for Neo4j connections. Constructed with a url, username, and
password (loaded from environment variables, not hard-coded), it exposes a query() method
that sends a Cypher string to the database and returns results as a list of Python
dictionaries, handling connection pooling and authentication. It is the bridge between
application code and the graph used throughout the course.
Neo4jVector
LangChain’s vector-store interface for Neo4j — it makes the graph look like a standard vector
store, running Cypher similarity search internally. from_existing_graph connects to an
already-built index by naming the index_name, the node_label (e.g. Chunk), the text
property (text_node_properties), and the embedding property (embedding_node_property); calling
.as_retriever() yields a retriever for a RAG chain. Currency: it now lives in the dedicated
langchain-neo4j package (2024-11), not langchain_community.
NEXT Relationship
A directed relationship connecting sequential chunks within the same form section, forming a
singly-linked list that preserves document order. Built by matching chunks whose chunkSeqId
differ by one and that share formId and item — the section filter is mandatory, or the list
bleeds across section boundaries. Created with MERGE for idempotency. NEXT is what makes
sequential traversal and chunk-window retrieval possible; the same pattern models chat threads,
audit trails, and event streams.
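The pairing logic can be sketched in plain Python: group chunks by form and section, sort by sequence id, and link neighbours whose ids differ by one. The `chunkId` values below are hypothetical, and the course builds these links in Cypher rather than in application code.

```python
from collections import defaultdict


def next_pairs(chunks):
    """Return (from_chunkId, to_chunkId) pairs for NEXT relationships."""
    sections = defaultdict(list)
    for chunk in chunks:
        # The section filter: only chunks sharing formId AND item are linked.
        sections[(chunk["formId"], chunk["item"])].append(chunk)

    pairs = []
    for members in sections.values():
        members.sort(key=lambda c: c["chunkSeqId"])
        for current, following in zip(members, members[1:]):
            if following["chunkSeqId"] - current["chunkSeqId"] == 1:
                pairs.append((current["chunkId"], following["chunkId"]))
    return pairs


# Made-up chunks from two sections of one form.
chunks = [
    {"chunkId": "A-item1-0", "formId": "A", "item": "item1", "chunkSeqId": 0},
    {"chunkId": "A-item1-1", "formId": "A", "item": "item1", "chunkSeqId": 1},
    {"chunkId": "A-item1-2", "formId": "A", "item": "item1", "chunkSeqId": 2},
    {"chunkId": "A-item1a-0", "formId": "A", "item": "item1a", "chunkSeqId": 0},
    {"chunkId": "A-item1a-1", "formId": "A", "item": "item1a", "chunkSeqId": 1},
]

pairs = next_pairs(chunks)
print(pairs)
```

Grouping on formId alone would link the last chunk of item1 to a chunk of item1a whenever their sequence ids happened to line up, which is the cross-section bleed the section filter prevents.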
Node
A data record representing an entity in a knowledge graph — written (...) in Cypher. It
carries one or more labels (for grouping) and zero or more
properties (key-value pairs). Equivalent to a graph-theory vertex, but
“node” is preferred for data modelling. A Person node, a Company node, a Movie node.
OWNS_STOCK_IN Relationship
A directed relationship from a Manager to a Company representing a stock holding, with properties
for share count, monetary value, and reporting quarter. One manager can hold multiple
OWNS_STOCK_IN edges to the same company across different quarters, so the reporting period is
part of the relationship’s identity (it is MERGE’d on reportCalendarOrQuarter). In the course
data, 561 of these edges all point at NetApp — the investment side of the multi-hop chunk → form →
company → manager traversal.
PART_OF Relationship
A directed relationship from a Chunk to its parent Form (Chunk -[:PART_OF]-> Form), establishing
document membership. It lets you traverse up from any chunk back to its source filing — and from
there onward to connected entities (companies, investors) in later modules. The general pattern
(member -[:PART_OF]-> container) generalises to any document hierarchy: chunk → section →
chapter → book.
point.distance()
A Neo4j function that returns the geodesic (great-circle) distance between two geospatial points,
in metres. Used in a WHERE clause for radius queries — WHERE point.distance(a.location, b.location) < 10000 finds entities within 10 km — and divided by 1000 for kilometres. The unit is
the classic trap: 10 km is 10000, not 10; 50 miles is 80467. Pair it with a point index, or the
query degrades to an O(n) scan of every candidate node.
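To make the units concrete, here is a plain-Python stand-in using the haversine formula on a spherical Earth. It is an approximation for illustration, not Neo4j's implementation; the radius constant, the helper names, and the coordinates are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius in metres (assumed constant)
METRES_PER_MILE = 1609.344


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def within_radius(origin, candidates, radius_m):
    """Names of candidates closer than radius_m (metres) to origin."""
    return [
        name
        for name, (lat, lon) in candidates.items()
        if haversine_m(origin[0], origin[1], lat, lon) < radius_m
    ]


# Made-up coordinates as (latitude, longitude).
origin = (0.0, 0.0)
candidates = {"near": (0.0, 0.05), "far": (0.0, 0.2)}

nearby = within_radius(origin, candidates, 10_000)   # 10 km, in metres
fifty_miles_m = round(50 * METRES_PER_MILE)
print(nearby, fifty_miles_m)
```

Passing 10 instead of 10_000 would silently search a 10-metre radius, which is the unit trap described above.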
Progressive Few-Shot Learning
The practice of adding few-shot examples incrementally, where each example unlocks a new query
capability the LLM generalises from — a city-filter example enables filtering for any entity, a
point.distance() example enables geospatial search, a full-text-plus-SECTION example enables
document navigation. The guiding principle is diversity over quantity: two or three examples
spanning different query shapes (filter, aggregate, traverse) outperform ten variations of one.
Property
A key-value pair stored on a node or a relationship (keys are strings; values are strings,
numbers, booleans, temporal values, spatial points, or lists of these), e.g. {name: "Andreas", born: 1975}. Crucially, relationships
carry properties too: a since year on a WORKS_AT describes the connection itself, which is
part of why a relationship is more than a graph-theory edge.
Property Graph
The data model Neo4j implements: nodes and relationships that both carry properties, with labels grouping nodes and types+direction on relationships carrying semantics. It contrasts with the RDF triple model (subject–predicate–object); the property-graph model is what makes a relationship a rich record rather than a bare link, and underlies every example in this guide.
Query Parameter
A value passed to a Cypher query via the params argument and referenced with a $ prefix
(e.g. $openAiApiKey). It keeps secrets and user input out of the query string — no
injection, no leakage into logs or the query cache — and lets Neo4j cache the execution
plan across calls with different values. The one exception: DDL statements (CREATE INDEX,
CREATE CONSTRAINT) can’t be parameterised, so sanitise those inputs manually.
Relationship
A directed, typed connection between two nodes — written -[:TYPE]-> in Cypher. It is
itself a rich record: a start and end node, a type (ACTED_IN, OWNS_STOCK_IN), a
direction, and optional properties. Preferred over the graph-theory term
“edge” because it conveys richness beyond a bare link — type and direction are the semantics.
Retrieval-Augmented Generation
A system architecture that retrieves relevant context from an external store and injects it into the LLM prompt before generation, grounding the response in evidence and reducing hallucination. Basic RAG retrieves by vector similarity; graph-RAG adds traversal-based retrieval, so context can follow connections, not just semantic closeness.
RetrievalQA
LangChain’s question-answering chain that wires a retriever (vector search) to an LLM: the
retriever finds relevant chunks, and the "stuff" chain type injects them into the prompt as
context for the LLM to answer from. A custom PromptTemplate controls behaviour — crucially, the
refusal instruction (“say you don’t know”) that curbs hallucination. Currency: RetrievalQA is
deprecated since LangChain 0.1.17 (removal repeatedly deferred; still importable in the 1.x line); the current
replacement is create_retrieval_chain with create_stuff_documents_chain. The connect →
retrieve → generate concept is unchanged.
RETURN Clause
The Cypher clause that projects results: only the properties and aggregations it names are
extracted and returned (RETURN p.name, count(m)). It is the final stage of execution, and
returning whole nodes (RETURN p) versus specific properties (RETURN p.name) is a
bandwidth/clarity choice — project just what the caller needs.
Schema-Optional
A knowledge graph’s property of allowing new node labels and relationship types without DDL
migrations — flexible where a relational schema is fixed and requires ALTER TABLE. The
caveat: optional, not free. Without naming standards and uniqueness constraints the graph
becomes a tangled mess, so production knowledge graphs still need governance even without an
enforced schema.
Score Threshold
A minimum similarity score below which results are discarded, so weak context never reaches the LLM (where it would invite hallucination). There is no universal value — calibrate against the corpus’s own score distribution: raise it for precision (customer-facing), lower it for recall (research). Because scores are relative, a threshold tuned on one corpus rarely transfers to another.
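A minimal gating sketch, assuming retrieval returns (text, score) pairs. The threshold value and the sample scores are made up; a real value should come from the corpus's own score distribution.

```python
def gate(results, threshold):
    """Keep only results whose similarity score meets the threshold."""
    return [(text, score) for text, score in results if score >= threshold]


# Hypothetical retrieval results for an in-scope and an out-of-scope question.
in_scope = [("chunk about storage products", 0.93), ("chunk about risk", 0.81)]
out_of_scope = [("loosely related chunk", 0.62), ("unrelated chunk", 0.58)]

THRESHOLD = 0.80  # assumed value for this toy example

kept = gate(in_scope, THRESHOLD)
rejected = gate(out_of_scope, THRESHOLD)

# An empty context is the signal to answer "I don't know" rather than guess.
print(len(kept), len(rejected))
```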
SEC Form 10-K
The annual report public companies file with the U.S. Securities and Exchange Commission — long,
standardised business text that makes excellent knowledge-graph source data. The sections that
matter for construction are Item 1 (business description), Item 1A (risk factors),
Item 7 (management discussion & analysis), and Item 7A (market-risk disclosures). Raw
filings are XML; pre-processing extracts these sections plus the CIK and CUSIP identifiers
into JSON for ingestion.
SEC Form 13
A filing made quarterly by institutional investment-management firms reporting their holdings of public-company stock. Each record carries the manager’s identity (name, CIK, address), the company’s identity (via CUSIP), and the position (share count, monetary value, reporting quarter). In this guide, Form 13 is the second dataset: connecting it to the existing Form 10-K graph turns documents into a graph of who invests in whom, answering questions vector search alone cannot.
SECTION Relationship
A directed relationship from a Form to the first chunk (sequence id 0) of each section,
carrying an f10kItem property that names the section (e.g. "item1"). It gives the form indexed
entry points into the document: a query can jump straight to where a section starts, then follow
NEXT through its content — hierarchical navigation without scanning the whole linked list.
Semantic Search
Retrieval by meaning rather than exact words: encode the query and the corpus into the same embedding space and rank by similarity, so “movies about love” matches a tagline that never contains the word “love.” It is the capability a vector index adds to the graph — and it complements (does not replace) graph traversal, which finds connected entities rather than similar text.
Similarity Score
The number vector search returns for each result, used to rank results and to filter weak ones. For Neo4j’s normalised cosine score it runs from 0 (opposite direction) to 1 (identical direction). Crucially, scores are relative to the embedding model and corpus: 0.85 on movie taglines is not the same quality as 0.85 on SEC filings. Read the distribution (a clear winner vs a flat band) rather than trusting an absolute value.
Text Chunking
Splitting a long document into smaller, overlapping segments so each fits an embedding model and
maps to one retrievable unit. Chunk size sets granularity (too large → generic embeddings,
weak precision; too small → fragments that lose context); overlap carries context across
boundaries so a fact split between two chunks survives in at least one. The number of chunks is
roughly $L / (s - o)$ for length $L$, size $s$, overlap $o$; the effective advance
is size − overlap. The size is a tuning parameter; the splitting strategy (fixed-size vs
structure-aware/recursive) is a design decision driven by the document’s shape.
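A fixed-size character chunker makes the arithmetic concrete. This is a sketch under simple assumptions (character-based splitting, no structure awareness); `chunk_text` and `expected_chunks` are hypothetical helpers, and the exact count shown applies to this particular chunker.

```python
import math


def chunk_text(text, size, overlap):
    """Split text into fixed-size chunks that overlap by `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # the effective advance per chunk
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def expected_chunks(length, size, overlap):
    """Exact chunk count for chunk_text on a text of the given length."""
    if length == 0:
        return 0
    return max(1, math.ceil((length - overlap) / (size - overlap)))


text = "x" * 1000
chunks = chunk_text(text, size=200, overlap=50)
print(len(chunks), expected_chunks(1000, 200, 50))  # 7 7
```

Here the advance is 200 − 50 = 150 characters, so 1000 characters need 7 chunks, close to the rough estimate 1000 / 150 ≈ 6.7.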
Uniqueness Constraint
A Neo4j constraint — CREATE CONSTRAINT ... FOR (p:Post) REQUIRE p.postId IS UNIQUE — that
guarantees at most one node per key value and creates an implicit index in the process. It
underpins idempotent, MERGE-based ingestion: correctness (no duplicates) and performance
(the index speeds the MERGE existence check). Part of the production idempotent-ingest pattern.
Variable Naming (Cypher)
The practice of naming Cypher pattern variables for what they hold — tom, coActor, m
for Movie — rather than n1/n2. It has zero runtime cost (the engine ignores the names)
but a large readability payoff: a multi-hop pattern with meaningful names is self-documenting,
while cryptic names force the next reader to decode it. A maintainability investment, part of
“design readable queries” (KGR-2.5).
Variable-Length Path
A Cypher pattern that matches a range of relationship hops rather than a fixed number: *0..1
matches zero or one hop, *1..3 matches one to three, *0..2 zero to two. It is the standard fix
for boundary conditions where a fixed-length pattern returns nothing — e.g. a chunk window at the
first or last chunk of a linked list, where the missing side simply matches zero hops.
Pairing it with ORDER BY length(window) DESC LIMIT 1 keeps the longest window available at any
position.
Vector Index
A Neo4j index for approximate nearest-neighbour search over a node property that holds
embeddings, created with a fixed dimensionality and similarity function (CREATE VECTOR INDEX … OPTIONS {indexConfig: {…}}). Two hard requirements: the declared dimension must match the
embedding model’s output exactly, and the index must reach ONLINE status (not
POPULATING) before searches return complete results.
Vector Similarity Search
Retrieval that encodes text as high-dimensional embeddings and finds nearest neighbours by a distance metric (cosine). Fast and effective for semantic matching — “what text is similar to my question?” — but it cannot follow structural relationships between entities. That gap is exactly what graph traversal fills in a graph-RAG system.
WHERE Clause
Filters matched patterns by conditions beyond inline equality — comparisons (>, <,
=), boolean logic (AND/OR/NOT), string matching, and regular expressions. Rule of thumb:
use inline {...} property matching for simple equality, and WHERE for ranges or complex
logic. MATCH (m:Movie) WHERE m.released > 2000 is the canonical example.