Glossary

69 terms.

Address Node (Address node)

A graph node representing a physical location — city, state, and a geospatial location property (a latitude/longitude point). Connected to Company and Manager nodes via LOCATED_AT relationships, it adds the geography dimension to the graph and enables distance-based discovery with point.distance(). Address nodes are themselves an instance of Extract-Enhance-Expand: extract from CSV, enhance with a geospatial point, expand with LOCATED_AT.

See also: LOCATED_AT Relationship, Geospatial Point, point.distance()

Chunk Metadata (chunk record)

The structured fields stored alongside a chunk’s text that preserve its origin, position, and identity: section (item), sequence position (chunkSeqId), the parent filing (formId), a unique key (chunkId), a source link, and entity ids (cik, cusip6). Metadata is what lets you rebuild document structure and link chunks to graph entities later — chunkSeqId in particular preserves in-section order so Module 5 can wire chunks into a NEXT linked list. You cannot reconstruct provenance you didn’t capture at ingest, so design the schema before loading.

See also: Text Chunking, Idempotent Pipeline

Chunk Window (chunk window, context window expansion)

A retrieval pattern that returns a target chunk plus its neighbours in the NEXT linked list, giving the LLM continuous context instead of one isolated chunk. A window of $w = 2k+1$ chunks uses $k$ hops in each direction (default $k=1$ → a 3-chunk window). Built with variable-length paths (*0..k) so it degrades gracefully at list boundaries. The payoff is concrete: expanded context surfaces details an adjacent chunk held (e.g. a product name) that the single best-matching chunk missed — improving completeness with no change to the embedding model.
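
The window arithmetic can be sketched in plain Python, with an ordered list standing in for one section's NEXT linked list. This is an illustration under that assumption, not the course's retrieval code; `chunk_window` and the sample chunk ids are hypothetical names.

```python
def chunk_window(chunks, index, k=1):
    """Return the target chunk plus up to k neighbours on each side.

    `chunks` is an ordered list standing in for one section's NEXT list.
    At a list boundary the missing side contributes nothing, mirroring
    how a *0..k path simply matches zero hops there.
    """
    if not 0 <= index < len(chunks):
        raise IndexError("index outside the chunk list")
    start = max(0, index - k)
    end = min(len(chunks), index + k + 1)
    return chunks[start:end]


section = ["c0", "c1", "c2", "c3", "c4"]  # hypothetical chunk ids
middle = chunk_window(section, 2)         # full window: 2k+1 = 3 chunks
first = chunk_window(section, 0)          # boundary: only 2 chunks exist
```

At the first chunk the window shrinks to two chunks instead of failing, which is the graceful degradation the definition describes.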

See also: NEXT Relationship, Variable-Length Path, Custom Retrieval Query

Co-Located Vector Store (Neo4j as vector store, co-location)

Storing embeddings as node properties in the same database as the graph (Neo4j) rather than in a separate vector database. The win: one query can combine vector similarity with graph traversal — retrieve a chunk and follow its relationships — in a single round trip, with one system to operate. The trade-off: you depend on Neo4j’s embedding/ANN support, where a best-of-breed vector DB may offer more algorithms and easier model swaps for large standalone vector workloads.

See also: Vector Index, Hybrid Search

Company Node (Company node)

A graph node representing a public company, keyed by its cusip6 and created idempotently with MERGE. It is the hub that joins the two datasets: a FILED edge links it to the Form 10-K it filed (matched on shared CUSIP), and incoming OWNS_STOCK_IN edges link it to every Manager that holds its stock. Building Company nodes on the CUSIP key is what lets the independently created Form 13 and Form 10-K data connect with no manual matching.

See also: CUSIP, FILED Relationship, OWNS_STOCK_IN Relationship

Cosine Similarity (cosine)

A similarity metric measuring the angle between two vectors, ignoring magnitude: $\cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$, ranging −1 to 1 (Neo4j’s vector-index score normalises it to 0–1). It is OpenAI’s recommended function, and for normalised vectors it ranks results identically to Euclidean distance, so prefer Euclidean only when vector magnitude carries meaning (e.g. raw TF-IDF). Cosine is the standard choice for embedding search.
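
A minimal sketch of the formula in plain Python, for intuition only. The `(1 + cos) / 2` mapping onto the 0 to 1 range is an assumption about how the normalised score is obtained; check the Neo4j documentation for your version before relying on it.

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalised_score(a, b):
    """Map cosine from [-1, 1] onto [0, 1] (assumed mapping: (1 + cos) / 2)."""
    return (1.0 + cosine(a, b)) / 2.0


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # magnitude is ignored
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
opposite = normalised_score([1.0, 0.0], [-1.0, 0.0])
```

Doubling a vector's length leaves its cosine with any other vector unchanged, which is why magnitude-free ranking suits embeddings.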

See also: Similarity Score, Embedding

CREATE (CREATE clause)

The Cypher clause that unconditionally creates a node or relationship — even if an identical one already exists. It is fast (no existence check), but dangerous for relationships in retry-prone pipelines, where re-running the same CREATE silently inserts duplicates. Reserve it for guaranteed-unique insertions; use MERGE everywhere a query might re-run.

See also: MERGE, DELETE

CUSIP (CUSIP 6, cusip6)

Committee on Uniform Security Identification Procedures — a 9-character code uniquely identifying a financial instrument; the first 6 characters (CUSIP 6) identify the issuing company. Because both the Form 10-K and Form 13 datasets carry it, CUSIP 6 is the linking key that joins independently created Company and Form nodes automatically (WHERE com.cusip6 = f.cusip6). It is the worked example of cross-dataset linking via a shared universal identifier — the pattern that fails (needing entity resolution) only when datasets use different identifier schemes.
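
The linking step can be sketched as an in-memory join. The records below are hypothetical placeholders, not real CUSIPs or course data, and `cusip6` is an illustrative helper name.

```python
def cusip6(cusip):
    """Return the 6-character issuer prefix of a 9-character CUSIP."""
    if len(cusip) != 9:
        raise ValueError("expected a 9-character CUSIP")
    return cusip[:6]


# Hypothetical records from two independently built datasets.
companies = [{"name": "Example Corp", "cusip": "12345A108"}]
forms = [
    {"formId": "form-001", "cusip6": "12345A"},
    {"formId": "form-002", "cusip6": "99999Z"},
]

# Join on the shared key: the in-memory analogue of
# WHERE com.cusip6 = f.cusip6.
filed = [
    (c["name"], f["formId"])
    for c in companies
    for f in forms
    if cusip6(c["cusip"]) == f["cusip6"]
]
```

Only the form carrying the same six-character prefix is linked; no name matching or manual mapping is involved.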

See also: Company Node, FILED Relationship, SEC Form 10-K

Custom Retrieval Query (retrieval_query)

A Cypher query passed to Neo4jVector via the retrieval_query argument that extends the default vector search. It runs after the index match, receiving the matched node and its score, and can perform extra graph traversal before returning context to the LLM — for example following NEXT to build a chunk window. This is the hook that fuses vector search (find the entry point) with graph traversal (expand context) into a single retrieval step.
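
A hedged sketch of what such a query might look like, built as a Python string. It assumes Chunk nodes with text and source properties, and that the query must return text, score, and metadata columns (the contract Neo4jVector is understood to expect); `build_window_retrieval_query` is a hypothetical helper, not part of LangChain.

```python
def build_window_retrieval_query(k=1):
    """Return a Cypher retrieval query expanding a match by up to k NEXT hops.

    The hop bound is formatted into the text because a variable-length
    range is assumed not to accept a query parameter, so it is validated
    as a non-negative int first.
    """
    if not isinstance(k, int) or k < 0:
        raise ValueError("k must be a non-negative integer")
    # `node` and `score` are supplied by the vector search that runs first.
    # Note: LIMIT 1 applies to the whole result, so this returns one window.
    return f"""
MATCH window = (:Chunk)-[:NEXT*0..{k}]->(node)-[:NEXT*0..{k}]->(:Chunk)
WITH node, score, window
ORDER BY length(window) DESC LIMIT 1
WITH node, score, [c IN nodes(window) | c.text] AS texts
RETURN reduce(acc = '', t IN texts | acc + t + '\\n') AS text,
       score,
       node {{.source}} AS metadata
"""


window_query = build_window_retrieval_query(k=1)
```

The resulting string would be passed as the retrieval_query argument; validating `k` before formatting keeps the one unparameterised value from becoming an injection point.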

See also: Chunk Window, Neo4jVector, Vector Similarity Search

Cypher (Cypher query language)

Neo4j’s declarative, pattern-matching query language. It uses ASCII-art notation that mirrors a whiteboard diagram — () for nodes, [] for relationships, -> for direction — so you draw the pattern you want and Cypher returns every matching subgraph. Self-documenting: MATCH (p:Person)-[:ACTED_IN]->(m:Movie) reads as the graph structure it matches.

See also: Knowledge Graph, Multi-Hop Pattern

Cypher Execution Model (query execution, EXPLAIN PROFILE)

The four stages a Cypher query passes through: parse (string → abstract syntax tree), plan (the optimiser chooses indexes, scan order, join strategy), execute (traverse the graph hop by hop, binding variables), and project (the RETURN clause extracts only the requested properties/aggregations). EXPLAIN shows the plan without running; PROFILE runs it and reports actual per-step row counts — the tools for tuning a slow query.

See also: MATCH Clause, RETURN Clause

db.create.setNodeVectorProperty (setNodeVectorProperty)

Neo4j’s procedure that writes a computed embedding onto a node as a vector property — the storage step that follows genai.vector.encode. Persisting the vector on the node (rather than in a separate store) is exactly what makes the data co-located, enabling a single query to later combine similarity search with graph traversal.

See also: genai.vector.encode, Co-Located Vector Store

db.index.vector.queryNodes (vector queryNodes, queryNodes)

Neo4j’s procedure for top-K approximate nearest-neighbour search over a vector index. It takes the index name, a topK count, and a query vector, and YIELDs the matched nodes with a similarity score. It is the search half of graph-RAG retrieval: encode the question (with genai.vector.encode), then queryNodes to find the closest stored embeddings.
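
The encode-then-search step can be sketched as a query plus a parameter map. The helper name, the `node.text` property, and the parameter names are assumptions for illustration; only the two procedure and function calls come from the definitions in this glossary.

```python
def build_vector_search(question, api_key, index_name, top_k=5):
    """Return (cypher, params) for an encode-then-search query.

    Every value travels as a query parameter, so neither the question
    nor the API key is spliced into the Cypher text.
    """
    cypher = """
WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey})
     AS question_embedding
CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
YIELD node, score
RETURN score, node.text AS text
"""
    params = {
        "question": question,
        "openAiApiKey": api_key,
        "index_name": index_name,
        "top_k": top_k,
    }
    return cypher, params


cypher, params = build_vector_search(
    "What does the company sell?", "sk-placeholder", "example_chunk_index"
)
```

With a LangChain Neo4jGraph instance, the pair would typically be run as `kg.query(cypher, params=params)`.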

See also: Vector Index, Similarity Score

DELETE (DELETE clause)

The Cypher clause that removes matched nodes or relationships (returns no data). It cannot delete a node that still has relationships — Neo4j protects referential integrity — so you must delete the relationships first, or use DETACH DELETE to remove the node and its relationships together.

See also: DETACH DELETE, MERGE

DETACH DELETE (detach delete)

A DELETE variant that removes a node and all of its relationships in a single step, satisfying referential integrity without manually deleting each connecting relationship first. It is the safe, idiomatic way to drop a connected node — a plain DELETE on such a node fails.

See also: DELETE

EDGAR (SEC EDGAR)

Electronic Data Gathering, Analysis, and Retrieval — the SEC’s public database for corporate filings, including Form 10-K and Form 13. It is the open access point for the raw documents that seed a financial knowledge graph: every filing is keyed by the company’s CIK (Central Index Key), the identifier that links a company across all of its filings.

See also: SEC Form 10-K

Embedding (text embedding, embedding vector)

A dense vector representation of text in high-dimensional space, where semantically similar texts have high cosine similarity. OpenAI’s default model outputs 1536-dimensional vectors. The embedding model is the ceiling of retrieval quality: if it encodes two related concepts as distant vectors, no index tuning recovers the match — which is why domain-specific corpora sometimes need fine-tuned or specialised models.

See also: Embedding Dimensionality, Cosine Similarity, Semantic Search

Embedding Dimensionality (vector dimensions)

The number of dimensions in an embedding vector (1536 for OpenAI’s default model). Higher dimensionality captures more semantic nuance but costs more storage and search time. The key operational constraint: a vector index’s declared dimension must match the embedding model’s output exactly — a mismatch (index 768 vs vectors 1536) makes a similarity query fail with a dimension-mismatch error.
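
One way to catch the mismatch before it reaches the database is a guard at write time. This is a sketch; `check_dimension` is a hypothetical helper and the 1536 figure is the one quoted above.

```python
def check_dimension(vector, index_dimension):
    """Raise early if a vector's length differs from the index dimension."""
    if len(vector) != index_dimension:
        raise ValueError(
            f"dimension mismatch: vector has {len(vector)}, "
            f"index expects {index_dimension}"
        )
    return vector


embedding = check_dimension([0.0] * 1536, index_dimension=1536)
```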

See also: Embedding, Vector Index

Extract-Enhance-Expand (extract enhance expand, EEE pattern)

A repeatable three-phase pattern for growing a knowledge graph incrementally, applied at every stage of this course. Extract: pull source data into nodes (text → Chunks; CSV → Company / Manager / Address nodes). Enhance: add indexes or computed properties (vector embeddings, full-text indexes, geospatial points). Expand: connect the new nodes into the existing graph with relationships (NEXT, PART_OF, FILED, OWNS_STOCK_IN, LOCATED_AT). Its one assumption — that new data links to existing nodes by a shared identifier — is exactly where it needs an entity-resolution step when identifiers are absent or ambiguous.

See also: LOCATED_AT Relationship, Idempotent Pipeline, Knowledge Graph

Few-Shot Prompt (few-shot prompting, few-shot examples)

A prompt containing a handful of worked example pairs that teach an LLM a pattern by demonstration rather than instruction. For Cypher generation, each example pairs a natural-language question with its correct query, plus the injected schema and a “use only these types; do not hallucinate” instruction — the LLM then generalises to unseen questions. Two to three diverse examples (a filter, an aggregation, a traversal) cover far more ground than many similar ones; past that, returns diminish.
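
The assembly of such a prompt can be sketched as plain string building. The instruction wording, the schema line, and the example queries below are illustrative assumptions, not the course's actual prompt.

```python
def build_cypher_prompt(schema, examples, question):
    """Assemble a few-shot Cypher-generation prompt."""
    lines = [
        "Task: generate a Cypher statement to query a graph database.",
        "Use only the relationship types and properties in the schema.",
        "Do not use any types or properties that are not provided.",
        "",
        "Schema:",
        schema,
        "",
        "Examples:",
    ]
    for example_question, example_cypher in examples:
        lines.append(f"# {example_question}")
        lines.append(example_cypher)
        lines.append("")
    lines.append("The question is:")
    lines.append(question)
    return "\n".join(lines)


schema = "(:Manager)-[:OWNS_STOCK_IN]->(:Company)"  # hypothetical excerpt
examples = [  # diverse shapes: one traversal, one aggregation
    ("Which managers own stock in Example Corp?",
     "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(c:Company {name: 'Example Corp'}) "
     "RETURN m.name"),
    ("How many companies are in the graph?",
     "MATCH (c:Company) RETURN count(c)"),
]
prompt = build_cypher_prompt(schema, examples, "Who owns the most shares?")
```

Keeping the examples in a list makes it easy to add one new query shape at a time and observe what the model starts to generalise.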

See also: GraphCypherQAChain, Progressive Few-Shot Learning

FILED Relationship (FILED)

A directed relationship from a Company to a Form (Company -[:FILED]-> Form) recording that the company filed that SEC document. It is created by matching Company and Form nodes on their shared cusip6 — the concrete realisation of cross-dataset linking. Traversed backward from a form (Form <-[:FILED]- Company), it is the bridge hop that connects document content to the company and, onward, to that company’s investors.

See also: Company Node, CUSIP, Form Node

Form Node (Form node)

A graph node representing an SEC filing document, carrying the metadata its chunks share — formId, source (URL back to the filing), cik, and cusip6. It is the parent that chunks connect up to via PART_OF, and the entry point that links down to the first chunk of each section via SECTION. Its cusip6 property is the bridge to Module 6, where Company nodes join forms through the same CUSIP identifier.

See also: PART_OF Relationship, SECTION Relationship, Chunk Metadata

Full-Text Index (fulltext index, keyword index)

A Neo4j index optimised for string matching — exact, partial, and fuzzy keyword search with relevance scoring — created with CREATE FULLTEXT INDEX ... FOR (n:Label) ON EACH [n.prop] and queried via db.index.fulltext.queryNodes. It complements the other two paradigms: a vector index matches meaning (semantic similarity), a property index does exact lookup by a known id, and full-text matches spelling. Having all three in one engine means keyword and semantic retrieval need no separate search infrastructure.

See also: Vector Index, Semantic Search

genai.vector.encode (genai vector encode)

Neo4j’s built-in function that calls an external embedding API (e.g. OpenAI) from inside Cypher: it takes the text, a provider name ("OpenAI"), and a config map (with the API token passed as a query parameter), and returns a float-array embedding. Paired with db.create.setNodeVectorProperty to store the vector on the node — generating and persisting embeddings without leaving the database.

See also: Embedding, Query Parameter, db.create.setNodeVectorProperty

Geospatial Point (point, location point)

A Neo4j data type representing a location on Earth as latitude/longitude, created with point({ latitude: lat, longitude: lon }) and held on a node property (e.g. Address.location). It is the input to point.distance() for proximity queries, and a point index on the property keeps range searches performant. A common bug is reversing latitude and longitude (in a point’s x/y accessors, x is longitude and y is latitude), so verify coordinate order against the source data.

See also: point.distance(), Address Node

Graph Traversal (traversal)

Retrieval by following typed relationships from a starting node to connected nodes, the graph analogue of adjacency. It answers “what is connected to this?”, surfacing connection-based context (a company’s investors, a document’s sections) that similarity search misses. Cost is roughly $O(k^d)$ in the local fan-out ($k$ = average connections per node, $d$ = hops), versus $d$ self-JOINs over a whole table in SQL.

See also: Knowledge Graph, Vector Similarity Search, Multi-Hop Pattern

Graph-to-Text Context Generation (graph-to-text, triple-to-sentence)

Converting structured graph-traversal results into natural-language sentences an LLM can read as context — turning a relationship triple (subject–predicate–object) into a human-readable statement like “Royal Bank of Canada owns 1.2M shares of NetApp”. It is how graph data enters a RAG prompt: the traversal supplies precise, connected facts, and the sentence form makes them digestible to the model. Module 7 inverts the direction — the LLM generates the query instead of consuming the result.
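
A minimal sketch of the triple-to-sentence step, assuming each traversal row arrives as a dictionary. The row keys, the helper name, and the sample values are hypothetical.

```python
def triple_to_sentence(manager, shares, company):
    """Render one manager-OWNS_STOCK_IN-company row as a sentence."""
    return f"{manager} owns {shares:,} shares of {company}"


rows = [  # hypothetical traversal results
    {"manager": "Example Asset Management", "shares": 1200000,
     "company": "Example Corp"},
    {"manager": "Sample Capital", "shares": 35000,
     "company": "Example Corp"},
]
context = "\n".join(
    triple_to_sentence(r["manager"], r["shares"], r["company"]) for r in rows
)
```

The joined sentences form the context block that is appended to the retrieved chunk text in the prompt.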

See also: Multi-Hop Traversal, Custom Retrieval Query

GraphCypherQAChain (GraphCypherQAChain, LLM-generated Cypher)

LangChain’s chain that turns a natural-language question into a graph answer end to end: the LLM generates Cypher (guided by the injected schema and few-shot examples), Neo4j executes it, and the LLM synthesises the rows into a natural-language reply — question → Cypher → execute → answer. It needs the schema and a good few-shot prompt for reliable generation, and its main risk is schema hallucination (invented relationship types), mitigated by schema injection and validating the generated query before running it.

See also: Few-Shot Prompt, Neo4j Graph (LangChain), RetrievalQA

Hallucination (RAG hallucination)

When an LLM produces plausible but factually wrong output. In RAG it has two distinct failure modes: (1) invention — the model fabricates facts absent from the retrieved context; and (2) mis-attribution — it applies entity A’s retrieved context to entity B (e.g. answering a question about Apple using NetApp’s chunks, because those were the nearest vectors available). Mitigations layer: prompt instructions (“use ONLY this context”; “say you don’t know”), scoping the context to named sources, and retrieval-score gating so out-of-scope questions return no context. Prompt engineering helps but is not foolproof — pair it with a score threshold.

See also: Retrieval-Augmented Generation, RetrievalQA, Score Threshold

Hybrid Search (hybrid retrieval, re-ranking)

Combining vector similarity with another signal — full-text/keyword filtering, metadata pre-filters, or a cross-encoder re-rank — to raise precision without changing the embedding model. It is the standard fix for “semantic near-miss” false positives (results that are topically close but not what was asked). Smaller, more granular chunks help similarly by making each embedding more specific.

See also: Similarity Score, Score Threshold

Idempotent Pipeline (idempotency, exactly-once ingest)

A graph-construction pipeline that produces the same graph no matter how many times it runs — the property retry-prone ingestion needs, since data arrives from unreliable sources (API retries, replays, duplicate messages). Achieved with a uniqueness constraint + MERGE on a stable identifier + ON CREATE SET/ON MATCH SET, giving exactly-once semantics. The production standard for all graph construction.
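
The behaviour can be imitated in memory to see why a retry is harmless. This is an analogy only: a dict keyed by the stable identifier stands in for the uniqueness constraint plus MERGE, and `merge_node` is a hypothetical helper.

```python
def merge_node(store, key, props, now):
    """Create the node if its key is new, otherwise update it in place.

    `store` is a dict keyed by the stable identifier. `now` is passed in
    by the caller so the example stays deterministic.
    """
    node = store.get(key)
    if node is None:
        # ON CREATE SET: runs only the first time the key is seen.
        store[key] = {**props, "created": now, "updated": now}
    else:
        # ON MATCH SET: a re-run refreshes properties but adds no node.
        node.update(props)
        node["updated"] = now
    return store[key]


graph = {}
merge_node(graph, "post-1", {"title": "Hello"}, now=1)
merge_node(graph, "post-1", {"title": "Hello"}, now=2)  # simulated retry
```

After the retry the store still holds one node; only its `updated` value moved, which is the distinction ON CREATE SET and ON MATCH SET express.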

See also: MERGE, Uniqueness Constraint

Identity Through Relationships (role through relationships)

The design pattern where a node’s role is encoded by the typed relationships it participates in, not by extra labels: a Person is an actor because it has an ACTED_IN relationship, a director because it has DIRECTED. It keeps the schema clean and avoids “label explosion.” Where it breaks: when a role needs role-specific properties (an actor’s per-film salary), which then have to live on the relationship and can get unwieldy at scale.

See also: Relationship, Schema-Optional, Label

Knowledge Graph (KG, property graph)

Data stored as interconnected entities — nodes joined by typed, directed relationships, both carrying properties. The “diagram is the data.” Its value for RAG is that retrieval can traverse connections rather than rely on embedding distance alone, surfacing related entities (investors, org charts, filings) that vector similarity cannot reach.

See also: Node, Relationship, Graph Traversal

Label (node label)

A tag applied to a node that groups it with similar entities — Person, Company, Movie. A node can have several labels, and labels enable efficient filtering in queries (MATCH (p:Person)). Distinct from a relationship type: labels classify nodes, types classify connections. Over-using labels for roles is an anti-pattern — prefer identity through relationships.

See also: Node, Identity Through Relationships

LOCATED_AT Relationship (LOCATED_AT)

A directed relationship from a Company or Manager to an Address node, connecting an entity to its physical location. It is the hop that makes geographic questions answerable: traverse LOCATED_AT to reach an entity’s coordinates, then compare with point.distance() to find what’s nearby. It is the Expand step of the Module 7 Extract-Enhance-Expand cycle.

See also: Address Node, Company Node, Manager Node

Manager Node (Manager node, investment manager)

A graph node representing an institutional investment-management firm, keyed by its managerCik and carrying name and address. It connects to Company nodes via OWNS_STOCK_IN relationships that record share count, value, and reporting quarter. Manager names are also indexed with a full-text index, so a user can find a firm by string (“Royal Bank”) even without its CIK.

See also: OWNS_STOCK_IN Relationship, Company Node, Full-Text Index

MATCH Clause (MATCH)

Cypher’s primary read clause: it specifies a graph pattern in ASCII-art notation, and the engine finds every matching subgraph and binds variables to the matched elements. Filter inline by label ((m:Movie)) or property ({name: "Tom"}), or add richer conditions with a WHERE clause. MATCH is the start of nearly every query.

See also: WHERE Clause, Multi-Hop Traversal, RETURN Clause

MERGE (MERGE clause)

The Cypher clause that creates a node or relationship only if it does not already exist (MATCH-then-create in one atomic step). It is the idempotent default for graph construction: re-processing the same data leaves the graph unchanged. Pair it with a uniqueness constraint and ON CREATE SET / ON MATCH SET for initial-vs-updated properties — and always MERGE on a stable identifier, since a volatile property (a changing timestamp) makes MERGE create a fresh element each run.

See also: CREATE, Idempotent Pipeline, Uniqueness Constraint

Multi-Hop Pattern (multi-hop traversal, multi-relationship pattern)

A Cypher pattern that chains several relationships to express an indirect connection — e.g. (a)-[:X]->(b)-[:Y]->(c). Matching subgraphs across multiple hops is the traversal a graph does cheaply and a relational database does expensively (one self-JOIN per hop). Multi-hop patterns are how graph-RAG reaches context several relationships away from the initial match.

See also: Cypher, Graph Traversal

Multi-Hop Traversal (multi-hop query, two-hop)

Following a chain of two or more relationships to reach indirect connections: co-actors, co-directors, friends-of-friends. The shared-middle-node pattern (a)-[:R]->(b)<-[:R]-(c) is the graph’s equivalent of a SQL JOIN, written as a visual pattern. It is cheap where SQL needs nested self-JOINs, but it explodes combinatorially on high-degree nodes (a 50-actor movie yields 50 × 49 = 2,450 ordered co-actor pairs, on the order of $50^2$), so bound it with LIMIT or filters.
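
The fan-out is easy to check by enumeration. The sketch below counts ordered pairs for a hypothetical 50-actor cast; a pattern match cannot reuse the same relationship twice, so an actor is never paired with themselves and the count is n(n − 1), close to n².

```python
from itertools import permutations


def co_actor_pairs(cast):
    """All ordered (a, c) pairs sharing one movie, excluding a == c."""
    return list(permutations(cast, 2))


cast = [f"actor_{i}" for i in range(50)]  # hypothetical 50-actor movie
pairs = co_actor_pairs(cast)
pair_count = len(pairs)                   # 50 * 49
```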

See also: MATCH Clause, Graph Traversal

Neo4j Graph (LangChain) (Neo4jGraph, kg.query)

LangChain’s wrapper class for Neo4j connections. Constructed with a url, username, and password (loaded from environment variables, not hard-coded), it exposes a query() method that sends a Cypher string to the database and returns results as a list of Python dictionaries, handling connection pooling and authentication. It is the bridge between application code and the graph used throughout the course.

See also: Cypher, MATCH Clause

Neo4jVector (Neo4jVector.from_existing_graph)

LangChain’s vector-store interface for Neo4j — it makes the graph look like a standard vector store, running Cypher similarity search internally. from_existing_graph connects to an already-built index by naming the index_name, the node_label (e.g. Chunk), the text property (text_node_properties), and the embedding property (embedding_node_property); calling .as_retriever() yields a retriever for a RAG chain. Currency: it now lives in the dedicated langchain-neo4j package (2024-11), not langchain_community.

See also: Vector Index, RetrievalQA, Neo4j Graph (LangChain)

NEXT Relationship (NEXT, chunk linked list)

A directed relationship connecting sequential chunks within the same form section, forming a singly-linked list that preserves document order. Built by matching chunks whose chunkSeqId differ by one and that share formId and item — the section filter is mandatory, or the list bleeds across section boundaries. Created with MERGE for idempotency. NEXT is what makes sequential traversal and chunk-window retrieval possible; the same pattern models chat threads, audit trails, and event streams.
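
The linking rule can be sketched in memory to show why the section filter matters. The field names follow the chunk metadata described in this glossary; the records and the `build_next_links` helper are hypothetical.

```python
from collections import defaultdict


def build_next_links(chunks):
    """Return (from_id, to_id) NEXT pairs, linking only within a section."""
    sections = defaultdict(list)
    for chunk in chunks:
        # The section filter: group by the same formId and the same item.
        sections[(chunk["formId"], chunk["item"])].append(chunk)

    links = []
    for members in sections.values():
        members.sort(key=lambda c: c["chunkSeqId"])
        for current, following in zip(members, members[1:]):
            if following["chunkSeqId"] == current["chunkSeqId"] + 1:
                links.append((current["chunkId"], following["chunkId"]))
    return links


chunks = [  # hypothetical chunk records
    {"chunkId": "f1-item1-0", "formId": "f1", "item": "item1", "chunkSeqId": 0},
    {"chunkId": "f1-item1-1", "formId": "f1", "item": "item1", "chunkSeqId": 1},
    {"chunkId": "f1-item1a-0", "formId": "f1", "item": "item1a", "chunkSeqId": 0},
]
links = build_next_links(chunks)
```

Without the grouping step, the last chunk of item1 and the first chunk of item1a would look adjacent and the list would cross the section boundary.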

See also: Chunk Window, Variable-Length Path, Relationship

Node (vertex)

A data record representing an entity in a knowledge graph, written (...) in Cypher. It carries zero or more labels (for grouping) and zero or more properties (key-value pairs). Equivalent to a graph-theory vertex, but “node” is preferred for data modelling. Examples: a Person node, a Company node, a Movie node.

See also: Relationship, Label, Property

OWNS_STOCK_IN Relationship (OWNS_STOCK_IN)

A directed relationship from a Manager to a Company representing a stock holding, with properties for share count, monetary value, and reporting quarter. One manager can hold multiple OWNS_STOCK_IN edges to the same company across different quarters, so the reporting period is part of the relationship’s identity (it is MERGE’d on reportCalendarOrQuarter). In the course data, 561 of these edges all point at NetApp — the investment side of the multi-hop chunk → form → company → manager traversal.

See also: Manager Node, Company Node, FILED Relationship

PART_OF Relationship (PART_OF)

A directed relationship from a Chunk to its parent Form (Chunk -[:PART_OF]-> Form), establishing document membership. It lets you traverse up from any chunk back to its source filing — and from there onward to connected entities (companies, investors) in later modules. The general pattern (member -[:PART_OF]-> container) generalises to any document hierarchy: chunk → section → chapter → book.

See also: Form Node, SECTION Relationship, Relationship

point.distance() (point distance, geodesic distance)

A Neo4j function that returns the geodesic (great-circle) distance between two geospatial points, in metres. Used in a WHERE clause for radius queries — WHERE point.distance(a.location, b.location) < 10000 finds entities within 10 km — and divided by 1000 for kilometres. The unit is the classic trap: 10 km is 10000, not 10; 50 miles is 80467. Pair it with a point index, or the query degrades to an O(n) scan of every candidate node.
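
The unit handling can be checked outside the database with a haversine calculation on a spherical Earth. This is a sketch, not Neo4j's implementation: the radius constant is an assumed mean value, so results may differ slightly from point.distance().

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed value)
METRES_PER_MILE = 1609.344


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(lat1, lon1, lat2, lon2, radius_m):
    """Radius filter; the threshold is in metres, like point.distance()."""
    return haversine_m(lat1, lon1, lat2, lon2) < radius_m


ten_km = 10 * 1000                         # 10 km expressed in metres
fifty_miles = round(50 * METRES_PER_MILE)  # miles converted to metres
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
```

Converting the radius to metres once, in a named constant, avoids the mistake of comparing a metre distance against a kilometre or mile figure.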

See also: Geospatial Point, Address Node

Progressive Few-Shot Learning (progressive few-shot, incremental examples)

The practice of adding few-shot examples incrementally, where each example unlocks a new query capability the LLM generalises from — a city-filter example enables filtering for any entity, a point.distance() example enables geospatial search, a full-text-plus-SECTION example enables document navigation. The guiding principle is diversity over quantity: two or three examples spanning different query shapes (filter, aggregate, traverse) outperform ten variations of one.

See also: Few-Shot Prompt, GraphCypherQAChain

Property (property key-value)

A key-value pair stored on a node or a relationship (keys are strings; values are strings, numbers, booleans, temporal values, spatial points, or lists of these), e.g. {name: "Andreas", born: 1975}. Crucially, relationships carry properties too: a since year on a WORKS_AT describes the connection itself, which is part of why a relationship is more than a graph-theory edge.

See also: Node, Relationship

Property Graph (labeled property graph)

The data model Neo4j implements: nodes and relationships that both carry properties, with labels grouping nodes and types+direction on relationships carrying semantics. It contrasts with the RDF triple model (subject–predicate–object); the property-graph model is what makes a relationship a rich record rather than a bare link, and underlies every example in this guide.

See also: Node, Relationship, Knowledge Graph

Query Parameter (Cypher parameter, params)

A value passed to a Cypher query via the params argument and referenced with a $ prefix (e.g. $openAiApiKey). It keeps secrets and user input out of the query string — no injection, no leakage into logs or the query cache — and lets Neo4j cache the execution plan across calls with different values. The one exception: DDL statements (CREATE INDEX, CREATE CONSTRAINT) can’t be parameterised, so sanitise those inputs manually.

See also: genai.vector.encode

Relationship (edge)

A directed, typed connection between two nodes — written -[:TYPE]-> in Cypher. It is itself a rich record: a start and end node, a type (ACTED_IN, OWNS_STOCK_IN), a direction, and optional properties. Preferred over the graph-theory term “edge” because it conveys richness beyond a bare link — type and direction are the semantics.

See also: Node, Identity Through Relationships, Property

Retrieval-Augmented Generation (RAG)

A system architecture that retrieves relevant context from an external store and injects it into the LLM prompt before generation, grounding the response in evidence and reducing hallucination. Basic RAG retrieves by vector similarity; graph-RAG adds traversal-based retrieval, so context can follow connections, not just semantic closeness.

See also: Vector Similarity Search, Graph Traversal

RetrievalQA (retrieval-QA chain, RAG chain)

LangChain’s question-answering chain that wires a retriever (vector search) to an LLM: the retriever finds relevant chunks, and the "stuff" chain type injects them into the prompt as context for the LLM to answer from. A custom PromptTemplate controls behaviour — crucially, the refusal instruction (“say you don’t know”) that curbs hallucination. Currency: RetrievalQA is deprecated since LangChain 0.1.17 (removal repeatedly deferred; still importable in the 1.x line); the current replacement is create_retrieval_chain with create_stuff_documents_chain. The connect → retrieve → generate concept is unchanged.

See also: Neo4jVector, Retrieval-Augmented Generation, Hallucination

RETURN Clause (RETURN, projection)

The Cypher clause that projects results: only the properties and aggregations it names are extracted and returned (RETURN p.name, count(m)). It is the final stage of execution, and returning whole nodes (RETURN p) versus specific properties (RETURN p.name) is a bandwidth/clarity choice — project just what the caller needs.

See also: Cypher Execution Model, MATCH Clause

Schema-Optional (schema flexibility)

A knowledge graph’s property of allowing new node labels and relationship types without DDL migrations — flexible where a relational schema is fixed and requires ALTER TABLE. The caveat: optional, not free. Without naming standards and uniqueness constraints the graph becomes a tangled mess, so production knowledge graphs still need governance even without an enforced schema.

See also: Knowledge Graph, Label

Score Threshold (similarity threshold)

A minimum similarity score below which results are discarded, so weak context never reaches the LLM (where it would invite hallucination). There is no universal value — calibrate against the corpus’s own score distribution: raise it for precision (customer-facing), lower it for recall (research). Because scores are relative, a threshold tuned on one corpus rarely transfers to another.
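
Score gating can be sketched as a simple filter over (text, score) results. The scores and the 0.80 and 0.95 thresholds below are hypothetical; a real value has to come from the corpus's own score distribution.

```python
def gate_by_score(results, threshold):
    """Keep only (text, score) results at or above the threshold."""
    return [(text, score) for text, score in results if score >= threshold]


results = [  # hypothetical search results
    ("chunk about storage products", 0.91),
    ("chunk about office leases", 0.62),
]
kept = gate_by_score(results, threshold=0.80)
nothing = gate_by_score(results, threshold=0.95)  # out of scope: no context
```

An empty result is a useful signal in itself: the application can answer "I don't know" instead of passing weak context to the LLM.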

See also: Similarity Score

SEC Form 10-K (10-K, Form 10-K, annual report)

The annual report public companies file with the U.S. Securities and Exchange Commission — long, standardised business text that makes excellent knowledge-graph source data. The sections that matter for construction are Item 1 (business description), Item 1A (risk factors), Item 7 (management discussion & analysis), and Item 7A (market-risk disclosures). Raw filings are XML; pre-processing extracts these sections plus the CIK and CUSIP identifiers into JSON for ingestion.

See also: EDGAR, Chunk Metadata

SEC Form 13 (Form 13, 13F)

A filing made quarterly by institutional investment-management firms reporting their holdings of public-company stock. Each record carries the manager’s identity (name, CIK, address), the company’s identity (via CUSIP), and the position (share count, monetary value, reporting quarter). In this guide, Form 13 is the second dataset: connecting it to the existing Form 10-K graph turns documents into a graph of who invests in whom, answering questions vector search alone cannot.

See also: CUSIP, OWNS_STOCK_IN Relationship, SEC Form 10-K

SECTION Relationship (SECTION)

A directed relationship from a Form to the first chunk (sequence id 0) of each section, carrying an f10kItem property that names the section (e.g. "item1"). It gives the form indexed entry points into the document: a query can jump straight to where a section starts, then follow NEXT through its content — hierarchical navigation without scanning the whole linked list.

See also: Form Node, PART_OF Relationship, NEXT Relationship

Semantic Search (meaning-based search)

Retrieval by meaning rather than exact words: encode the query and the corpus into the same embedding space and rank by similarity, so “movies about love” matches a tagline that never contains the word “love.” It is the capability a vector index adds to the graph — and it complements (does not replace) graph traversal, which finds connected entities rather than similar text.

See also: Embedding, db.index.vector.queryNodes

Similarity Score (vector score)

The number vector search returns for each result — for cosine, 0 (unrelated) to 1 (identical direction) — used to rank results and to filter weak ones. Crucially, scores are relative to the embedding model and corpus: 0.85 on movie taglines is not the same quality as 0.85 on SEC filings. Read the distribution (a clear winner vs a flat band) rather than trusting an absolute value.

See also: Cosine Similarity, Score Threshold

Text Chunking (chunking, text splitting)

Splitting a long document into smaller, overlapping segments so each fits an embedding model and maps to one retrievable unit. Chunk size sets granularity (too large → generic embeddings, weak precision; too small → fragments that lose context); overlap carries context across boundaries so a fact split between two chunks survives in at least one. The number of chunks is roughly $\lceil (L-O)/(S-O) \rceil$ for length $L$, size $S$, overlap $O$; the effective advance is size − overlap. The size is a tuning parameter; the splitting strategy (fixed-size vs structure-aware/recursive) is a design decision driven by the document’s shape.
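
The count formula can be checked against a simple fixed-size splitter. This is a character-based sketch with hypothetical helper names, not the structure-aware splitter a real pipeline might use.

```python
import math


def expected_chunks(length, size, overlap):
    """Chunk count from ceil((L - O) / (S - O)), with a floor of 1."""
    return max(1, math.ceil((length - overlap) / (size - overlap)))


def chunk_text(text, size, overlap):
    """Fixed-size splitter: each chunk starts size - overlap after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += size - overlap
    return chunks


text = "x" * 1000
chunks = chunk_text(text, size=200, overlap=50)
count = expected_chunks(len(text), size=200, overlap=50)
```

For 1000 characters with size 200 and overlap 50, the chunks start at 0, 150, 300, and so on up to 900, and the formula gives the same count as the splitter.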

See also: Chunk Metadata, Embedding

Uniqueness Constraint (unique constraint)

A Neo4j constraint — CREATE CONSTRAINT ... FOR (p:Post) REQUIRE p.postId IS UNIQUE — that guarantees at most one node per key value and creates an implicit index in the process. It underpins idempotent, MERGE-based ingestion: correctness (no duplicates) and performance (the index speeds the MERGE existence check). Part of the production idempotent-ingest pattern.

See also: MERGE, Idempotent Pipeline

Variable Naming (Cypher) (meaningful variable names)

The practice of naming Cypher pattern variables for what they hold — tom, coActor, m for Movie — rather than n1/n2. It has zero runtime cost (the engine ignores the names) but a large readability payoff: a multi-hop pattern with meaningful names is self-documenting, while cryptic names force the next reader to decode it. A maintainability investment, part of “design readable queries” (KGR-2.5).

See also: MATCH Clause

Variable-Length Path (variable-length relationship, *0..1)

A Cypher pattern that matches a range of relationship hops rather than a fixed number: *0..1 matches zero or one hop, *1..3 matches one to three, *0..2 zero to two. It is the standard fix for boundary conditions where a fixed-length pattern returns nothing — e.g. a chunk window at the first or last chunk of a linked list, where the missing side simply matches zero hops. Pairing it with ORDER BY length(window) DESC LIMIT 1 keeps the longest window available at any position.

See also: Chunk Window, NEXT Relationship, Graph Traversal

Vector Index (Neo4j vector index)

A Neo4j index for approximate nearest-neighbour search over a node property that holds embeddings, created with a fixed dimensionality and similarity function (CREATE VECTOR INDEX … OPTIONS {indexConfig: {…}}). Two hard requirements: the declared dimension must match the embedding model’s output exactly, and the index must reach ONLINE status (not POPULATING) before searches return complete results.

See also: Embedding, Cosine Similarity, db.index.vector.queryNodes

Vector Similarity Search (vector search, similarity search)

Retrieval that encodes text as high-dimensional embeddings and finds nearest neighbours by a distance metric (cosine). Fast and effective for semantic matching — “what text is similar to my question?” — but it cannot follow structural relationships between entities. That gap is exactly what graph traversal fills in a graph-RAG system.

See also: Graph Traversal, Retrieval-Augmented Generation

WHERE Clause (WHERE)

Filters matched patterns by conditions beyond inline equality — comparisons (>, <, =), boolean logic (AND/OR/NOT), string matching, and regular expressions. Rule of thumb: use inline {...} property matching for simple equality, and WHERE for ranges or complex logic. MATCH (m:Movie) WHERE m.released > 2000 is the canonical example.

See also: MATCH Clause