Flashcards

Study the glossary as flashcards: shuffled, one term at a time — recall the definition, flip to check, and sort each card into "knew it" / "still learning".

Address Node

A graph node representing a physical location — city, state, and a geospatial location property (a latitude/longitude point). Connected to Company and Manager nodes via LOCATED_AT relationships, it adds the geography dimension to the graph and enables distance-based discovery with point.distance(). Address nodes are themselves an instance of Extract-Enhance-Expand: extract from CSV, enhance with a geospatial point, expand with LOCATED_AT.

Chunk Metadata

The structured fields stored alongside a chunk’s text that preserve its origin, position, and identity: section (item), sequence position (chunkSeqId), the parent filing (formId), a unique key (chunkId), a source link, and entity ids (cik, cusip6). Metadata is what lets you rebuild document structure and link chunks to graph entities later — chunkSeqId in particular preserves in-section order so Module 5 can wire chunks into a NEXT linked list. You cannot reconstruct provenance you didn’t capture at ingest, so design the schema before loading.

Chunk Window

A retrieval pattern that returns a target chunk plus its neighbours in the NEXT linked list, giving the LLM continuous context instead of one isolated chunk. A window of $w = 2k+1$ chunks uses $k$ hops in each direction (default $k=1$ → a 3-chunk window). Built with variable-length paths (*0..k) so it degrades gracefully at list boundaries. The payoff is concrete: expanded context surfaces details an adjacent chunk held (e.g. a product name) that the single best-matching chunk missed, improving completeness with no change to the embedding model.
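
The window arithmetic can be sketched in plain Python by modelling one section's NEXT linked list as an ordered list. The function name `chunk_window` and the chunk ids are hypothetical; in the graph itself the clamping at the ends comes from the variable-length path rather than from index arithmetic.

```python
def chunk_window(chunks, index, k=1):
    """Return the target chunk plus up to k neighbours on each side.

    `chunks` stands in for one section's NEXT linked list, in document order.
    """
    if not 0 <= index < len(chunks):
        raise IndexError("index is outside the chunk list")
    start = max(0, index - k)              # clamp at the head of the list
    end = min(len(chunks), index + k + 1)  # clamp at the tail of the list
    return chunks[start:end]


section = ["c0", "c1", "c2", "c3", "c4"]  # hypothetical chunk ids
middle = chunk_window(section, 2)          # full window of 2k + 1 = 3 chunks
first = chunk_window(section, 0)           # boundary: only the right neighbour exists
wide = chunk_window(section, 4, k=2)       # boundary with k = 2
```

At a boundary the window simply shrinks instead of returning nothing, which is the behaviour the *0..k path gives in Cypher.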

Co-Located Vector Store

Storing embeddings as node properties in the same database as the graph (Neo4j) rather than in a separate vector database. The win: one query can combine vector similarity with graph traversal — retrieve a chunk and follow its relationships — in a single round trip, with one system to operate. The trade-off: you depend on Neo4j’s embedding/ANN support, where a best-of-breed vector DB may offer more algorithms and easier model swaps for large standalone vector workloads.

Company Node

A graph node representing a public company, keyed by its cusip6 and created idempotently with MERGE. It is the hub that joins the two datasets: a FILED edge links it to the Form 10-K it filed (matched on shared CUSIP), and incoming OWNS_STOCK_IN edges link it to every Manager that holds its stock. Building Company nodes on the CUSIP key is what lets the independently created Form 13 and Form 10-K data connect with no manual matching.

Cosine Similarity

A similarity metric measuring the angle between two vectors, ignoring magnitude: $\cos\theta = \frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$, ranging from −1 to 1 (Neo4j’s vector-index score normalises it to 0–1). It is OpenAI’s recommended function, and for normalised vectors it ranks results identically to Euclidean distance, so prefer Euclidean only when vector magnitude carries meaning (e.g. raw TF-IDF). The standard choice for embedding search.
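
A small Python sketch of the formula, with a check that cosine and Euclidean rankings agree once vectors are normalised. The vectors and names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| * |B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


query = normalise([1.0, 1.0])
docs = {  # hypothetical 2-d "embeddings"
    "a": normalise([2.0, 2.1]),
    "b": normalise([1.0, 0.0]),
    "c": normalise([-1.0, 0.5]),
}

# Highest cosine first versus smallest distance first.
by_cosine = sorted(docs, key=lambda n: cosine_similarity(query, docs[n]), reverse=True)
by_euclid = sorted(docs, key=lambda n: euclidean_distance(query, docs[n]))
```

The two orderings match because, for unit vectors, the squared Euclidean distance equals $2 - 2\cos\theta$, so one is a monotonic function of the other.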

CREATE

The Cypher clause that unconditionally creates a node or relationship — even if an identical one already exists. It is fast (no existence check), but dangerous for relationships in retry-prone pipelines, where re-running the same CREATE silently inserts duplicates. Reserve it for guaranteed-unique insertions; use MERGE everywhere a query might re-run.

CUSIP

Committee on Uniform Security Identification Procedures — a 9-character code uniquely identifying a financial instrument; the first 6 characters (CUSIP 6) identify the issuing company. Because both the Form 10-K and Form 13 datasets carry it, CUSIP 6 is the linking key that joins independently created Company and Form nodes automatically (WHERE com.cusip6 = f.cusip6). It is the worked example of cross-dataset linking via a shared universal identifier — the pattern that fails (needing entity resolution) only when datasets use different identifier schemes.

Custom Retrieval Query

A Cypher query passed to Neo4jVector via the retrieval_query argument that extends the default vector search. It runs after the index match, receiving the matched node and its score, and can perform extra graph traversal before returning context to the LLM — for example following NEXT to build a chunk window. This is the hook that fuses vector search (find the entry point) with graph traversal (expand context) into a single retrieval step.

Cypher

Neo4j’s declarative, pattern-matching query language. It uses ASCII-art notation that mirrors a whiteboard diagram — () for nodes, [] for relationships, -> for direction — so you draw the pattern you want and Cypher returns every matching subgraph. Self-documenting: MATCH (p:Person)-[:ACTED_IN]->(m:Movie) reads as the graph structure it matches.

Cypher Execution Model

The four stages a Cypher query passes through: parse (string → abstract syntax tree), plan (the optimiser chooses indexes, scan order, join strategy), execute (traverse the graph hop by hop, binding variables), and project (the RETURN clause extracts only the requested properties/aggregations). EXPLAIN shows the plan without running; PROFILE runs it and reports actual per-step row counts — the tools for tuning a slow query.

db.create.setNodeVectorProperty

Neo4j’s procedure that writes a computed embedding onto a node as a vector property — the storage step that follows genai.vector.encode. Persisting the vector on the node (rather than in a separate store) is exactly what makes the data co-located, enabling a single query to later combine similarity search with graph traversal.

db.index.vector.queryNodes

Neo4j’s procedure for top-K approximate nearest-neighbour search over a vector index. It takes the index name, a topK count, and a query vector, and YIELDs the matched nodes with a similarity score. It is the search half of graph-RAG retrieval: encode the question (with genai.vector.encode), then queryNodes to find the closest stored embeddings.
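
A hedged sketch of the encode-then-search step, wrapped in Python. It assumes `graph` is any object exposing `query(cypher, params=...)` (for example LangChain's Neo4j wrapper); the index name `form_10k_chunks`, the `text` property, and the function name `vector_search` are hypothetical.

```python
VECTOR_SEARCH_CYPHER = """
WITH genai.vector.encode($question, "OpenAI", {token: $openAiApiKey}) AS question_embedding
CALL db.index.vector.queryNodes($index_name, $top_k, question_embedding)
YIELD node, score
RETURN score, node.text AS text
"""


def vector_search(graph, question, api_key, index_name="form_10k_chunks", top_k=5):
    """Encode the question inside Cypher, then fetch the top_k nearest nodes."""
    return graph.query(
        VECTOR_SEARCH_CYPHER,
        params={
            "question": question,
            "openAiApiKey": api_key,  # passed as a parameter, never inlined
            "index_name": index_name,
            "top_k": top_k,
        },
    )
```

Each returned row carries the similarity score alongside the chunk text, so the caller can rank or filter before building a prompt.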

DELETE

The Cypher clause that removes matched nodes or relationships (returns no data). It cannot delete a node that still has relationships — Neo4j protects referential integrity — so you must delete the relationships first, or use DETACH DELETE to remove the node and its relationships together.

DETACH DELETE

A DELETE variant that removes a node and all of its relationships in a single step, satisfying referential integrity without manually deleting each connecting relationship first. It is the safe, idiomatic way to drop a connected node — a plain DELETE on such a node fails.

EDGAR

Electronic Data Gathering, Analysis, and Retrieval — the SEC’s public database for corporate filings, including Form 10-K and Form 13. It is the open access point for the raw documents that seed a financial knowledge graph: every filing is keyed by the company’s CIK (Central Index Key), the identifier that links a company across all of its filings.

Embedding

A dense vector representation of text in high-dimensional space, where semantically similar texts have high cosine similarity. OpenAI’s default model outputs 1536-dimensional vectors. The embedding model is the ceiling of retrieval quality: if it encodes two related concepts as distant vectors, no index tuning recovers the match — which is why domain-specific corpora sometimes need fine-tuned or specialised models.

Embedding Dimensionality

The number of dimensions in an embedding vector (1536 for OpenAI’s default model). Higher dimensionality captures more semantic nuance but costs more storage and search time. The key operational constraint: a vector index’s declared dimension must match the embedding model’s output exactly — a mismatch (index 768 vs vectors 1536) makes a similarity query fail with a dimension-mismatch error.

Extract-Enhance-Expand

A repeatable three-phase pattern for growing a knowledge graph incrementally, applied at every stage of this course. Extract: pull source data into nodes (text → Chunks; CSV → Company / Manager / Address nodes). Enhance: add indexes or computed properties (vector embeddings, full-text indexes, geospatial points). Expand: connect the new nodes into the existing graph with relationships (NEXT, PART_OF, FILED, OWNS_STOCK_IN, LOCATED_AT). Its one assumption — that new data links to existing nodes by a shared identifier — is exactly where it needs an entity-resolution step when identifiers are absent or ambiguous.

Few-Shot Prompt

A prompt containing a handful of worked example pairs that teach an LLM a pattern by demonstration rather than instruction. For Cypher generation, each example pairs a natural-language question with its correct query, plus the injected schema and a “use only these types; do not hallucinate” instruction — the LLM then generalises to unseen questions. Two to three diverse examples (a filter, an aggregation, a traversal) cover far more ground than many similar ones; past that, returns diminish.
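
One way to assemble such a prompt, sketched in Python. The three example queries cover a filter, an aggregation, and a traversal; property names such as `a.city`, `m.name`, and `chunkId`, and the helper `build_cypher_prompt`, are assumptions for illustration rather than the course's actual prompt.

```python
FEW_SHOT_EXAMPLES = [
    # (question, Cypher) pairs; property names here are assumptions
    (
        "Which managers are located in San Francisco?",
        "MATCH (m:Manager)-[:LOCATED_AT]->(a:Address) "
        "WHERE a.city = 'San Francisco' RETURN m.name",
    ),
    (
        "How many managers hold stock in each company?",
        "MATCH (m:Manager)-[:OWNS_STOCK_IN]->(c:Company) "
        "RETURN c.cusip6, count(m) AS investors",
    ),
    (
        "Which company filed the form a given chunk belongs to?",
        "MATCH (chunk:Chunk {chunkId: $chunkId})-[:PART_OF]->(f:Form)<-[:FILED]-(c:Company) "
        "RETURN c.cusip6",
    ),
]


def build_cypher_prompt(schema, examples, question):
    """Combine instruction, injected schema, worked examples, and the new question."""
    lines = [
        "Generate a Cypher query for the question below.",
        "Use only the node labels, relationship types, and properties in the schema.",
        "Do not invent types that are not listed.",
        "",
        "Schema:",
        schema,
        "",
    ]
    for example_question, example_cypher in examples:
        lines += [f"# {example_question}", example_cypher, ""]
    lines.append(f"# {question}")
    return "\n".join(lines)
```

The prompt ends on the unanswered question, so the model's natural continuation is the query itself.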

FILED Relationship

A directed relationship from a Company to a Form (Company -[:FILED]-> Form) recording that the company filed that SEC document. It is created by matching Company and Form nodes on their shared cusip6 — the concrete realisation of cross-dataset linking. Traversed backward from a form (Form <-[:FILED]- Company), it is the bridge hop that connects document content to the company and, onward, to that company’s investors.

Form Node

A graph node representing an SEC filing document, carrying the metadata its chunks share — formId, source (URL back to the filing), cik, and cusip6. It is the parent that chunks connect up to via PART_OF, and the entry point that links down to the first chunk of each section via SECTION. Its cusip6 property is the bridge to Module 6, where Company nodes join forms through the same CUSIP identifier.

Full-Text Index

A Neo4j index optimised for string matching — exact, partial, and fuzzy keyword search with relevance scoring — created with CREATE FULLTEXT INDEX ... FOR (n:Label) ON EACH [n.prop] and queried via db.index.fulltext.queryNodes. It complements the other two paradigms: a vector index matches meaning (semantic similarity), a property index does exact lookup by a known id, and full-text matches spelling. Having all three in one engine means keyword and semantic retrieval need no separate search infrastructure.

genai.vector.encode

Neo4j’s built-in function that calls an external embedding API (e.g. OpenAI) from inside Cypher: it takes the text, a provider name ("OpenAI"), and a config map (with the API token passed as a query parameter), and returns a float-array embedding. Paired with db.create.setNodeVectorProperty to store the vector on the node — generating and persisting embeddings without leaving the database.

Geospatial Point

A Neo4j data type representing a location on Earth as latitude/longitude, stored with point({ latitude: x, longitude: y }) and held on a node property (e.g. Address.location). It is the input to point.distance() for proximity queries, and a point index on the property keeps range searches performant. A common bug is reversing latitude and longitude — verify coordinate order against the source data.

Graph Traversal

Retrieval by following typed relationships from a starting node to connected nodes, the graph analogue of adjacency. It answers “what is connected to this?”, surfacing connection-based context (a company’s investors, a document’s sections) that similarity search misses. Cost is roughly $O(k^d)$ in the local fan-out ($k$ = average connections per node, $d$ = hops), versus $d$ self-JOINs over a whole table in SQL.

Graph-to-Text Context Generation

Converting structured graph-traversal results into natural-language sentences an LLM can read as context — turning a relationship triple (subject–predicate–object) into a human-readable statement like “Royal Bank of Canada owns 1.2M shares of NetApp”. It is how graph data enters a RAG prompt: the traversal supplies precise, connected facts, and the sentence form makes them digestible to the model. Module 7 inverts the direction — the LLM generates the query instead of consuming the result.
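
A minimal Python sketch of the triple-to-sentence step. The row keys (`manager`, `shares`, `company`) and the helper names are hypothetical; a real pipeline would take them from whatever the traversal query returns.

```python
def format_shares(count):
    """Abbreviate a share count for prose, e.g. 1200000 becomes '1.2M'."""
    if count >= 1_000_000:
        return f"{count / 1_000_000:.1f}M"
    if count >= 1_000:
        return f"{count / 1_000:.1f}K"
    return str(count)


def holding_to_sentence(row):
    """Turn one (manager, OWNS_STOCK_IN, company) row into a readable statement."""
    return f"{row['manager']} owns {format_shares(row['shares'])} shares of {row['company']}"


rows = [  # hypothetical traversal results
    {"manager": "Royal Bank of Canada", "shares": 1_200_000, "company": "NetApp"},
]
context = "\n".join(holding_to_sentence(row) for row in rows)
```

The joined sentences can then be placed in the prompt's context slot like any retrieved text.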

GraphCypherQAChain

LangChain’s chain that turns a natural-language question into a graph answer end to end: the LLM generates Cypher (guided by the injected schema and few-shot examples), Neo4j executes it, and the LLM synthesises the rows into a natural-language reply — question → Cypher → execute → answer. It needs the schema and a good few-shot prompt for reliable generation, and its main risk is schema hallucination (invented relationship types), mitigated by schema injection and validating the generated query before running it.

Hallucination

When an LLM produces plausible but factually wrong output. In RAG it has two distinct failure modes: (1) invention — the model fabricates facts absent from the retrieved context; and (2) mis-attribution — it applies entity A’s retrieved context to entity B (e.g. answering a question about Apple using NetApp’s chunks, because those were the nearest vectors available). Mitigations layer: prompt instructions (“use ONLY this context”; “say you don’t know”), scoping the context to named sources, and retrieval-score gating so out-of-scope questions return no context. Prompt engineering helps but is not foolproof — pair it with a score threshold.

Idempotent Pipeline

A graph-construction pipeline that produces the same graph no matter how many times it runs — the property retry-prone ingestion needs, since data arrives from unreliable sources (API retries, replays, duplicate messages). Achieved with a uniqueness constraint + MERGE on a stable identifier + ON CREATE SET/ON MATCH SET, giving exactly-once semantics. The production standard for all graph construction.
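
The constraint-plus-MERGE recipe might look like the sketch below. It assumes `graph` exposes `query(cypher, params=...)`; the constraint name, the `name` property, and `ingest_companies` are illustrative choices, not the course's exact code.

```python
CONSTRAINT_CYPHER = """
CREATE CONSTRAINT unique_company IF NOT EXISTS
FOR (c:Company) REQUIRE c.cusip6 IS UNIQUE
"""

MERGE_COMPANY_CYPHER = """
MERGE (c:Company {cusip6: $cusip6})
  ON CREATE SET c.name = $name
  ON MATCH SET c.name = $name
"""


def ingest_companies(graph, companies):
    """Safe to re-run: the key is stable and every write goes through MERGE."""
    graph.query(CONSTRAINT_CYPHER)  # IF NOT EXISTS makes this repeatable too
    for company in companies:
        graph.query(
            MERGE_COMPANY_CYPHER,
            params={"cusip6": company["cusip6"], "name": company["name"]},
        )
```

Only the stable key appears inside the MERGE pattern; everything else is set afterwards, so a changed name updates the existing node instead of creating a second one.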

Identity Through Relationships

The design pattern where a node’s role is encoded by the typed relationships it participates in, not by extra labels: a Person is an actor because it has an ACTED_IN relationship, a director because it has DIRECTED. It keeps the schema clean and avoids “label explosion.” Where it breaks: when a role needs role-specific properties (an actor’s per-film salary), which then have to live on the relationship and can get unwieldy at scale.

Knowledge Graph

Data stored as interconnected entities: nodes joined by typed, directed relationships, both carrying properties. The “diagram is the data.” Its value for RAG is that retrieval can traverse connections rather than rely on embedding distance alone, surfacing related entities (investors, org charts, filings) that vector similarity cannot reach.

Label

A tag applied to a node that groups it with similar entities (Person, Company, Movie). A node can have several labels, and labels enable efficient filtering in queries (MATCH (p:Person)). Distinct from a relationship type: labels classify nodes, types classify connections. Over-using labels for roles is an anti-pattern; prefer identity through relationships.

LOCATED_AT Relationship

A directed relationship from a Company or Manager to an Address node, connecting an entity to its physical location. It is the hop that makes geographic questions answerable: traverse LOCATED_AT to reach an entity’s coordinates, then compare with point.distance() to find what’s nearby. The Expand step of the Module 7 Extract-Enhance-Expand cycle.

Manager Node

A graph node representing an institutional investment-management firm, keyed by its managerCik and carrying name and address. It connects to Company nodes via OWNS_STOCK_IN relationships that record share count, value, and reporting quarter. Manager names are also indexed with a full-text index, so a user can find a firm by string (“Royal Bank”) even without its CIK.

MATCH Clause

Cypher’s primary read clause: it specifies a graph pattern in ASCII-art notation, and the engine finds every matching subgraph and binds variables to the matched elements. Filter inline by label ((m:Movie)) or property ({name: "Tom"}), or add richer conditions with a WHERE clause. MATCH is the start of nearly every query.

MERGE

The Cypher clause that creates a node or relationship only if it does not already exist (MATCH-then-create in one atomic step). It is the idempotent default for graph construction: re-processing the same data leaves the graph unchanged. Pair it with a uniqueness constraint and ON CREATE SET / ON MATCH SET for initial-vs-updated properties — and always MERGE on a stable identifier, since a volatile property (a changing timestamp) makes MERGE create a fresh element each run.

Multi-Hop Pattern

A Cypher pattern that chains several relationships to express an indirect connection — e.g. (a)-[:X]->(b)-[:Y]->(c). Matching subgraphs across multiple hops is the traversal a graph does cheaply and a relational database does expensively (one self-JOIN per hop). Multi-hop patterns are how graph-RAG reaches context several relationships away from the initial match.

Multi-Hop Traversal

Following a chain of two or more relationships to reach indirect connections: co-actors, co-directors, friends-of-friends. The shared middle node (a)-[:R]->(b)<-[:R]-(c) is the graph’s equivalent of a SQL JOIN, written as a visual pattern. Cheap where SQL needs nested self-JOINs, but it explodes combinatorially on high-degree nodes (a 50-actor movie yields $50^2$ co-actor pairs), so bound it with LIMIT or filters.

Neo4jGraph (LangChain)

LangChain’s wrapper class for Neo4j connections. Constructed with a url, username, and password (loaded from environment variables, not hard-coded), it exposes a query() method that sends a Cypher string to the database and returns results as a list of Python dictionaries, handling connection pooling and authentication. It is the bridge between application code and the graph used throughout the course.

Neo4jVector

LangChain’s vector-store interface for Neo4j — it makes the graph look like a standard vector store, running Cypher similarity search internally. from_existing_graph connects to an already-built index by naming the index_name, the node_label (e.g. Chunk), the text property (text_node_properties), and the embedding property (embedding_node_property); calling .as_retriever() yields a retriever for a RAG chain. Currency: it now lives in the dedicated langchain-neo4j package (2024-11), not langchain_community.

NEXT Relationship

A directed relationship connecting sequential chunks within the same form section, forming a singly-linked list that preserves document order. Built by matching chunks whose chunkSeqId differ by one and that share formId and item — the section filter is mandatory, or the list bleeds across section boundaries. Created with MERGE for idempotency. NEXT is what makes sequential traversal and chunk-window retrieval possible; the same pattern models chat threads, audit trails, and event streams.
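
The pairing rule can be sketched in Python before writing it as Cypher. The chunk dictionaries mirror the metadata fields named in this glossary; `next_pairs` and the sample ids are hypothetical.

```python
from collections import defaultdict


def next_pairs(chunks):
    """Return (from_chunkId, to_chunkId) pairs for consecutive chunks per section."""
    sections = defaultdict(list)
    for chunk in chunks:
        # The section filter: group by form AND item, never by form alone.
        sections[(chunk["formId"], chunk["item"])].append(chunk)

    pairs = []
    for members in sections.values():
        ordered = sorted(members, key=lambda c: c["chunkSeqId"])
        for current, following in zip(ordered, ordered[1:]):
            if following["chunkSeqId"] == current["chunkSeqId"] + 1:
                pairs.append((current["chunkId"], following["chunkId"]))
    return pairs


chunks = [  # hypothetical chunks from one form, two sections
    {"chunkId": "f1-item1-0", "formId": "f1", "item": "item1", "chunkSeqId": 0},
    {"chunkId": "f1-item1-1", "formId": "f1", "item": "item1", "chunkSeqId": 1},
    {"chunkId": "f1-item1a-0", "formId": "f1", "item": "item1a", "chunkSeqId": 0},
    {"chunkId": "f1-item1a-1", "formId": "f1", "item": "item1a", "chunkSeqId": 1},
]
links = next_pairs(chunks)
```

No pair joins the last chunk of item1 to the first chunk of item1a; dropping `item` from the grouping key is what would let the list bleed across sections.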

Node

A data record representing an entity in a knowledge graph — written (...) in Cypher. It carries one or more labels (for grouping) and zero or more properties (key-value pairs). Equivalent to a graph-theory vertex, but “node” is preferred for data modelling. A Person node, a Company node, a Movie node.

OWNS_STOCK_IN Relationship

A directed relationship from a Manager to a Company representing a stock holding, with properties for share count, monetary value, and reporting quarter. One manager can hold multiple OWNS_STOCK_IN edges to the same company across different quarters, so the reporting period is part of the relationship’s identity (it is MERGE’d on reportCalendarOrQuarter). In the course data, 561 of these edges all point at NetApp — the investment side of the multi-hop chunk → form → company → manager traversal.

PART_OF Relationship

A directed relationship from a Chunk to its parent Form (Chunk -[:PART_OF]-> Form), establishing document membership. It lets you traverse up from any chunk back to its source filing — and from there onward to connected entities (companies, investors) in later modules. The general pattern (member -[:PART_OF]-> container) generalises to any document hierarchy: chunk → section → chapter → book.

point.distance()

A Neo4j function that returns the geodesic (great-circle) distance between two geospatial points, in metres. Used in a WHERE clause for radius queries — WHERE point.distance(a.location, b.location) < 10000 finds entities within 10 km — and divided by 1000 for kilometres. The unit is the classic trap: 10 km is 10000, not 10; 50 miles is 80467. Pair it with a point index, or the query degrades to an O(n) scan of every candidate node.
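
To build intuition for the units, here is a Python sketch of a great-circle distance in metres using the haversine formula. It is an approximation for illustration, assuming a spherical Earth with a mean radius of 6,371 km; Neo4j's own calculation may differ slightly, and the helper names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres; an assumption of this sketch


def haversine_metres(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def km_to_metres(km):
    return km * 1000


def miles_to_metres(miles):
    return miles * 1609.344


def within_radius(lat1, lon1, lat2, lon2, radius_m):
    """Radius check: the threshold must be given in metres."""
    return haversine_metres(lat1, lon1, lat2, lon2) < radius_m
```

Converting the radius with `km_to_metres` or `miles_to_metres` before comparing keeps the unit trap out of the query itself.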

Progressive Few-Shot Learning

The practice of adding few-shot examples incrementally, where each example unlocks a new query capability the LLM generalises from — a city-filter example enables filtering for any entity, a point.distance() example enables geospatial search, a full-text-plus-SECTION example enables document navigation. The guiding principle is diversity over quantity: two or three examples spanning different query shapes (filter, aggregate, traverse) outperform ten variations of one.

Property

A key-value pair stored on a node or a relationship (keys are strings; values are strings, numbers, booleans, or lists) — e.g. {name: "Andreas", born: 1975}. Crucially, relationships carry properties too: a since year on a WORKS_AT describes the connection itself, which is part of why a relationship is more than a graph-theory edge.

Property Graph

The data model Neo4j implements: nodes and relationships that both carry properties, with labels grouping nodes and types+direction on relationships carrying semantics. It contrasts with the RDF triple model (subject–predicate–object); the property-graph model is what makes a relationship a rich record rather than a bare link, and underlies every example in this guide.

Query Parameter

A value passed to a Cypher query via the params argument and referenced with a $ prefix (e.g. $openAiApiKey). It keeps secrets and user input out of the query string — no injection, no leakage into logs or the query cache — and lets Neo4j cache the execution plan across calls with different values. The one exception: DDL statements (CREATE INDEX, CREATE CONSTRAINT) can’t be parameterised, so sanitise those inputs manually.

Relationship

A directed, typed connection between two nodes — written -[:TYPE]-> in Cypher. It is itself a rich record: a start and end node, a type (ACTED_IN, OWNS_STOCK_IN), a direction, and optional properties. Preferred over the graph-theory term “edge” because it conveys richness beyond a bare link — type and direction are the semantics.

Retrieval-Augmented Generation

A system architecture that retrieves relevant context from an external store and injects it into the LLM prompt before generation, grounding the response in evidence and reducing hallucination. Basic RAG retrieves by vector similarity; graph-RAG adds traversal-based retrieval, so context can follow connections, not just semantic closeness.

RetrievalQA

LangChain’s question-answering chain that wires a retriever (vector search) to an LLM: the retriever finds relevant chunks, and the "stuff" chain type injects them into the prompt as context for the LLM to answer from. A custom PromptTemplate controls behaviour — crucially, the refusal instruction (“say you don’t know”) that curbs hallucination. Currency: RetrievalQA is deprecated since LangChain 0.1.17 (removal repeatedly deferred; still importable in the 1.x line); the current replacement is create_retrieval_chain with create_stuff_documents_chain. The connect → retrieve → generate concept is unchanged.

RETURN Clause

The Cypher clause that projects results: only the properties and aggregations it names are extracted and returned (RETURN p.name, count(m)). It is the final stage of execution, and returning whole nodes (RETURN p) versus specific properties (RETURN p.name) is a bandwidth/clarity choice — project just what the caller needs.

Schema-Optional

A knowledge graph’s property of allowing new node labels and relationship types without DDL migrations — flexible where a relational schema is fixed and requires ALTER TABLE. The caveat: optional, not free. Without naming standards and uniqueness constraints the graph becomes a tangled mess, so production knowledge graphs still need governance even without an enforced schema.

Score Threshold

A minimum similarity score below which results are discarded, so weak context never reaches the LLM (where it would invite hallucination). There is no universal value — calibrate against the corpus’s own score distribution: raise it for precision (customer-facing), lower it for recall (research). Because scores are relative, a threshold tuned on one corpus rarely transfers to another.
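
Score gating is a one-line filter; the sketch below also shows the out-of-scope case returning no context at all. The result shape (`score`, `text`), the function names, and the 0.8 value are hypothetical and would need calibrating per corpus.

```python
def gate_by_score(results, threshold):
    """Keep only results whose similarity score meets the threshold."""
    return [r for r in results if r["score"] >= threshold]


def build_context(results, threshold):
    """Join the surviving chunks, or return None when nothing is strong enough."""
    kept = gate_by_score(results, threshold)
    if not kept:
        return None  # the caller can answer "I don't know" instead of guessing
    return "\n\n".join(r["text"] for r in kept)


results = [  # hypothetical vector-search output
    {"score": 0.91, "text": "strong match"},
    {"score": 0.62, "text": "weak match"},
]
context = build_context(results, threshold=0.8)
```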

SEC Form 10-K

The annual report public companies file with the U.S. Securities and Exchange Commission — long, standardised business text that makes excellent knowledge-graph source data. The sections that matter for construction are Item 1 (business description), Item 1A (risk factors), Item 7 (management discussion & analysis), and Item 7A (market-risk disclosures). Raw filings are XML; pre-processing extracts these sections plus the CIK and CUSIP identifiers into JSON for ingestion.

SEC Form 13

A filing made quarterly by institutional investment-management firms reporting their holdings of public-company stock. Each record carries the manager’s identity (name, CIK, address), the company’s identity (via CUSIP), and the position (share count, monetary value, reporting quarter). In this guide, Form 13 is the second dataset: connecting it to the existing Form 10-K graph turns documents into a graph of who invests in whom, answering questions vector search alone cannot.

SECTION Relationship

A directed relationship from a Form to the first chunk (sequence id 0) of each section, carrying an f10kItem property that names the section (e.g. "item1"). It gives the form indexed entry points into the document: a query can jump straight to where a section starts, then follow NEXT through its content — hierarchical navigation without scanning the whole linked list.

Similarity Score

The number vector search returns for each result — for cosine, 0 (unrelated) to 1 (identical direction) — used to rank results and to filter weak ones. Crucially, scores are relative to the embedding model and corpus: 0.85 on movie taglines is not the same quality as 0.85 on SEC filings. Read the distribution (a clear winner vs a flat band) rather than trusting an absolute value.

Text Chunking

Splitting a long document into smaller, overlapping segments so each fits an embedding model and maps to one retrievable unit. Chunk size sets granularity (too large → generic embeddings, weak precision; too small → fragments that lose context); overlap carries context across boundaries so a fact split between two chunks survives in at least one. The number of chunks is roughly $\lceil (L-O)/(S-O) \rceil$ for length $L$, size $S$, overlap $O$; the effective advance per chunk is size − overlap. The size is a tuning parameter; the splitting strategy (fixed-size vs structure-aware/recursive) is a design decision driven by the document’s shape.
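
A fixed-size splitter in Python, with the chunk-count formula alongside it so the two can be compared. This is a character-based sketch with hypothetical function names; structure-aware or recursive splitters follow different rules.

```python
import math


def expected_chunk_count(length, size, overlap):
    """ceil((L - O) / (S - O)), with short texts counted as a single chunk."""
    if length == 0:
        return 0
    if length <= size:
        return 1
    return math.ceil((length - overlap) / (size - overlap))


def split_fixed(text, size, overlap):
    """Split text into chunks of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap  # the effective advance per chunk
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this chunk reached the end of the text
            break
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1000))  # 1000 characters of sample text
chunks = split_fixed(text, size=200, overlap=50)
```

With $L = 1000$, $S = 200$, $O = 50$, the formula gives $\lceil 950/150 \rceil = 7$, and the splitter produces the same count; the last 50 characters of each chunk reappear at the start of the next.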

Uniqueness Constraint

A Neo4j constraint — CREATE CONSTRAINT ... FOR (p:Post) REQUIRE p.postId IS UNIQUE — that guarantees at most one node per key value and creates an implicit index in the process. It underpins idempotent, MERGE-based ingestion: correctness (no duplicates) and performance (the index speeds the MERGE existence check). Part of the production idempotent-ingest pattern.

Variable Naming (Cypher)

The practice of naming Cypher pattern variables for what they hold (tom, coActor, m for Movie) rather than n1/n2. It has zero runtime cost (the engine ignores the names) but a large readability payoff: a multi-hop pattern with meaningful names is self-documenting, while cryptic names force the next reader to decode it. A maintainability investment, part of “design readable queries” (KGR-2.5).

Variable-Length Path

A Cypher pattern that matches a range of relationship hops rather than a fixed number: *0..1 matches zero or one hop, *1..3 matches one to three, *0..2 zero to two. It is the standard fix for boundary conditions where a fixed-length pattern returns nothing — e.g. a chunk window at the first or last chunk of a linked list, where the missing side simply matches zero hops. Pairing it with ORDER BY length(window) DESC LIMIT 1 keeps the longest window available at any position.

Vector Index

A Neo4j index for approximate nearest-neighbour search over a node property that holds embeddings, created with a fixed dimensionality and similarity function (CREATE VECTOR INDEX … OPTIONS {indexConfig: {…}}). Two hard requirements: the declared dimension must match the embedding model’s output exactly, and the index must reach ONLINE status (not POPULATING) before searches return complete results.

WHERE Clause

Filters matched patterns by conditions beyond inline equality — comparisons (>, <, =), boolean logic (AND/OR/NOT), string matching, and regular expressions. Rule of thumb: use inline {...} property matching for simple equality, and WHERE for ranges or complex logic. MATCH (m:Movie) WHERE m.released > 2000 is the canonical example.