Part 1: A deep dive into multi-database architecture, AI-powered entity extraction, and intelligent document processing
Introduction
Traditional RAG systems rely solely on vector embeddings for semantic search, but this approach has fundamental limitations. As I discussed in my earlier post “Beyond Vector Search: How GraphRAG Enables Smarter AI Responses”, vector-based systems struggle with entity-focused queries like “What did Microsoft say about AI safety?” or “Show me all papers by authors who work on neural architecture search” because they lack understanding of relationships between concepts, organizations, and people.
GraphRAG addresses these limitations by combining knowledge graphs with vector embeddings, creating a document intelligence system that understands both semantic similarity and structural relationships.
This is Part 1 of a multi-part blog documenting the development of a GraphRAG document repository server built to support AI agents or a chat application that need to retrieve additional information from the repository corpus to perform actions or generate responses. Users can upload a PDF file or submit a web page by URL to create a document, which then undergoes a series of automated steps in the ingestion pipeline: token-based chunking, named entity and entity relationship extraction, and embedding generation. The resulting objects are stored in a multi-database architecture that provides both semantic and graph-based search capabilities.
Part 1 focuses on the architectural decisions made and the approaches taken to develop the document ingestion pipeline and the AI-powered extraction of named entities and relationships, which together form the foundation layer of the GraphRAG system back-end.
System Architecture Overview
The most critical architectural decision was using three specialized databases instead of forcing everything into a single system. Each database is optimized for a specific type of data and query pattern.
High-Level Architecture

Multi-Database Strategy
Neo4j (Knowledge Graph)
- Purpose: Store documents, text chunks, entities, and authors as graph nodes
- Strengths: Relationship traversal, pattern matching, entity connections
- Query Types: “Find documents by authors who also wrote about X”, “Show entity relationships”
- Why: Graph databases excel at relationship-heavy queries that would require complex JOINs in SQL
ChromaDB (Vector Database)
- Purpose: Store high-dimensional embeddings (1024 dimensions) for semantic search
- Strengths: Fast approximate nearest neighbor search, semantic similarity
- Query Types: “Find conceptually similar content”, semantic search
- Why: Specialized vector databases provide orders of magnitude better performance than storing vectors in general-purpose databases
MinIO (Object Storage)
- Purpose: Store original files, extracted metadata, processing logs
- Strengths: Scalable file storage, S3-compatible API, presigned URLs
- Query Types: Direct file access, metadata retrieval
- Why: Object storage is purpose-built for files, with features like versioning and presigned URLs
Why Not a Single Database?
PostgreSQL with pgvector: Good for smaller projects but graph traversal becomes complex and slow at scale. SQL JOINs can’t match Neo4j’s native graph traversal performance.
Neo4j alone: Can store vectors as node properties, but vector similarity search is significantly slower than specialized vector databases like ChromaDB or Pinecone.
Vector database alone: Excellent for semantic search but can’t efficiently answer relationship queries like “documents by authors who cite this work” or build entity networks.
The three-database approach provides the best tool for each job. While it adds operational complexity, the performance gains and query flexibility justify the investment.
Design Philosophy
Separation of Concerns: Each service layer has a single, well-defined responsibility. Processing services orchestrate workflows, extraction services handle content, AI services integrate external APIs, and graph services manage database operations.
Service-Oriented Architecture: Services communicate through well-defined interfaces, making it easy to swap implementations (e.g., replace Claude with another LLM, swap ChromaDB for Pinecone).
Async Operations: Document processing involves I/O-heavy operations (API calls, database writes). Asynchronous processing enables concurrent operations and better resource utilization.
Dependency Injection: Database clients and services are injected as dependencies, facilitating testing and enabling different configurations for development vs. production.
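As a concrete illustration of the dependency-injection approach, here is a minimal FastAPI sketch; the connection settings are illustrative and would normally come from configuration, and the endpoint is only a stand-in to show how clients are injected.

from functools import lru_cache

import chromadb
from fastapi import Depends, FastAPI
from minio import Minio
from neo4j import AsyncGraphDatabase

app = FastAPI()

@lru_cache
def get_neo4j_driver():
    return AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

@lru_cache
def get_chroma_client():
    return chromadb.HttpClient(host="localhost", port=8000)

@lru_cache
def get_minio_client():
    return Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

@app.get("/api/health")
async def health(graph=Depends(get_neo4j_driver), vectors=Depends(get_chroma_client),
                 files=Depends(get_minio_client)):
    # The injected clients can be swapped out in tests or per environment.
    return {"status": "ok"}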
Current Technology Stack
REST API Framework – FastAPI, a modern, high-performance web framework for building APIs with Python based on standard Python type hints.
- Modern async web framework with automatic API documentation
- Native support for async / await patterns
- Type safety with Pydantic models (Pydantic is a data validation library for Python)
Graph Database – Neo4j Community Edition, a high-performance graph database
- Industry-leading graph database with Cypher query language
- ACID transactions for data consistency
- Flexible schema evolution
Vector Database – Chroma, an open-source vector database with vector, full-text, regex, and metadata search support
- Lightweight wrappers around popular embedding providers like OpenAI or Cohere
- Object storage as a shared layer
- Query nodes, which resolve user queries
- Compactor nodes, which asynchronously build indexes and persist them to object storage
- Good performance for development and medium-scale production
Object Storage – MinIO, a high-performance S3-compatible object store
- MinIO AIStor contains every component required to run large scale data infrastructure
- Self-hosted for development
- Allows easy migration to cloud S3
AI Services:
- Anthropic Claude 3.5 Haiku: Entity extraction with structured outputs (~100ms per chunk)
- VoyageAI voyage-3: High-quality 1024-dimensional embeddings
- VoyageAI rerank-2: Result re-ranking
Processing Libraries:
- PyMuPDF: Fast PDF text extraction
- BeautifulSoup4: HTML parsing and web scraping
- tiktoken: Token-aware text chunking
Document Processing Pipeline
At present, you can add a new document to the repository by uploading a PDF file to MinIO object storage or by submitting a web page URL. The document processing pipeline transforms the content of raw files and web pages into a rich knowledge graph with semantic embeddings. The pipeline consists of six stages:
- File upload / web page content scraping
- Plain text extraction from PDF file or page content
- Token-aware text chunking (max chunk size is 1000 tokens with 15% overlap)
- Named entity extraction from documents (Author, Person, Organization, Location, Technology, Concept, Product and Event)
- Knowledge Graph update (add a new Document node and its Chunk nodes; add Author and Entity nodes; finally, add edges to define relevant relationships between new and existing nodes).
- Generate embedding vectors for the plain text stored for each Chunk node and store the embeddings in the vector DB (Chroma)
Stage 1: Add New Document to the Repository
Currently, GraphRAG document repository supports two options for adding new documents:
Upload a PDF file (multipart file upload):
- Calculate MD5 hash for de-duplication and check if document already exists
- Store new binary file in MinIO object storage
- Generate unique document ID
Submit a web page URL:
- Validate URL format and fetch content with appropriate headers
- Archive HTML content and store it in MinIO object storage
- Generate unique document ID
Key Considerations for adopting this approach:
- De-duplication: Hash-based detection prevents processing the same document twice
- File Validation: Size limits (e.g., 50MB) and format checks prevent malicious content
- Unique IDs: UUIDs ensure globally unique document identifiers
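A minimal sketch of the hash-based de-duplication and ID assignment described above; the in-memory known_hashes mapping is a stand-in for the real lookup against stored document metadata.

import hashlib
import uuid

def file_fingerprint(data: bytes) -> str:
    """MD5 digest used only for de-duplication, not for security."""
    return hashlib.md5(data).hexdigest()

def register_upload(data: bytes, known_hashes: dict[str, str]) -> tuple[str, bool]:
    """Return (document_id, is_duplicate); known_hashes maps file hash to doc_id."""
    digest = file_fingerprint(data)
    if digest in known_hashes:
        return known_hashes[digest], True
    doc_id = str(uuid.uuid4())
    known_hashes[digest] = doc_id
    return doc_id, False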
Stage 2: Plain Text Extraction from Documents
The next step after a new document is added to the repository is plain text extraction, which enables named entity extraction and chunking for embedding generation.
Text extraction from a PDF file (PyMuPDF – a Python library for working with PDF files):
- Extract text from all pages preserving document structure where possible
- Extract PDF metadata (such as author, title, creation date, page count)
- Handle various PDF versions and encodings
- Error handling for corrupted or password-protected files
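For illustration, here is a minimal text and metadata extraction sketch with PyMuPDF; real error handling is reduced here to a password check.

import fitz  # PyMuPDF

def extract_pdf_text(pdf_bytes: bytes) -> tuple[str, dict]:
    """Extract plain text page by page plus the PDF's built-in metadata."""
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        if doc.needs_pass:
            raise ValueError("Password-protected PDF")
        pages = [page.get_text() for page in doc]
        metadata = dict(doc.metadata or {})
        metadata["page_count"] = doc.page_count
    return "\n".join(pages), metadata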
Key Features of PyMuPDF library:
- Fast processing of large PDF files (3-5x faster than alternatives)
- High accuracy on complex document layouts
- Support for rich metadata extraction
Text extraction from a Web Page (web scraping using BeautifulSoup4 – a Python library for parsing HTML and XML documents):
- Parse HTML structure and remove boilerplate (scripts, styles, navigation, ads) from it
- Extract main content area (article, main tag, or heuristics)
- Parse meta tags (author, description, keywords) added to a web page
- Clean and normalize extracted text
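A simplified sketch of this kind of extraction with BeautifulSoup4; the boilerplate tag list and the main-content heuristic are illustrative rather than the exact rules used by the pipeline.

from bs4 import BeautifulSoup

def extract_page_text(html: str) -> tuple[str, dict]:
    """Strip boilerplate, pick a main-content area, and read common meta tags."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "aside"]):
        tag.decompose()
    main = soup.find("article") or soup.find("main") or soup.body or soup
    text = " ".join(main.get_text(separator=" ").split())
    meta = {}
    for name in ("author", "description", "keywords"):
        node = soup.find("meta", attrs={"name": name})
        if node and node.get("content"):
            meta[name] = node["content"]
    return text, meta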
Key Features of BeautifulSoup4 library:
- Intelligent content detection
- Metadata extraction from HTML meta tags
- Handles various HTML structures, including ‘messy’ HTML pages
Common challenges remain:
- JavaScript-heavy sites may require headless browser
- Some sites have anti-scraping measures
- Content structure varies widely across sites
Stage 3: Extracted Text Chunking
Text extracted from a document must be split into manageable chunks for both LLM processing and embedding generation. The chunking strategy adopted significantly impacts system performance and accuracy.
In the GraphRAG repository we use a token-based chunking strategy:
- Use a BPE (Byte Pair Encoding) tokenizer to convert text into tokens, then split the text into chunks based on token count (not character count) for precise control over chunk size
- Default chunk size is 1024 tokens (~700-800 words)
- Chunks are created with a 15% chunk size overlap
- Semantic boundaries are preserved when splitting wherever possible
Why Token-Based Chunking?
- Data Type Compatibility: Embeddings API and LLM work with tokens, not characters
- Consistent Size: Character-based chunking can produce variable token counts per chunk
- Semantic Preservation: Token boundaries often align with word boundaries
Overlap Rationale: 15% overlap ensures context isn’t lost at chunk boundaries. A sentence split between chunks will appear complete in at least one chunk.
Chunk ID Generation: Generate unique ID for a chunk using: MD5 hash (doc_id + text + chunk_index + offset). Including the document ID is critical as it prevents collisions when multiple documents contain similar content.
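A minimal sketch of token-based chunking with overlap and the chunk ID scheme described above; the cl100k_base encoding and the plain sliding-window split are assumptions, since the real pipeline also tries to respect semantic boundaries.

import hashlib
import tiktoken

def chunk_text(doc_id: str, text: str, max_tokens: int = 1000, overlap: float = 0.15):
    """Split text into overlapping token windows and return (chunk_id, chunk_text) pairs."""
    enc = tiktoken.get_encoding("cl100k_base")  # BPE tokenizer; encoding choice is an assumption
    tokens = enc.encode(text)
    step = max(1, int(max_tokens * (1 - overlap)))
    chunks = []
    for index, start in enumerate(range(0, len(tokens), step)):
        window = tokens[start:start + max_tokens]
        chunk = enc.decode(window)
        # MD5 over doc_id + text + chunk_index + offset, as described above.
        chunk_id = hashlib.md5(f"{doc_id}:{chunk}:{index}:{start}".encode()).hexdigest()
        chunks.append((chunk_id, chunk))
    return chunks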
Stage 4: Entity Extraction (Multi-Pass Pipeline)
Entity extraction is the most sophisticated part of the document ingestion pipeline. The challenge is to accurately identify document authors and the organizations they worked for at the time the article was written, separately from people or organizations merely mentioned in the document text. The pipeline logic is based on where in a document authors and their organizations are typically mentioned:
- In a scientific article (the PDF files the solution targets), both author names and their organizations are typically found in the article header, i.e., at the very beginning of the first chunk of the document
- In a blog or news publication (the web pages the solution targets), author names and possibly their organizations can be found either in the article header or in the footer, i.e., at the bottom of the last chunk of the document.
To address this challenge, the following multi-pass pipeline was implemented for the named entity extraction from a document added to the repository:
- PASS 1: Use Claude AI to extract entities of Person, Organization, Location, Technology, Concept, Product and Event types from all chunks in a document
- PASS 2: Use Claude AI to extract Author entities and their Organizations:
- PDF file: Extract authors and their organizations from the first 800 characters of the first chunk
- Web page: Extract authors and their organizations from the first 500 characters of the first chunk. If not found, try the last 500 characters of the last chunk.
- Fallback to document metadata if PASS 2 extraction fails.
- PASS 3: Use Claude AI to extract semantic relationships between entities, e.g., Author A collaborated with Person B, or Organization C uses Technology T.
- PASS 4: Use statistical proximity analysis to define a ‘mentioned with’ relationship between entities that frequently appear together in a document.
After all 4 passes complete successfully, the corresponding nodes and edges (relationships) are created in the Knowledge Graph (Neo4j). If any pass fails due to an error, entity and relationship creation is rolled back to prevent semantic inconsistencies in the graph.
PASS 1: General Entity Extraction
All text chunks in a document are processed by sending them in parallel (3-5 chunks at a time) to Claude AI with a custom prompt tailored for named entity extraction. The prompt contains detailed instructions explaining to Claude which entities should be extracted from the text. To help with identifying entities, it includes several examples of text with the entities targeted for extraction. The prompt also describes the format of the expected response: a JSON structure with entity name, type, confidence score (0.0 – 1.0), and description. A minimal sketch of such a call appears after the entity type list below.
The following Entity Types are currently supported for extraction:
- Person: People mentioned in the chunk text
- Organization: Companies, institutions, or universities
- Location: Cities, countries, regions
- Technology: Frameworks, programming languages, algorithms
- Concept: Theoretical ideas, methodologies, approaches
- Product: Software products, platforms, services
- Event: Conferences, releases, historical events
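Here is a stripped-down sketch of such a per-chunk call using the Anthropic Python SDK. The prompt is heavily abbreviated, and the model alias and max_tokens value are assumptions, not the exact settings used by the pipeline.

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

ENTITY_PROMPT = """Extract named entities from the text below.
Return only a JSON array of objects with keys: name, type, confidence, description.
Allowed types: Person, Organization, Location, Technology, Concept, Product, Event.

Text:
{chunk}
"""

def extract_entities_from_chunk(chunk: str) -> list[dict]:
    """Send one chunk to Claude and parse the JSON array it returns."""
    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # model alias is an assumption
        max_tokens=1024,
        messages=[{"role": "user", "content": ENTITY_PROMPT.format(chunk=chunk)}],
    )
    # Assumes the model follows the instruction to return bare JSON.
    return json.loads(message.content[0].text)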
De-duplication: after entity extraction from all chunks is completed, the found entities are de-duplicated by normalizing their names (e.g., “GPT-4”, “gpt-4”, “GPT 4” all become “gpt-4”). Only the highest-confidence instance of each entity is kept.
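A minimal sketch of this de-duplication; the normalization rule (lowercase, collapse spaces and underscores to hyphens) is an assumption chosen to match the example above.

import re

def normalize_name(name: str) -> str:
    """'GPT-4', 'gpt-4', 'GPT 4' all normalize to 'gpt-4'."""
    return re.sub(r"[\s_]+", "-", name.strip().lower())

def dedupe_entities(entities: list[dict]) -> list[dict]:
    """Keep only the highest-confidence instance of each normalized entity name."""
    best: dict[str, dict] = {}
    for entity in entities:
        key = normalize_name(entity["name"])
        if key not in best or entity["confidence"] > best[key]["confidence"]:
            best[key] = entity
    return list(best.values())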
PASS 2: Specialized Author & Their Organization Extraction
Author extraction has a multi-step pipeline. Step 1: the first chunk of a document is sent to Claude AI with a custom prompt tailored for extracting Authors (i.e., the people who wrote the document) and the Organizations they worked at when the article was written. Apart from entity descriptions and extraction instructions, the prompt contains several text snippets showing how authors and their organizations can appear in different documents. The prompt uses the same response format as for other entities: a JSON structure with entity name, type, confidence score (0.0 – 1.0), and description.
To help Claude locate authors and the organizations they worked for, only the first 800 characters (configurable) are sent for processing.
Step 2: if Step 1 didn’t find authors for a blog or article published on a web page, the pipeline tries extracting authors and their organizations from the last 500 characters of the last chunk in the document. The reason is that web publications often place authors right after the blog or article content.
Fallback to Document Metadata
If neither Step 1 nor Step 2 returned any authors, the pipeline makes a “last resort” attempt to extract authors and their organizations from the PDF file metadata or web page meta tags (OG, Twitter, etc.).
The following confidence scores are assigned to extracted Author and Organization entities depending on the extraction step:
- Step 1 or 2 (text extraction): 0.90-0.99 (high confidence)
- Step 3 (metadata extraction): 0.70-0.85 (medium confidence)
- Fallback (“Unknown” tag): 0.50 (low confidence)
Why Create an “Unknown” Author? Every document must have an author for relationship consistency.
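A condensed sketch of the PASS 2 flow described above; the two callables stand in for the real Claude prompt call and the metadata parser (both hypothetical here), and the 800/500-character windows follow the description above.

from typing import Callable

def extract_authors(
    chunks: list[str],
    source_type: str,
    metadata: dict,
    ask_claude: Callable[[str], list[dict]],      # prompt-based extraction (hypothetical)
    from_metadata: Callable[[dict], list[dict]],  # metadata fallback (hypothetical)
) -> list[dict]:
    """Return author entities with confidence reflecting how they were found."""
    window = 800 if source_type == "pdf" else 500
    authors = ask_claude(chunks[0][:window])                 # Step 1: document header
    if not authors and source_type == "webpage":
        authors = ask_claude(chunks[-1][-500:])              # Step 2: article footer
    if not authors:
        authors = from_metadata(metadata)                    # last resort: metadata
        for author in authors:
            author["confidence"] = min(author.get("confidence", 0.85), 0.85)
    if not authors:
        authors = [{"name": "Unknown", "type": "Author", "confidence": 0.5}]
    return authors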
PASS 3: Semantic Relationships
After extraction of the supported entities and authors is completed, the pipeline proceeds to extract semantic relationships from the document chunks. Relationships are extracted via a Claude AI request with a highly structured, constraint-based prompt.
The following Semantic Relationship types are extracted from a document:
RELATIONSHIP_TYPES = {
    "WORKED_AT": {
        "description": "Person/Author worked at Organization",
        "source_types": ["Person", "Author"],
        "target_types": ["Organization"],
        "symmetric": False
    },
    "RELATED_TO": {
        "description": "General semantic relationship between entities",
        "source_types": ["*"],  # Any entity type
        "target_types": ["*"],
        "symmetric": True,
        "properties": ["context", "relationship_type", "confidence"]
    },
    "LOCATED_IN": {
        "description": "Entity is physically located in a Location",
        "source_types": ["Person", "Organization", "Event", "Product"],
        "target_types": ["Location"],
        "symmetric": False,
        "properties": ["start_date", "end_date", "context"]
    },
    "COLLABORATED_WITH": {
        "description": "Collaboration between authors or persons",
        "source_types": ["Author", "Person"],
        "target_types": ["Author", "Person"],
        "symmetric": True,
        "properties": ["project", "document_id", "confidence"]
    },
    "PART_OF": {
        "description": "Hierarchical membership or composition",
        "source_types": ["Person", "Organization", "Location", "Concept"],
        "target_types": ["Organization", "Location", "Concept"],
        "symmetric": False,
        "properties": ["role", "context"]
    },
    "CREATED": {
        "description": "Entity created another entity (product, technology)",
        "source_types": ["Person", "Organization"],
        "target_types": ["Product", "Technology", "Concept"],
        "symmetric": False,
        "properties": ["date", "context"]
    },
    "USES": {
        "description": "Entity uses technology or product",
        "source_types": ["Person", "Organization"],
        "target_types": ["Technology", "Product"],
        "symmetric": False
    },
    "IMPLEMENTS": {
        "description": "Technology implements concept",
        "source_types": ["Technology", "Product"],
        "target_types": ["Concept"],
        "symmetric": False
    },
    "PARTICIPATED_IN": {
        "description": "Entity participated in an event",
        "source_types": ["Person", "Organization", "Author"],
        "target_types": ["Event"],
        "symmetric": False,
        "properties": ["role", "context"]
    },
    "OCCURRED_IN": {
        "description": "Event occurred in location",
        "source_types": ["Event"],
        "target_types": ["Location"],
        "symmetric": False,
        "properties": ["date", "context"]
    },
    "AFFILIATED_WITH": {
        "description": "Professional affiliation",
        "source_types": ["Person", "Author"],
        "target_types": ["Organization"],
        "symmetric": False,
        "properties": ["role", "start_date", "end_date"]
    },
    "MENTIONED_WITH": {
        "description": "Co-occurrence in same context (chunk)",
        "source_types": ["*"],
        "target_types": ["*"],
        "symmetric": True,
        "properties": ["chunk_id", "frequency", "confidence"]
    },
    "FOUNDED": {
        "description": "Person founded organization",
        "source_types": ["Person", "Author"],
        "target_types": ["Organization"],
        "symmetric": False,
        "properties": ["date", "context"]
    }
}
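One way these constraints could be enforced before a relationship is written to the graph is a small validation helper that reads the dictionary above; this is a sketch, not the pipeline’s actual code.

def is_valid_relationship(rel_type: str, source_type: str, target_type: str) -> bool:
    """Check an extracted relationship against the type constraints defined above."""
    spec = RELATIONSHIP_TYPES.get(rel_type)
    if spec is None:
        return False
    source_ok = "*" in spec["source_types"] or source_type in spec["source_types"]
    target_ok = "*" in spec["target_types"] or target_type in spec["target_types"]
    # For symmetric types the reverse direction is equally acceptable.
    if not (source_ok and target_ok) and spec.get("symmetric"):
        return (("*" in spec["source_types"] or target_type in spec["source_types"]) and
                ("*" in spec["target_types"] or source_type in spec["target_types"]))
    return source_ok and target_ok

# An Author working at an Organization is allowed ...
assert is_valid_relationship("WORKED_AT", "Author", "Organization")
# ... but an Organization cannot be the source of WORKED_AT.
assert not is_valid_relationship("WORKED_AT", "Organization", "Person")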
PASS 4: Co-occurrence Relationships
In PASS 4, entities are first grouped by chunk to identify which entities appear in each text chunk, and pairwise relationships are created for every pair of entities that appear in the same chunk. A confidence score is calculated for each pair based on the individual entity confidence scores. For pairs with a high confidence score, a MENTIONED_WITH relationship between the corresponding nodes is defined in Neo4j.
The co-occurrence relationship creates a layer in the knowledge graph that can be used for graph traversal, clustering, or as input for more advanced relationship inference.
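A minimal sketch of this pass, assuming each extracted entity record carries the chunk_id it was found in. Using the lower of the two entity scores as the pair confidence is one simple choice for illustration, not necessarily the exact formula used.

from collections import defaultdict
from itertools import combinations

def co_occurrence_pairs(entities: list[dict], min_confidence: float = 0.8) -> list[dict]:
    """Group entities by chunk and emit MENTIONED_WITH candidates per chunk."""
    by_chunk: dict[str, list[dict]] = defaultdict(list)
    for entity in entities:
        by_chunk[entity["chunk_id"]].append(entity)
    pairs = []
    for chunk_id, members in by_chunk.items():
        for a, b in combinations(members, 2):
            confidence = min(a["confidence"], b["confidence"])
            if confidence >= min_confidence:
                pairs.append({"source": a["name"], "target": b["name"],
                              "chunk_id": chunk_id, "confidence": confidence})
    return pairs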
Multi-Pass System Benefits
Accuracy: Separating author and organization extraction from general entity extraction dramatically improves precision. Authors are identified based on document structure (headers or footers) rather than being confused with people mentioned in the document.
Confidence Tracking: Different extraction methods have different confidence levels, enabling confidence-based filtering and ranking.
Stage 5: Knowledge Graph Update
Step 1: Create Document and Chunk nodes
For each document added to the repository, a corresponding Document node is added to the Knowledge Graph (Neo4j). The document title, source type (PDF or webpage), list of author names, etc. are stored as node properties.
For each chunk generated for the document, a Chunk node with text content and metadata is added to the Graph. The following relationships are defined between Document node and Chunk nodes:
- (Document)–[:CONTAINS]–>(Chunk 1), (Chunk 2), …, (Last Chunk)
- (Chunk 1)–[:NEXT_CHUNK]–>(Chunk 2)–[:NEXT_CHUNK]–> … –>(Last Chunk)
Step 2: Create Entity Nodes
Create or merge Entity nodes for each entity type (Person, Organization, Location, Technology, Concept, Product, and Event) extracted from the document. The following relationships are defined between Document, Chunk and Entity nodes:
- (Entity)–[:APPEARS_IN]–>(Document)
- (Entity)–[:MENTIONED_IN]–>(Chunk) for every chunk in which it is found.
Step 3: Create Author Nodes
Create or merge the Author nodes and Organization Entity nodes extracted from the document in PASS 2 of the extraction pipeline. The following relationships are defined between the nodes:
- (Document)–[:AUTHORED_BY]–>(Author)
- (Author)–[:WORKED_AT]–>(Organization)
- (Author)–[:APPEARS_IN]–>(Document)
- (Author)–[:MENTIONED_IN]–>(Chunk)
Step 4: Define Semantic Relationships
For Authors and Entities of the 7 supported types extracted from the document, create or merge the discovered semantic relationships, following the relationship rules described above:
- (Entity) -[:RELATED_TO]-> (Entity)
- (Entity) -[:COLLABORATED_WITH]-> (Entity)
- (Entity) -[:LOCATED_IN]-> (Entity)
- (Entity) -[:CREATED]-> (Entity)
- (Entity) -[:USES]-> (Entity)
- (Entity) -[:PART_OF]-> (Entity)
- (Entity) -[:AFFILIATED_WITH]-> (Entity)
- (Entity) -[:WORKED_AT]-> (Entity)
- (Entity) -[:COMPETED_WITH]-> (Entity)
- (Entity) -[:ACQUIRED]-> (Entity)
- (Entity) -[:FUNDED]-> (Entity)
Step 5: Create Entity Co-occurrence Relationships
For Authors and Entities of the 7 supported types extracted from the document, create or merge the co-occurrence relationships found with high confidence in PASS 4:
- (Entity) -[:MENTIONED_WITH]-> (Entity)
Note: Nodes and relationships are stored in Neo4j using MERGE (not CREATE) to ensure idempotency and allow reprocessing documents without constraint violations.
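To illustrate the idempotent MERGE-based writes, here is a small sketch using the official Neo4j Python driver; the node properties, Cypher statement, and connection settings are simplified stand-ins for the real graph service.

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_DOCUMENT_AND_CHUNK = """
MERGE (d:Document {id: $doc_id})
  SET d.title = $title
MERGE (c:Chunk {id: $chunk_id})
  SET c.text = $text, c.chunk_index = $chunk_index
MERGE (d)-[:CONTAINS]->(c)
"""

def store_chunk(tx, **params):
    # Re-running this for the same IDs updates properties instead of violating constraints.
    tx.run(MERGE_DOCUMENT_AND_CHUNK, **params)

with driver.session() as session:
    session.execute_write(store_chunk, doc_id="doc-1", title="Example document",
                          chunk_id="doc-1:0", text="First chunk text", chunk_index=0)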
Below is an example of the Document, Chunk, Author and Entity nodes and edges/relationships created in Neo4j for the article “Anthropic scientists hacked Claude’s brain — and it noticed. Here’s why that’s huge” published by VentureBeat:

Stage 6: Generate Embeddings
At this pipeline stage, embedding vectors are generated for the plain text of each chunk stored in Neo4j for the submitted document and then stored in the Chroma vector database; a short code sketch follows the storage format below.
- Step 1: retrieve chunks with meaningful content (i.e., chunks with a non-zero token count) from Neo4j for processing
- Step 2: check whether embeddings for a chunk already exist in Chroma DB using a SHA-256 hash of the plain text, and skip unchanged chunks
- Step 3: process chunks in batches for efficiency, using Voyage AI (“voyage-3” model) to generate embedding vectors with 1024 dimensions
- Step 4: store embeddings in a Chroma DB collection in the following format:
{
    "ids": ["doc_id:chunk_index"],        # Unique identifiers
    "embeddings": [[1024 float values]],  # voyage-3 vectors
    "metadatas": [{
        "doc_id": "uuid",
        "chunk_index": int,
        "hash": "sha256_hash",            # For change detection
        "token_count": int,
        "created_at": "timestamp"
    }],
    "documents": ["chunk text content"]   # Full text for retrieval
}
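As noted in the step list above, here is a minimal sketch of Stage 6 using the VoyageAI and Chroma Python clients. The collection name and metadata fields mirror the format shown; connection settings are illustrative, and batching plus the skip-unchanged check are reduced to computing the hash.

import hashlib

import chromadb
import voyageai

chroma = chromadb.HttpClient(host="localhost", port=8000)  # illustrative endpoint
collection = chroma.get_or_create_collection("document_chunks")
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed_and_store(doc_id: str, chunks: list[str]) -> None:
    """Embed chunk texts with voyage-3 and upsert them into the Chroma collection."""
    result = vo.embed(chunks, model="voyage-3", input_type="document")
    collection.upsert(
        ids=[f"{doc_id}:{index}" for index in range(len(chunks))],
        embeddings=result.embeddings,
        documents=chunks,
        metadatas=[{"doc_id": doc_id,
                    "chunk_index": index,
                    "hash": hashlib.sha256(text.encode()).hexdigest()}
                   for index, text in enumerate(chunks)],
    )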
GraphRAG Server REST API Design
The GraphRAG document repository server exposes three categories of APIs: health monitoring, document ingestion & management, and search.
Health & System Monitoring API
- GET /api/health – Basic health check
- GET /api/status – Comprehensive system status with database statistics
Documents Ingestion & Management API
Core APIs
- POST /api/documents/upload – upload a PDF file to the document repository
Sample API response:
{
"success": true,
"document": {
"id": "5994e9a3-169d-4db0-8f2a-6a2bba124498",
"filename": "2210.03629v3.pdf",
"object_name": "5994e9a3-169d-4db0-8f2a-6a2bba124498/2210.03629v3.pdf",
"hash": "f285b0971ae4a790e402fb93966bed3adde2cf0a04977d08b2b40d6ab0cace69",
"size": 633805,
"content_type": "application/pdf",
"source_type": "pdf",
"presigned_url": "http://localhost:9000/documents/5994e9a3-169d-4db0-8f2a-6a2bba124498/2210.03629v3.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20251116%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251116T010358Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=49d1ebf8c112da411b8e0c5511b894336f9b7df10af6be65fc56d98ccf044390",
"metadata": {
"doc_id": "5994e9a3-169d-4db0-8f2a-6a2bba124498",
"original_filename": "2210.03629v3.pdf",
"source_type": "pdf",
"file_hash": "f285b0971ae4a790e402fb93966bed3adde2cf0a04977d08b2b40d6ab0cace69",
"upload_timestamp": "2025-11-16T01:03:58.598796",
"processing": {
"status": "completed",
"chunks_created": 39,
"processing_time": 1.276457,
"text_statistics": {
"page_count": 33,
"total_characters": 110319,
"total_words": 17108,
"pages_with_text": 33
},
"graph_storage": {
"status": "completed",
"document_stored": true,
"chunks_stored": true,
"chunks_count": 39
},
"embeddings": {
"status": "completed",
"chunks_embedded": 39,
"embeddings_stored": 39,
"processing_time": 2.3364198207855225,
"model": "voyage-3",
"dimensions": 1024,
"metadata": {
"model": "voyage-3",
"usage": {
"total_tokens": 1950
},
"embedding_dimensions": 1024,
"chroma_collection": "document_chunks"
}
}
}
},
"created_at": "2025-11-16T01:03:58.608947"
},
"message": "Document uploaded and processed successfully",
"duplicate": false
}
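For reference, calling this endpoint from Python could look like the snippet below; the multipart field name (“file”) and the server base URL are assumptions about the API, so adjust them to your deployment.

import requests

with open("2210.03629v3.pdf", "rb") as pdf:
    response = requests.post(
        "http://localhost:8000/api/documents/upload",  # assumed base URL
        files={"file": ("2210.03629v3.pdf", pdf, "application/pdf")},
    )
response.raise_for_status()
payload = response.json()
print(payload["document"]["id"], "duplicate:", payload["duplicate"])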
- POST /api/documents/add-url – add a document by fetching content from a URL
Sample API response:
{
"success": true,
"document": {
"id": "7773f9a8-5eaf-41ca-9048-608a36903bea",
"filename": "venturebeat.com_9471f509-13ef-4dcc-aa88-e919618e640c.txt",
"object_name": "7773f9a8-5eaf-41ca-9048-608a36903bea/venturebeat.com_9471f509-13ef-4dcc-aa88-e919618e640c.txt",
"hash": "60c0523381abb79057e94e73b9d2d51654e290503fed2f2ca73625f4f64d42c5",
"size": 43211,
"content_type": "application/octet-stream",
"source_type": "webpage",
"presigned_url": "http://localhost:9000/documents/7773f9a8-5eaf-41ca-9048-608a36903bea/venturebeat.com_9471f509-13ef-4dcc-aa88-e919618e640c.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minioadmin%2F20251116%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20251116T003858Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-Amz-Signature=a59fb6f82e104409fc149dcffb7824ea0417436fefd2ed71000b0f260c69fc74",
"metadata": {
"doc_id": "7773f9a8-5eaf-41ca-9048-608a36903bea",
"original_filename": "venturebeat.com_9471f509-13ef-4dcc-aa88-e919618e640c.txt",
"source_type": "webpage",
"file_hash": "60c0523381abb79057e94e73b9d2d51654e290503fed2f2ca73625f4f64d42c5",
"upload_timestamp": "2025-11-16T00:38:58.906093",
"title": "Databricks: 'PDF parsing for agentic AI is still unsolved' — new tool replaces multi-service pipelines with single function | VentureBeat",
"source_url": "https://venturebeat.com/data-infrastructure/databricks-pdf-parsing-for-agentic-ai-is-still-unsolved-new-tool-replaces",
"og_title": "Databricks: 'PDF parsing for agentic AI is still unsolved' — new tool replaces multi-service pipelines with single function",
"twitter_title": "Databricks: 'PDF parsing for agentic AI is still unsolved' — new tool replaces multi-service pipelines with single function",
"twitter_creator": "@venturebeat",
"article_published_time": "2025-11-14T11:00-05:00",
"language": "en",
"content_type": "text/html; charset=utf-8",
"last_modified": "",
"server": "Vercel",
"domain": "venturebeat.com",
"scheme": "https",
"final_url": "venturebeat.com",
"scraped_at": "2025-11-16T00:36:18.233072",
"processing": {
"status": "completed",
"chunks_created": 9,
"processing_time": 0.893602,
"text_statistics": {
"total_characters": 43169,
"total_words": 5713,
"response_size": 132416,
"status_code": 200
},
"graph_storage": {
"status": "completed",
"document_stored": true,
"chunks_stored": true,
"chunks_count": 9
},
"embeddings": {
"status": "completed",
"chunks_embedded": 9,
"embeddings_stored": 9,
"processing_time": 0.9918451309204102,
"model": "voyage-3",
"dimensions": 1024,
"metadata": {
"model": "voyage-3",
"usage": {
"total_tokens": 450
},
"embedding_dimensions": 1024,
"chroma_collection": "document_chunks"
}
},
"url": "https://venturebeat.com/data-infrastructure/databricks-pdf-parsing-for-agentic-ai-is-still-unsolved-new-tool-replaces"
}
},
"created_at": "2025-11-16T00:38:58.910172"
},
"message": "Webpage processed and stored successfully",
"duplicate": false
}
- GET /api/documents – list documents in the repository with pagination / filtering support
- GET /api/documents/{doc_id} – get metadata for a given document
- DELETE /api/documents/{doc_id} – delete a document and all associated data from Neo4j, Chroma DB, and MinIO object storage
- GET /api/documents/{doc_id}/authors – retrieve all authors for a given document
- GET /api/documents/{doc_id}/chunks – get all text chunks for a given document from the graph DB (Neo4j)
- POST /api/documents/search/similar – find chunks matching a query using vector similarity
- GET /api/documents/authors – search for authors with optional filtering and pagination
- GET /api/documents/authors/{author_name}/documents – retrieve all documents authored by a given author
- POST /api/documents/{doc_id}/embeddings – generate or regenerate embeddings for all chunks of a document
- GET /api/documents/{doc_id}/chunk-embeddings – get detailed information about chunks with embeddings in Chroma DB
- POST /api/documents/{doc_id}/entities – extract entities from a document using Claude API
- GET /api/documents/{doc_id}/entities – retrieve all entities extracted from a document
- GET /api/documents/entities/types/{entity_type} – find entities of a specific type with pagination support
- GET /api/documents/entities/{entity_id} – get detailed information about a specific entity
- GET /api/documents/entities/{entity_id}/related – find entities related to a given entity through co-occurrence
- GET /api/documents/relationships/types – get all defined relationship types and their constraints
- POST /api/documents/{doc_id}/relationships/extract – extract semantic relationships between entities in a specific document
- GET /api/documents/entities/{entity_id}/relationships – get all relationships for a specific entity with optional filtering
- GET /api/documents/{doc_id}/relationships – get all relationships extracted from a specific document
Batch APIs
- POST /api/documents/batch/embeddings – generate embeddings for a batch of documents
- POST /api/documents/batch/entities – extract entities from multiple documents in batch
Debug and Statistics APIs
- GET /api/documents/graph/statistics – get statistics about the knowledge graph data (documents, chunks, relationships)
- GET /api/documents/authors/statistics – get statistics about authors in the system
- GET /api/documents/embeddings/statistics – get statistics about embeddings across all documents in repository
- GET /api/documents/entities/statistics – get statistics about extracted entities across all documents
- GET /api/documents/relationships/statistics – get statistics about all relationships in the knowledge graph
- GET /api/documents/{doc_id}/diagnostics – debug endpoint to check document-chunk-entity relationships in Neo4j
Search API
Core Search APIs
- POST /api/search/query – main search endpoint supporting multiple search modes.
- Vector search: semantic similarity using embeddings
- Graph search: entity-based traversal and relationships
- Hybrid search: combination of vector and graph approaches
- GET /api/search/modes – get available search modes and their descriptions.
- POST /api/search/explain – analyze a query and provide explanation with mode suggestion
- Query complexity assessment
- Detected entities and concepts
- Recommended search mode
- Suggested mode weights for hybrid search
Search Statistics and Analytics APIs
- GET /api/search/statistics – get search system statistics and analytics
- GET /api/search/analytics/summary – get detailed analytics summary with trends and insights.
- GET /api/search/analytics/trends – get trending queries and search patterns.
- GET /api/search/analytics/performance – get performance insights and optimization recommendations.
- GET /api/search/health – health check for search functionality.
Conclusion
In Part 1 of this blog series about the GraphRAG document repository system I covered:
- Multi-Database Architecture: Neo4j knowledge graph for document and entity relationships, Chroma DB for semantic search, MinIO for file storage – each optimized for its specific purpose.
- Intelligent Document Processing Pipeline: an automatic ingestion pipeline that transforms raw PDF files or text extracted from a web page into a rich knowledge graph with semantic embeddings.
- Multi-Pass Entity Extraction: AI-powered extraction of named entities and their relationships from documents added to the repository.
- Knowledge Graph Model: a document-centric schema with 7 entity types, optimized relationship directions, and sequential chunk linking.
- Documents and Search APIs: core back-end server APIs for health monitoring, document ingestion, and multi-mode (graph, vector and hybrid) search.
In subsequent parts of the blog I will cover advanced AI-powered entity and relationship extraction across the repository corpus, including global entity co-occurrence, cross-document relation extraction, and community generation, and then AI-powered query generation in response to a natural-language user question. Stay tuned.