Program: Intel Unnati Industrial Training (Jan 2025)

Duration: Jan 2025

Technologies: React, TypeScript, Vite, PDF.js, IndexedDB, Groq AI, Google Gemini, Perplexity, Anthropic Claude, Tailwind CSS, shadcn/ui, idb

Category: AI-Powered Document Intelligence, Enterprise Software, Knowledge Management

Description

IntelliRead is a local-first, browser-only PDF intelligence platform developed as part of the Intel Unnati Industrial Training program. It transforms enterprise PDFs into searchable, structured knowledge using AI-powered analysis, eliminating the need for backend servers or external databases.

Organizations manage thousands of PDFs (manuals, reports, policies, research papers), but finding specific information across them is time-consuming. IntelliRead addresses this by making PDFs queryable: it preserves document structure, extracts tables, analyzes images with AI, and supports search and natural-language Q&A backed by multiple AI providers.

IntelliRead processes documents entirely on the client and stores them locally in IndexedDB, so uploaded PDFs are never sent to an application server. All PDF parsing, text extraction, chunking, and indexing happen in the browser. AI providers are called only to describe images and answer questions, and they receive only the image data or text chunks needed for those requests.

Screenshots

IntelliRead Main Interface

Main interface showing document upload and processing

PDF Document Viewer

Document viewer with sections, tables, and image tabs

AI Chat Interface

AI-powered Q&A interface with cited answers and page references

API Settings Configuration

API configuration modal for Groq, Gemini, Perplexity, and Anthropic

Document Processing Pipeline

Real-time processing progress with stage indicators

Key Features

Intelligent Document Processing

AI-Powered Search & Q&A

Privacy-First Architecture

Supported PDF Types

Text-based PDFs

Standard PDFs with selectable text. Direct text extraction via PDF.js with full section and table detection.

Mixed PDFs

Documents combining text and images. Hybrid extraction with AI image analysis for charts and diagrams embedded in text content.

Image-only PDFs

Scanned documents or photo PDFs. Full page rendering to PNG with Gemini AI descriptions for searchable content.

Chart-heavy PDFs

Data visualizations and technical diagrams. AI-powered chart interpretation with data point extraction.

Technical Implementation

System Architecture

Processing Pipeline

Stage 1: PDF Upload & Analysis
  • File loaded into browser memory using FileReader API
  • PDF.js parses document structure and extracts metadata
  • Total page count, outline, and document properties determined
Stage 2: Page-by-Page Extraction
  • Text extraction using page.getTextContent() for each page
  • Image detection via operator list analysis
  • Page classification: TEXT, IMAGE_ONLY, or MIXED
Stage 3: Image Processing (if applicable)
  • Image-only pages rendered to PNG at 2x scale
  • Images sent to Gemini API for detailed descriptions
  • AI-generated descriptions stored as searchable text
Stage 4: Content Normalization
  • Text and image descriptions merged into unified page array
  • Table detection and extraction with row/column integrity
  • Section heading identification using regex patterns
Stage 5: Smart Chunking
  • Content split at sentence boundaries (never mid-word)
  • Target chunk size: 800 characters; max: 1200 characters
  • Metadata attached: documentId, sectionTitle, pageStart/End
Stage 6: IndexedDB Storage
  • Documents, sections, chunks, tables, and images stored in separate stores
  • Indexed by documentId for fast retrieval
  • Status updated to "indexed" when complete
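
The page classification in Stage 2 can be sketched as a small pure function. This is an illustrative sketch, not the project's actual code: `PageSignals`, `classifyPage`, and the `MIN_TEXT_CHARS` threshold are hypothetical names and values, and the inputs are assumed to come from the text extraction and image detection steps listed above.

```typescript
// Page kinds named in Stage 2 of the pipeline.
type PageKind = "TEXT" | "IMAGE_ONLY" | "MIXED";

// Hypothetical per-page summary built from the extraction step.
interface PageSignals {
  textLength: number; // characters of extracted text, after trimming
  imageCount: number; // images detected on the page
}

// Assumed threshold: shorter text is treated as "no usable text"
// (for example, a stray page number on a scanned page).
const MIN_TEXT_CHARS = 20;

function classifyPage(signals: PageSignals): PageKind {
  const hasText = signals.textLength >= MIN_TEXT_CHARS;
  const hasImages = signals.imageCount > 0;
  if (hasText && hasImages) return "MIXED";
  if (hasText) return "TEXT";
  // Assumption: any page without usable text is routed to the image path,
  // which also sends blank pages there.
  return "IMAGE_ONLY";
}
```

Only pages classified as IMAGE_ONLY would then go through the rendering and description work of Stage 3.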

Search & Retrieval

Challenges & Solutions

Challenge: Making scanned PDFs and image content searchable without an OCR backend

Solution: Implemented AI-powered image description using Google Gemini. Image-only pages are rendered to PNG and sent to the Gemini API, which generates detailed text descriptions. These descriptions are indexed as searchable text, making visual content queryable.
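
As a rough sketch of the request side, the function below builds a JSON body with one text prompt and one inline PNG, in the shape the Gemini `generateContent` REST endpoint accepts for inline image data. The function name, the prompt wording, and the choice to accept a data URL are assumptions made for illustration. The model name and the HTTP call itself are left out.

```typescript
// Request body with a text prompt and one inline image part.
interface GeminiImageRequest {
  contents: Array<{
    parts: Array<
      | { text: string }
      | { inline_data: { mime_type: string; data: string } }
    >;
  }>;
}

// Accepts either raw base64 or a "data:image/png;base64,..." URL
// (as produced by a canvas) and returns only the base64 payload.
function toBase64Payload(pngData: string): string {
  const comma = pngData.indexOf(",");
  return pngData.indexOf("data:") === 0 && comma >= 0
    ? pngData.slice(comma + 1)
    : pngData;
}

// Hypothetical helper: the prompt text is an assumption, not the project's prompt.
function buildImageDescriptionRequest(pngData: string, pageNumber: number): GeminiImageRequest {
  return {
    contents: [
      {
        parts: [
          {
            text:
              `Describe the content of page ${pageNumber} in detail, ` +
              `including any text, tables, charts, and diagrams.`,
          },
          { inline_data: { mime_type: "image/png", data: toBase64Payload(pngData) } },
        ],
      },
    ],
  };
}
```

The returned object would be serialized with `JSON.stringify` and sent with the user's own API key. The description in the response is what gets stored as searchable text.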

Challenge: Maintaining document structure and context during chunking

Solution: Developed a smart chunking algorithm that splits content at sentence boundaries while preserving section metadata. Each chunk carries documentId, sectionTitle, and page range, so retrieved content keeps its original context when displayed to users.
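
A minimal sketch of sentence-boundary chunking with the stated 800-character target and 1200-character maximum might look like the following. The function names and the `SectionInput` type are hypothetical, and two simplifications are assumed: sentences end at `.`, `!`, or `?` followed by whitespace, and every chunk inherits the page range of its whole section.

```typescript
interface SectionInput {
  documentId: string;
  sectionTitle: string;
  pageStart: number;
  pageEnd: number;
  text: string;
}

interface Chunk extends SectionInput {
  chunkIndex: number;
}

const TARGET_CHARS = 800;
const MAX_CHARS = 1200;

// Cut after ., ! or ? only when followed by whitespace or end of text,
// so decimals such as "3.14" stay intact.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  let start = 0;
  for (let i = 0; i < text.length; i++) {
    const isTerminator = ".!?".indexOf(text.charAt(i)) !== -1;
    const atBoundary = i + 1 === text.length || /\s/.test(text.charAt(i + 1));
    if (isTerminator && atBoundary) {
      const sentence = text.slice(start, i + 1).trim();
      if (sentence) sentences.push(sentence);
      start = i + 1;
    }
  }
  const tail = text.slice(start).trim();
  if (tail) sentences.push(tail);
  return sentences;
}

// Fallback for a single sentence longer than MAX_CHARS: split between words.
function splitLongSentence(sentence: string): string[] {
  if (sentence.length <= MAX_CHARS) return [sentence];
  const parts: string[] = [];
  let current = "";
  for (const word of sentence.split(/\s+/)) {
    const candidate = current ? current + " " + word : word;
    if (current && candidate.length > MAX_CHARS) {
      parts.push(current);
      current = word;
    } else {
      current = candidate;
    }
  }
  if (current) parts.push(current);
  return parts;
}

function chunkSection(section: SectionInput): Chunk[] {
  const texts: string[] = [];
  let current = "";
  for (const sentence of splitSentences(section.text)) {
    for (const part of splitLongSentence(sentence)) {
      const candidate = current ? current + " " + part : part;
      // Close the chunk once it reached the target, or if adding more would pass the max.
      if (current && (current.length >= TARGET_CHARS || candidate.length > MAX_CHARS)) {
        texts.push(current);
        current = part;
      } else {
        current = candidate;
      }
    }
  }
  if (current) texts.push(current);
  return texts.map((text, chunkIndex) => ({
    documentId: section.documentId,
    sectionTitle: section.sectionTitle,
    pageStart: section.pageStart,
    pageEnd: section.pageEnd,
    text,
    chunkIndex,
  }));
}
```

In this sketch a chunk can exceed the maximum only when a single word is itself longer than 1200 characters, since words are never split.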

Challenge: Supporting multiple AI providers with different API formats

Solution: Created a unified API client with provider-specific adapters. Each provider (Groq, Gemini, Perplexity, Anthropic) implements a consistent interface, allowing seamless switching while handling provider-specific authentication, rate limits, and response formats.
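
The adapter pattern described here could be sketched as below. All names (`ProviderAdapter`, `AIClient`, `createStubAdapter`) are hypothetical, and the stub adapter makes no network call: it stands in for a real adapter that would translate the request into one provider's HTTP format.

```typescript
type ProviderId = "groq" | "gemini" | "perplexity" | "anthropic";

interface ChatRequest {
  question: string;
  contextChunks: string[]; // retrieved chunk texts passed as context
}

interface ChatAnswer {
  provider: ProviderId;
  text: string;
}

// The single interface every provider adapter implements.
interface ProviderAdapter {
  id: ProviderId;
  ask(request: ChatRequest, apiKey: string): Promise<ChatAnswer>;
}

class AIClient {
  private adapters: Partial<Record<ProviderId, ProviderAdapter>> = {};

  register(adapter: ProviderAdapter): void {
    this.adapters[adapter.id] = adapter;
  }

  async ask(provider: ProviderId, request: ChatRequest, apiKey: string): Promise<ChatAnswer> {
    const adapter = this.adapters[provider];
    if (!adapter) throw new Error(`No adapter registered for ${provider}`);
    if (!apiKey) throw new Error(`Missing API key for ${provider}`);
    return adapter.ask(request, apiKey);
  }
}

// Stand-in adapter for illustration only: it echoes the request instead of
// calling a provider, so the routing can be exercised offline.
function createStubAdapter(id: ProviderId): ProviderAdapter {
  return {
    id,
    async ask(request: ChatRequest): Promise<ChatAnswer> {
      const text = `[${id}] ${request.question} (${request.contextChunks.length} context chunks)`;
      return { provider: id, text };
    },
  };
}
```

With this shape, switching providers in the chat panel only changes the `provider` argument. Authentication headers and response parsing stay inside each adapter.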

Challenge: Ensuring privacy while leveraging AI capabilities

Solution: Adopted a local-first architecture with 100% client-side processing. Documents are stored in IndexedDB and never uploaded to servers. AI providers receive only the necessary data (image blobs for description, text chunks for Q&A), and users provide their own API keys for complete control.

Challenge: Handling large PDFs without performance degradation

Solution: Implemented progressive processing with visual progress indicators. Pages are processed in batches, images are analyzed in parallel (max 3 concurrent), and IndexedDB transactions are batched for efficiency. Large documents are handled incrementally to prevent memory issues.
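
The "max 3 concurrent" limit on image analysis can be illustrated with a small worker-lane helper. This is a sketch under assumptions: `mapWithConcurrency` is a hypothetical name, and the `worker` callback stands in for whatever function sends one page image for description.

```typescript
// Runs `worker` over all items with at most `limit` calls in flight,
// returning results in the original item order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(Math.max(1, limit), items.length);
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(runLane());
  await Promise.all(lanes);
  return results;
}
```

A call such as `mapWithConcurrency(imagePages, 3, describePage)` would keep three requests active until the queue drains (`imagePages` and `describePage` are placeholders). If any worker rejects, the returned promise rejects, so a real pipeline would likely catch errors per page inside the worker.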

Project Structure

Frontend Components
  • /components/ui: shadcn/ui base components (Button, Card, Dialog, Tabs, etc.)
  • /components/Header.tsx: App header with navigation and settings access
  • /components/PDFViewer.tsx: Document viewer with tabs for Content, Tables, and Images
  • /components/ChatInterface.tsx: Q&A chat panel with provider selection
  • /components/DocumentLibrary.tsx: Document list and management UI
  • /components/APISettingsModal.tsx: API key configuration per provider
Core Libraries
  • pdfProcessor.ts: Orchestrates PDF ingestion, text extraction, image detection, and chunking
  • textChunker.ts: Splits content at sentence boundaries with metadata preservation
  • imageExtractor.ts: Detects image-only pages and renders them to PNG blobs
  • apiClient.ts: Unified interface for all AI providers (Groq, Gemini, Perplexity, Anthropic)
  • vectorSearch.ts: Keyword search and chunk retrieval with relevance scoring
  • db.ts: IndexedDB wrapper using idb library for CRUD operations
State Management (Custom Hooks)
  • useAPIKeys: Load, save, and validate API keys from IndexedDB
  • useChat: Manage chat messages, send queries, handle AI responses
  • useDocuments: CRUD operations for documents in IndexedDB
IndexedDB Schema
  • documents: Document metadata (id, title, pageCount, wordCount, status)
  • sections: Extracted sections with page ranges and content
  • chunks: Text chunks with metadata for retrieval
  • tables: Extracted tables with row/column data
  • images: Image metadata and AI-generated descriptions
  • chatHistory: Conversation history per document per provider
  • apiSettings: API keys for each provider
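
One way to express this schema in code is a table of store definitions applied inside the database upgrade step. The store names come from the list above. The key paths, index names, and the `applySchema` helper are assumptions. `DatabaseLike` mirrors only the two IndexedDB methods the sketch uses (`createObjectStore` and `createIndex`), so the logic can be tried without a browser.

```typescript
interface StoreDefinition {
  keyPath: string;
  indexes: { name: string; keyPath: string }[];
}

// Assumed index shared by every per-document store.
const BY_DOCUMENT = { name: "by-document", keyPath: "documentId" };

const STORE_DEFINITIONS: Record<string, StoreDefinition> = {
  documents: { keyPath: "id", indexes: [] },
  sections: { keyPath: "id", indexes: [BY_DOCUMENT] },
  chunks: { keyPath: "id", indexes: [BY_DOCUMENT] },
  tables: { keyPath: "id", indexes: [BY_DOCUMENT] },
  images: { keyPath: "id", indexes: [BY_DOCUMENT] },
  chatHistory: { keyPath: "id", indexes: [BY_DOCUMENT] },
  apiSettings: { keyPath: "provider", indexes: [] },
};

// Minimal structural types covering the calls made below.
interface StoreLike {
  createIndex(name: string, keyPath: string): unknown;
}
interface DatabaseLike {
  createObjectStore(name: string, options: { keyPath: string }): StoreLike;
}

// Creates every store and its indexes; meant to run during a version upgrade.
function applySchema(db: DatabaseLike): void {
  for (const name of Object.keys(STORE_DEFINITIONS)) {
    const definition = STORE_DEFINITIONS[name];
    const store = db.createObjectStore(name, { keyPath: definition.keyPath });
    for (const index of definition.indexes) {
      store.createIndex(index.name, index.keyPath);
    }
  }
}
```

Indexing the per-document stores on `documentId` is what lets all chunks, tables, or images for one document be read with a single index query.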

Learning Outcomes

Technical Skills Developed

Enterprise Software Principles

Intel Unnati Program Insights

Performance Metrics

Processing Speed
  • Text-only PDF (50 pages): 15-30 seconds
  • Mixed content PDF (50 pages): 30-60 seconds
  • Image-only PDF (50 pages): 2-5 minutes (dependent on AI API latency)
Storage Efficiency
  • Text-only documents: ~1000 documents (100 pages each) in typical browser storage
  • Mixed content: ~500 documents
  • Chunk size: 800 characters target, 1200 max (chosen to keep chunks semantically coherent)
Search Performance
  • Keyword search: <100ms for documents with 5000+ chunks
  • Context retrieval: Top-5 chunks retrieved in <50ms
  • AI response time: 1-3 seconds (Groq), 2-5 seconds (Claude/Perplexity)
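
For illustration, keyword search with relevance scoring and top-5 retrieval could be sketched as follows. The scoring formula (term hits normalized by chunk length, plus a bonus for section-title matches) is an assumption and not the project's documented method. The tokenizer assumes English text and only keeps ASCII letters and digits.

```typescript
interface SearchableChunk {
  id: string;
  documentId: string;
  sectionTitle: string;
  text: string;
}

interface ScoredChunk {
  chunk: SearchableChunk;
  score: number;
}

// Lowercase, split on non-alphanumerics, drop one-character tokens.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 1);
}

function searchChunks(query: string, chunks: SearchableChunk[], topK: number = 5): ScoredChunk[] {
  // De-duplicate query terms so repeated words are not counted twice.
  const terms = tokenize(query).filter((term, i, all) => all.indexOf(term) === i);
  const scored: ScoredChunk[] = [];

  for (const chunk of chunks) {
    const bodyTokens = tokenize(chunk.text);
    if (bodyTokens.length === 0) continue;
    const titleTokens = tokenize(chunk.sectionTitle);

    let hits = 0;
    let titleHits = 0;
    for (const term of terms) {
      hits += bodyTokens.filter((token) => token === term).length;
      if (titleTokens.indexOf(term) !== -1) titleHits += 1;
    }
    if (hits === 0 && titleHits === 0) continue;

    // Assumed scoring: length-normalized term hits plus one point per title match.
    const score = hits / Math.sqrt(bodyTokens.length) + titleHits;
    scored.push({ chunk, score });
  }

  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK);
}
```

The top-ranked chunk texts, together with their section titles and page ranges, would then be passed to the selected AI provider as context for a cited answer.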

Future Enhancements

Advanced Features

AI Capabilities

Enterprise Features
