Pukslagaren Digital Archive Project: A Complete RAG Pipeline
Overview
This project transformed 45 years of a Swedish newspaper’s staff magazine (Pukslagaren, 1968-2012) from unsearchable scanned PDFs into a semantic search system powered by AI. The end result is a Custom GPT that can answer natural language questions about the archive’s contents.
The Problem
We had 132 PDF files containing scanned magazine pages. The issues:
- Poor text layer quality – The existing OCR was garbled (e.g., “§tÅffiÅmEN” instead of “STAFFAN”)
- No semantic search – Traditional keyword search fails when you don’t know exact terms
- Size constraints – 2.9 GB total, far exceeding ChatGPT’s 512 MB knowledge base limit
- No structured metadata – Couldn’t filter by year or issue number
The Solution: A 5-Stage Pipeline
Stage 1: PDF Organization
132 individual PDFs (quarterly/biannual issues)
↓
Organized by year and issue number
↓
Naming pattern: "Pukslagaren YYYY-N.pdf"
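Given the naming pattern above, year and issue can later be recovered directly from the filename. A small sketch in Python (illustrative; the regex assumes the issue part is either a number or a named issue such as "sommaren"):

```python
import re

# Matches "Pukslagaren YYYY-N.pdf", where N may be a number
# or a named issue like "sommaren" / "julen"
PATTERN = re.compile(r"^Pukslagaren (\d{4})-(\w+)\.pdf$")

def parse_filename(name: str) -> tuple[int, str]:
    """Extract (year, issue) metadata from an archive filename."""
    match = PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected filename: {name}")
    return int(match.group(1)), match.group(2)
```

For example, `parse_filename("Pukslagaren 1972-3.pdf")` yields `(1972, "3")`, which is exactly the metadata stored in the `year` and `issue` columns later in the pipeline.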
Stage 2: Optical Character Recognition (OCR)
Tool: ocrmypdf with Tesseract (Swedish language pack)
ocrmypdf --force-ocr --language swe input.pdf output.pdf
pdftotext output.pdf extracted.txt
Why --force-ocr? The PDFs had existing (broken) text layers. This flag ignores them and performs fresh OCR on the images.
Output: 132 text files, ~25,000-47,000 characters each
Stage 3: Text Chunking
Large documents must be split into smaller pieces for embedding:
Full document (~40,000 chars)
↓
Chunks of ~1,500 characters
↓
200-character overlap between chunks
↓
~30 chunks per document
Why overlap? Prevents information loss at chunk boundaries. A sentence split across two chunks will appear complete in at least one of them.
Result: 4,126 total chunks from 132 documents
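The chunking step can be sketched as a simple sliding window (illustrative Python; the real pipeline may additionally respect sentence or paragraph boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into chunks of ~chunk_size characters, with each chunk
    overlapping the previous one by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # the window advances 1,300 chars per chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

With the defaults, a ~40,000-character document yields 31 chunks (40,000 / 1,300, rounded up), matching the "~30 chunks per document" figure above.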
Stage 4: Vector Embeddings
Tool: OpenAI Embeddings API (text-embedding-ada-002)
Text chunk → OpenAI API → 1536-dimensional vector
Each chunk becomes a point in 1536-dimensional space where semantically similar texts are located near each other.
Example: “personaltidning” (staff magazine) and “företagets nyhetsbrev” (company newsletter) would be close in vector space, even though they share no words.
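"Located near each other" is typically measured with cosine similarity, the cosine of the angle between two vectors. A minimal illustration in Python, using toy 3-dimensional vectors in place of the real 1536-dimensional ones (the vector values are invented for the example):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy example: two "similar" vectors vs. an unrelated one
magazine   = [0.9, 0.1, 0.2]
newsletter = [0.8, 0.2, 0.3]
football   = [0.1, 0.9, 0.1]

print(cosine_similarity(magazine, newsletter))  # high, close to 1
print(cosine_similarity(magazine, football))    # noticeably lower
```

This is the same measure pgvector's `vector_cosine_ops` uses when ranking chunks in Stage 5.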
Stage 5: Vector Database
Tool: Supabase with pgvector extension
Schema:
CREATE TABLE pukslagaren_chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,      -- The actual text
  source_file text NOT NULL,  -- Original PDF filename
  year integer,               -- 1968-2012
  issue text,                 -- "1", "2", "sommaren", "julen"
  chunk_index integer,        -- Position in document
  embedding vector(1536),     -- The vector representation
  created_at timestamptz
);

-- Enable fast similarity search
CREATE INDEX ON pukslagaren_chunks
  USING ivfflat (embedding vector_cosine_ops);
Search function:
CREATE FUNCTION search_pukslagaren(
  query_embedding vector(1536),
  match_count int DEFAULT 5,
  filter_year int DEFAULT NULL
)
Uses cosine similarity to find the closest vectors to the query.
The Search Flow (RAG)
RAG = Retrieval-Augmented Generation
User question: "Vad skrev Pukslagaren om Vietnamkriget?" ("What did Pukslagaren write about the Vietnam War?")
↓
[1] Embed the question
↓
Query vector (1536 dimensions)
↓
[2] Search vector database (cosine similarity)
↓
Top 5 most relevant text chunks
↓
[3] Send chunks + question to GPT-4
↓
"Enligt Pukslagaren 1972-3 skrev man..." ("According to Pukslagaren 1972-3, they wrote...")
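Step [3] amounts to pasting the retrieved chunks into a grounded prompt ahead of the question. A sketch of how that prompt might be assembled (the dictionary keys mirror the table schema; the exact wording of the instructions is illustrative, not the production prompt):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n\n".join(
        f"[{c['source_file']}, chunk {c['chunk_index']}]\n{c['content']}"
        for c in chunks
    )
    return (
        "Answer using only the context below. Cite year and issue.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because each chunk is labeled with its `source_file`, the model can cite the specific issue it drew from, which is what makes the answers traceable.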
Why RAG instead of fine-tuning?
- Grounded answers – responses are drawn from the actual documents, which sharply reduces hallucination
- Traceable sources – can cite specific issues
- Updatable – add new documents without retraining
- Cost-effective – embeddings are cheap, fine-tuning is expensive
The API Layer
Supabase Edge Function (Deno/TypeScript):
POST /search-pukslagaren
{
  "query": "Vietnamkriget",
  "year": 1972,       // optional filter
  "match_count": 5
}
The function:
- Receives the search query
- Calls OpenAI to embed the query
- Searches Supabase with the embedding
- Returns matching chunks with metadata
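The four steps above can be sketched end to end. A toy Python version with a stubbed embedding function and an in-memory list standing in for the database (the real edge function is TypeScript and calls the OpenAI API and the `search_pukslagaren` SQL function; the fake embedder and sample rows below are invented for illustration):

```python
def dot(a, b):
    """With unit-length embeddings, the dot product equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def search(query, rows, embed, match_count=5, year=None):
    """Mirror of the edge function's logic: embed the query,
    apply the optional year filter, rank rows by similarity."""
    q = embed(query)
    candidates = [r for r in rows if year is None or r["year"] == year]
    ranked = sorted(candidates, key=lambda r: dot(q, r["embedding"]), reverse=True)
    return ranked[:match_count]

# Toy usage with a fake 2-d embedder and three sample rows
fake_embed = lambda text: [1.0, 0.0] if "krig" in text else [0.0, 1.0]
rows = [
    {"year": 1972, "content": "om kriget", "embedding": [0.9, 0.1]},
    {"year": 1972, "content": "om sport",  "embedding": [0.1, 0.9]},
    {"year": 1985, "content": "om kriget", "embedding": [0.9, 0.1]},
]
hits = search("Vietnamkriget", rows, fake_embed, match_count=2, year=1972)
```

Here `hits` contains only 1972 rows, ranked with the war-related chunk first, which is the shape of response the Custom GPT consumes.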
Cloudflare Worker (proxy for clean URLs):
api.jardenberg.se/pukslagaren/search-pukslagaren
↓ proxies to ↓
wnioekumgtuppvuqsbsu.supabase.co/functions/v1/search-pukslagaren
The Custom GPT Integration
OpenAPI Specification tells ChatGPT how to call our API:
openapi: 3.0.0
info:
  title: Pukslagaren Archive Search
  version: 1.0.0
servers:
  - url: https://api.jardenberg.se/pukslagaren
paths:
  /search-pukslagaren:
    post:
      operationId: searchPukslagaren
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [query]
              properties:
                query:
                  type: string
                year:
                  type: integer
                match_count:
                  type: integer
GPT Instructions:
Du är en expert på Pukslagarens arkiv (1968-2012).
När användaren ställer en fråga:
1. Använd searchPukslagaren för att hitta relevant innehåll
2. Sammanfatta informationen på svenska
3. Ange alltid källan (år och nummer)

(In English: "You are an expert on the Pukslagaren archive (1968-2012). When the user asks a question: 1. Use searchPukslagaren to find relevant content. 2. Summarize the information in Swedish. 3. Always cite the source (year and issue number).")
Architecture Diagram
┌─────────────────────────────────────────────┐
│                 Custom GPT                  │
│             "Pukslagaren Arkiv"             │
└──────────────────────┬──────────────────────┘
                       │ OpenAPI Action
                       ▼
┌─────────────────────────────────────────────┐
│          Cloudflare Worker (Proxy)          │
│       api.jardenberg.se/pukslagaren/*       │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│           Supabase Edge Function            │
│             search-pukslagaren              │
│  ┌─────────────┐      ┌───────────────────┐ │
│  │ Embed query │─────▶│ Vector similarity │ │
│  │  (OpenAI)   │      │ search (pgvector) │ │
│  └─────────────┘      └───────────────────┘ │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│              Supabase Database              │
│  pukslagaren_chunks                         │
│  • 4,126 rows                               │
│  • content, year, issue, embedding          │
│  • ivfflat index for fast similarity search │
└─────────────────────────────────────────────┘
Key Technologies Used
| Layer | Technology | Purpose |
|---|---|---|
| OCR | ocrmypdf + Tesseract | Extract text from scanned images |
| Embeddings | OpenAI text-embedding-ada-002 | Convert text to vectors |
| Vector DB | Supabase + pgvector | Store and search embeddings |
| API | Supabase Edge Functions (Deno) | Serverless search endpoint |
| Proxy | Cloudflare Workers | Custom domain routing |
| Interface | OpenAI Custom GPT | Natural language chat interface |
Results
- Input: 2.9 GB of scanned PDFs with broken text
- Output: Semantic search across 45 years of content
- Chunks: 4,126 searchable text segments
- Latency: ~500ms for a full search query
- Cost: Minimal (embeddings ~$0.10/1M tokens, Supabase free tier)