Nyfiken på data 2025 (Swedish: "Curious about data 2025")

Pukslagaren Digital Archive Project: A Complete RAG Pipeline

Overview

This project transformed 45 years of a Swedish newspaper’s staff magazine (Pukslagaren, 1968-2012) from unsearchable scanned PDFs into a semantic search system powered by AI. The end result is a Custom GPT that can answer natural language questions about the archive’s contents.


The Problem

We had 132 PDF files containing scanned magazine pages. The issues:

  1. Poor text layer quality – The existing OCR was garbled (e.g., “§tÅffiÅmEN” instead of “STAFFAN”)
  2. No semantic search – Traditional keyword search fails when you don’t know exact terms
  3. Size constraints – 2.9 GB total, far exceeding ChatGPT’s 512 MB knowledge base limit
  4. No structured metadata – Couldn’t filter by year or issue number

The Solution: A 5-Stage Pipeline

Stage 1: PDF Organization

132 individual PDFs (quarterly/biannual issues)
    ↓
Organized by year and issue number
    ↓
Naming pattern: "Pukslagaren YYYY-N.pdf"
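
This naming pattern later doubles as the metadata source for Stage 5. A minimal sketch of how year and issue can be recovered from a filename (the helper name and regex are illustrative assumptions, not the project's actual code):

```python
import re

# Matches the naming pattern "Pukslagaren YYYY-N.pdf", where the issue
# part may be a number ("1", "2") or a word ("sommaren", "julen").
FILENAME_PATTERN = re.compile(r"^Pukslagaren (\d{4})-(\w+)\.pdf$")

def parse_filename(name: str) -> tuple[int, str]:
    """Extract (year, issue) metadata from an archive filename."""
    m = FILENAME_PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return int(m.group(1)), m.group(2)
```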

Stage 2: Optical Character Recognition (OCR)

Tool: ocrmypdf with Tesseract (Swedish language pack)

ocrmypdf --force-ocr --language swe input.pdf output.pdf
pdftotext output.pdf extracted.txt

Why --force-ocr? The PDFs had existing (broken) text layers. This flag ignores them and performs fresh OCR on the images.

Output: 132 text files, ~25,000-47,000 characters each
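
The run over all 132 files can be scripted around the two commands above; a minimal batch sketch (the directory layout and helper names are assumptions, not the project's actual script):

```python
import subprocess
from pathlib import Path

def ocr_commands(pdf: Path, out_dir: Path) -> list[list[str]]:
    """The two commands run per file: fresh OCR, then plain-text extraction."""
    ocred_pdf = out_dir / pdf.name
    text_file = out_dir / (pdf.stem + ".txt")
    return [
        ["ocrmypdf", "--force-ocr", "--language", "swe", str(pdf), str(ocred_pdf)],
        ["pdftotext", str(ocred_pdf), str(text_file)],
    ]

def run_batch(src_dir: Path, out_dir: Path) -> None:
    """OCR every archive PDF in src_dir, writing results to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("Pukslagaren *.pdf")):
        for cmd in ocr_commands(pdf, out_dir):
            subprocess.run(cmd, check=True)
```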

Stage 3: Text Chunking

Large documents must be split into smaller pieces for embedding:

Full document (~40,000 chars)
    ↓
Chunks of ~1,500 characters
    ↓
200-character overlap between chunks
    ↓
~30 chunks per document

Why overlap? Prevents information loss at chunk boundaries. A sentence split across two chunks will appear complete in at least one of them.

Result: 4,126 total chunks from 132 documents
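
The splitting above can be sketched as a sliding window; a minimal version with the stated parameters (the project's actual chunker may differ, e.g. by respecting sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters; each window starts
    chunk_size - overlap characters after the previous one, so consecutive
    chunks share overlap characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

On a 40,000-character document this yields 31 chunks, matching the ~30 per document figure above.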

Stage 4: Vector Embeddings

Tool: OpenAI Embeddings API (text-embedding-ada-002)

Text chunk → OpenAI API → 1536-dimensional vector

Each chunk becomes a point in 1536-dimensional space where semantically similar texts are located near each other.

Example: “personaltidning” (staff magazine) and “företagets nyhetsbrev” (company newsletter) would be close in vector space, even though they share no words.
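
"Located near each other" means high cosine similarity, the measure used for search in Stage 5. A toy illustration with 3-dimensional stand-in vectors (real embeddings have 1,536 dimensions and come from the API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins: imagine these came from the embeddings API.
staff_magazine = [0.9, 0.1, 0.2]   # "personaltidning"
newsletter     = [0.8, 0.2, 0.3]   # "företagets nyhetsbrev"
vietnam_war    = [0.1, 0.9, 0.1]   # "Vietnamkriget"

# Related concepts score near 1.0; unrelated ones score much lower.
```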

Stage 5: Vector Database

Tool: Supabase with pgvector extension

Schema:

CREATE TABLE pukslagaren_chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,           -- The actual text
  source_file text NOT NULL,       -- Original PDF filename
  year integer,                    -- 1968-2012
  issue text,                      -- "1", "2", "sommaren", "julen"
  chunk_index integer,             -- Position in document
  embedding vector(1536),          -- The vector representation
  created_at timestamptz
);

-- Enable fast similarity search
CREATE INDEX ON pukslagaren_chunks
  USING ivfflat (embedding vector_cosine_ops);

Search function:

CREATE FUNCTION search_pukslagaren(
  query_embedding vector(1536),
  match_count int DEFAULT 5,
  filter_year int DEFAULT NULL
) RETURNS SETOF pukslagaren_chunks
LANGUAGE sql STABLE AS $$
  SELECT *
  FROM pukslagaren_chunks
  WHERE filter_year IS NULL OR year = filter_year
  ORDER BY embedding <=> query_embedding  -- <=> is cosine distance in pgvector
  LIMIT match_count;
$$;

Uses pgvector's cosine-distance operator (<=>) so the vectors closest to the query are returned first.


The Search Flow (RAG)

RAG = Retrieval-Augmented Generation

User question: "Vad skrev Pukslagaren om Vietnamkriget?"
               ("What did Pukslagaren write about the Vietnam War?")
                            ↓
            [1] Embed the question
                            ↓
         Query vector (1536 dimensions)
                            ↓
    [2] Search vector database (cosine similarity)
                            ↓
      Top 5 most relevant text chunks
                            ↓
   [3] Send chunks + question to GPT-4
                            ↓
      "Enligt Pukslagaren 1972-3 skrev man..."
      ("According to Pukslagaren 1972-3, they wrote...")
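
Step [2], retrieval, can be sketched in plain Python over an in-memory list (in production pgvector does this inside the database; the records and 2-dimensional vectors here are toy stand-ins):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def retrieve_top_k(query_vec, chunks, k=5, year=None):
    """Rank chunks by cosine similarity to the query; optional year filter."""
    candidates = [c for c in chunks if year is None or c["year"] == year]
    ranked = sorted(candidates,
                    key=lambda c: cosine_similarity(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]
```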

Why RAG instead of fine-tuning?

  • Grounded answers – responses come from actual documents, which minimizes hallucination
  • Traceable sources – can cite specific issues
  • Updatable – add new documents without retraining
  • Cost-effective – embeddings are cheap, fine-tuning is expensive

The API Layer

Supabase Edge Function (Deno/TypeScript):

POST /search-pukslagaren
{
  "query": "Vietnamkriget",
  "year": 1972,        // optional filter
  "match_count": 5
}

The function:

  1. Receives the search query
  2. Calls OpenAI to embed the query
  3. Searches Supabase with the embedding
  4. Returns matching chunks with metadata

Cloudflare Worker (proxy for clean URLs):

api.jardenberg.se/pukslagaren/search-pukslagaren
         ↓ proxies to ↓
wnioekumgtuppvuqsbsu.supabase.co/functions/v1/search-pukslagaren

The Custom GPT Integration

OpenAPI Specification tells ChatGPT how to call our API:

openapi: 3.0.0
info:
  title: Pukslagaren Archive Search
  version: 1.0.0
servers:
  - url: https://api.jardenberg.se/pukslagaren
paths:
  /search-pukslagaren:
    post:
      operationId: searchPukslagaren
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                query:       { type: string }
                year:        { type: integer }
                match_count: { type: integer }

GPT Instructions:

Du är en expert på Pukslagarens arkiv (1968-2012).
När användaren ställer en fråga:
1. Använd searchPukslagaren för att hitta relevant innehåll
2. Sammanfatta informationen på svenska
3. Ange alltid källan (år och nummer)

(In English: You are an expert on the Pukslagaren archive (1968-2012).
When the user asks a question: 1. use searchPukslagaren to find relevant
content; 2. summarize the information in Swedish; 3. always cite the
source (year and issue number).)

Architecture Diagram

┌───────────────────────────────────────────────┐
│ Custom GPT                                    │
│ "Pukslagaren Arkiv"                           │
└───────────────────────┬───────────────────────┘
                        │ OpenAPI Action
                        ▼
┌───────────────────────────────────────────────┐
│ Cloudflare Worker (Proxy)                     │
│ api.jardenberg.se/pukslagaren/*               │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│ Supabase Edge Function                        │
│ search-pukslagaren                            │
│ ┌─────────────┐    ┌───────────────────┐      │
│ │ Embed query │───▶│ Vector similarity │      │
│ │ (OpenAI)    │    │ search (pgvector) │      │
│ └─────────────┘    └───────────────────┘      │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│ Supabase Database                             │
│ ┌───────────────────────────────────────────┐ │
│ │ pukslagaren_chunks                        │ │
│ │ • 4,126 rows                              │ │
│ │ • content, year, issue, embedding         │ │
│ │ • ivfflat index for fast similarity search│ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────┘

Key Technologies Used

Layer       Technology                        Purpose
OCR         ocrmypdf + Tesseract              Extract text from scanned images
Embeddings  OpenAI text-embedding-ada-002     Convert text to vectors
Vector DB   Supabase + pgvector               Store and search embeddings
API         Supabase Edge Functions (Deno)    Serverless search endpoint
Proxy       Cloudflare Workers                Custom domain routing
Interface   OpenAI Custom GPT                 Natural language chat interface

Results

  • Input: 2.9 GB of scanned PDFs with broken text
  • Output: Semantic search across 45 years of content
  • Chunks: 4,126 searchable text segments
  • Latency: ~500ms for a full search query
  • Cost: Minimal (embeddings ~$0.10/1M tokens, Supabase free tier)