Nyfiken på data 2025 (Swedish: "Curious about data 2025")

Pukslagaren Digital Archive Project: A Complete RAG Pipeline

Overview

This project transformed 45 years of a Swedish newspaper’s staff magazine (Pukslagaren, 1968-2012) from unsearchable scanned PDFs into a semantic search system powered by AI. The end result is a Custom GPT that can answer natural language questions about the archive’s contents.


The Problem

We had 132 PDF files containing scanned magazine pages. The issues:

  1. Poor text layer quality – The existing OCR was garbled (e.g., “§tÅffiÅmEN” instead of “STAFFAN”)
  2. No semantic search – Traditional keyword search fails when you don’t know exact terms
  3. Size constraints – 2.9 GB total, far exceeding ChatGPT’s 512 MB knowledge base limit
  4. No structured metadata – Couldn’t filter by year or issue number

The Solution: A 5-Stage Pipeline

Stage 1: PDF Organization

132 individual PDFs (quarterly/biannual issues)
    ↓
Organized by year and issue number
    ↓
Naming pattern: "Pukslagaren YYYY-N.pdf"
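
This naming pattern later doubles as the metadata source for Stage 5. A minimal sketch of how year and issue can be recovered from a filename (the helper name and regex are illustrative assumptions, not the project's actual code):

```python
import re

# Matches the naming pattern "Pukslagaren YYYY-N.pdf", where the issue
# part may be a number ("1", "2") or a word ("sommaren", "julen").
FILENAME_PATTERN = re.compile(r"^Pukslagaren (\d{4})-(\w+)\.pdf$")

def parse_filename(name: str) -> tuple[int, str]:
    """Extract (year, issue) metadata from an archive filename."""
    m = FILENAME_PATTERN.match(name)
    if m is None:
        raise ValueError(f"unexpected filename: {name}")
    return int(m.group(1)), m.group(2)
```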

Stage 2: Optical Character Recognition (OCR)

Tool: ocrmypdf with Tesseract (Swedish language pack)

ocrmypdf --force-ocr --language swe input.pdf output.pdf
pdftotext output.pdf extracted.txt

Why --force-ocr? The PDFs had existing (broken) text layers. This flag ignores them and performs fresh OCR on the images.

Output: 132 text files, ~25,000-47,000 characters each
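
The run over all 132 files can be scripted around the two commands above; a minimal batch sketch (the directory layout and helper names are assumptions, not the project's actual script):

```python
import subprocess
from pathlib import Path

def ocr_commands(pdf: Path, out_dir: Path) -> list[list[str]]:
    """The two commands run per file: fresh OCR, then plain-text extraction."""
    ocred_pdf = out_dir / pdf.name
    text_file = out_dir / (pdf.stem + ".txt")
    return [
        ["ocrmypdf", "--force-ocr", "--language", "swe", str(pdf), str(ocred_pdf)],
        ["pdftotext", str(ocred_pdf), str(text_file)],
    ]

def run_batch(src_dir: Path, out_dir: Path) -> None:
    """OCR every archive PDF in src_dir, writing results to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(src_dir.glob("Pukslagaren *.pdf")):
        for cmd in ocr_commands(pdf, out_dir):
            subprocess.run(cmd, check=True)
```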

Stage 3: Text Chunking

Large documents must be split into smaller pieces for embedding:

Full document (~40,000 chars)
    ↓
Chunks of ~1,500 characters
    ↓
200-character overlap between chunks
    ↓
~30 chunks per document

Why overlap? Prevents information loss at chunk boundaries. A sentence split across two chunks will appear complete in at least one of them.

Result: 4,126 total chunks from 132 documents
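
The splitting above can be sketched as a sliding window; a minimal version with the stated parameters (the project's actual chunker may differ, e.g. by respecting sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters; each window starts
    chunk_size - overlap characters after the previous one, so consecutive
    chunks share overlap characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

On a 40,000-character document this yields 31 chunks, matching the ~30 per document figure above.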

Stage 4: Vector Embeddings

Tool: OpenAI Embeddings API (text-embedding-ada-002)

Text chunk → OpenAI API → 1536-dimensional vector

Each chunk becomes a point in 1536-dimensional space where semantically similar texts are located near each other.

Example: “personaltidning” (staff magazine) and “företagets nyhetsbrev” (company newsletter) would be close in vector space, even though they share no words.
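
"Located near each other" means high cosine similarity, the measure used for search in Stage 5. A toy illustration with 3-dimensional stand-in vectors (real embeddings have 1,536 dimensions and come from the API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins: imagine these came from the embeddings API.
staff_magazine = [0.9, 0.1, 0.2]   # "personaltidning"
newsletter     = [0.8, 0.2, 0.3]   # "företagets nyhetsbrev"
vietnam_war    = [0.1, 0.9, 0.1]   # "Vietnamkriget"

# Related concepts score near 1.0; unrelated ones score much lower.
```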

Stage 5: Vector Database

Tool: Supabase with pgvector extension

Schema:

CREATE TABLE pukslagaren_chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,           -- The actual text
  source_file text NOT NULL,       -- Original PDF filename
  year integer,                    -- 1968-2012
  issue text,                      -- "1", "2", "sommaren", "julen"
  chunk_index integer,             -- Position in document
  embedding vector(1536),          -- The vector representation
  created_at timestamptz
);

-- Enable fast similarity search
CREATE INDEX ON pukslagaren_chunks
  USING ivfflat (embedding vector_cosine_ops);

Search function:

CREATE FUNCTION search_pukslagaren(
  query_embedding vector(1536),
  match_count int DEFAULT 5,
  filter_year int DEFAULT NULL
) RETURNS SETOF pukslagaren_chunks
LANGUAGE sql STABLE AS $$
  SELECT *
  FROM pukslagaren_chunks
  WHERE filter_year IS NULL OR year = filter_year
  ORDER BY embedding <=> query_embedding  -- <=> is cosine distance in pgvector
  LIMIT match_count;
$$;

Uses pgvector's cosine-distance operator (<=>) so the vectors closest to the query are returned first.


The Search Flow (RAG)

RAG = Retrieval-Augmented Generation

User question: "Vad skrev Pukslagaren om Vietnamkriget?"
               ("What did Pukslagaren write about the Vietnam War?")
                            ↓
            [1] Embed the question
                            ↓
         Query vector (1536 dimensions)
                            ↓
    [2] Search vector database (cosine similarity)
                            ↓
      Top 5 most relevant text chunks
                            ↓
   [3] Send chunks + question to GPT-4
                            ↓
      "Enligt Pukslagaren 1972-3 skrev man..."
      ("According to Pukslagaren 1972-3, they wrote...")
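
Step [2], retrieval, can be sketched in plain Python over an in-memory list (in production pgvector does this inside the database; the records and 2-dimensional vectors here are toy stand-ins):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def retrieve_top_k(query_vec, chunks, k=5, year=None):
    """Rank chunks by cosine similarity to the query; optional year filter."""
    candidates = [c for c in chunks if year is None or c["year"] == year]
    ranked = sorted(candidates,
                    key=lambda c: cosine_similarity(query_vec, c["embedding"]),
                    reverse=True)
    return ranked[:k]
```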

Why RAG instead of fine-tuning?

  • Grounded answers – responses come from actual documents, which minimizes hallucination
  • Traceable sources – can cite specific issues
  • Updatable – add new documents without retraining
  • Cost-effective – embeddings are cheap, fine-tuning is expensive

The API Layer

Supabase Edge Function (Deno/TypeScript):

POST /search-pukslagaren
{
  "query": "Vietnamkriget",
  "year": 1972,        // optional filter
  "match_count": 5
}

The function:

  1. Receives the search query
  2. Calls OpenAI to embed the query
  3. Searches Supabase with the embedding
  4. Returns matching chunks with metadata

Cloudflare Worker (proxy for clean URLs):

api.jardenberg.se/pukslagaren/search-pukslagaren
         ↓ proxies to ↓
wnioekumgtuppvuqsbsu.supabase.co/functions/v1/search-pukslagaren

The Custom GPT Integration

OpenAPI Specification tells ChatGPT how to call our API:

openapi: 3.0.0
info:
  title: Pukslagaren Archive Search
  version: 1.0.0
servers:
  - url: https://api.jardenberg.se/pukslagaren
paths:
  /search-pukslagaren:
    post:
      operationId: searchPukslagaren
      requestBody:
        content:
          application/json:
            schema:
              type: object
              properties:
                query:       { type: string }
                year:        { type: integer }
                match_count: { type: integer }

GPT Instructions:

Du är en expert på Pukslagarens arkiv (1968-2012).
När användaren ställer en fråga:
1. Använd searchPukslagaren för att hitta relevant innehåll
2. Sammanfatta informationen på svenska
3. Ange alltid källan (år och nummer)

(In English: You are an expert on the Pukslagaren archive (1968-2012).
When the user asks a question: 1. use searchPukslagaren to find relevant
content; 2. summarize the information in Swedish; 3. always cite the
source (year and issue number).)

Architecture Diagram

┌───────────────────────────────────────────────┐
│ Custom GPT                                    │
│ "Pukslagaren Arkiv"                           │
└───────────────────────┬───────────────────────┘
                        │ OpenAPI Action
                        ▼
┌───────────────────────────────────────────────┐
│ Cloudflare Worker (Proxy)                     │
│ api.jardenberg.se/pukslagaren/*               │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│ Supabase Edge Function                        │
│ search-pukslagaren                            │
│ ┌─────────────┐    ┌───────────────────┐      │
│ │ Embed query │───▶│ Vector similarity │      │
│ │ (OpenAI)    │    │ search (pgvector) │      │
│ └─────────────┘    └───────────────────┘      │
└───────────────────────┬───────────────────────┘
                        │
                        ▼
┌───────────────────────────────────────────────┐
│ Supabase Database                             │
│ ┌───────────────────────────────────────────┐ │
│ │ pukslagaren_chunks                        │ │
│ │ • 4,126 rows                              │ │
│ │ • content, year, issue, embedding         │ │
│ │ • ivfflat index for fast similarity search│ │
│ └───────────────────────────────────────────┘ │
└───────────────────────────────────────────────┘

Key Technologies Used

Layer       Technology                        Purpose
OCR         ocrmypdf + Tesseract              Extract text from scanned images
Embeddings  OpenAI text-embedding-ada-002     Convert text to vectors
Vector DB   Supabase + pgvector               Store and search embeddings
API         Supabase Edge Functions (Deno)    Serverless search endpoint
Proxy       Cloudflare Workers                Custom domain routing
Interface   OpenAI Custom GPT                 Natural language chat interface

Results

  • Input: 2.9 GB of scanned PDFs with broken text
  • Output: Semantic search across 45 years of content
  • Chunks: 4,126 searchable text segments
  • Latency: ~500ms for a full search query
  • Cost: Minimal (embeddings ~$0.10/1M tokens, Supabase free tier)