John Boen

Data & AI Systems Engineer

These projects begin as tools I build for my own use: to learn, to test ideas, and to solve real problems in a practical way. When they become useful, I publish them so others can use them too. What you see here is not a collection of demos or portfolio exercises, but a public view of an active engineering practice.

Featured Projects

Certification

Scenario-based certification study across providers and exams

Built because I needed a structured study system I could actually use — not flashcards, but scenario-driven practice at exam depth.

A study site featuring 1,300+ sample questions covering 26 certification exams across four providers, including the recently added CCA-F (Claude Certified Architect: Foundations).

Content Systems Structured Data AI-Generated Content

Updated March 2026

Architecture

JSON-backed question bank organized by provider, exam, and domain. Static site generation produces filterable study interfaces with progress tracking via localStorage.
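A question bank organized this way can be filtered with a few lines of code. This is a hypothetical sketch of the idea; the field names and sample questions are illustrative, not the site's actual schema.

```python
import json

# Illustrative question records; the real bank's fields may differ.
QUESTIONS = json.loads("""
[
  {"provider": "AWS", "exam": "SAA-C03", "domain": "Security",
   "scenario": "A team needs cross-account S3 access...",
   "options": ["A", "B", "C", "D"], "answer": 1},
  {"provider": "Anthropic", "exam": "CCA-F", "domain": "Foundations",
   "scenario": "An architect is choosing a model tier...",
   "options": ["A", "B", "C", "D"], "answer": 2}
]
""")

def filter_bank(questions, provider=None, exam=None, domain=None):
    """Return questions matching every filter that is set."""
    def keep(q):
        return ((provider is None or q["provider"] == provider)
                and (exam is None or q["exam"] == exam)
                and (domain is None or q["domain"] == domain))
    return [q for q in questions if keep(q)]
```

Because the data is plain JSON keyed by provider, exam, and domain, the same structure can drive both the static-site build and in-browser filtering.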

Tech Stack

HTML/CSS/JS, JSON data layer, localStorage persistence, AI-assisted content generation pipeline.

What I Learned

Generating high-quality exam scenarios at scale requires structured prompts and multi-pass validation. The hardest part is ensuring questions test understanding, not memorization.
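One way to implement multi-pass validation is a series of structural checks over each generated question. The passes below are examples of the kind of checks described, not the project's actual validation rules.

```python
def validate_question(q):
    """Run illustrative validation passes; return a list of errors."""
    errors = []
    # Pass 1: structural completeness
    for field in ("scenario", "options", "answer", "explanation"):
        if field not in q:
            errors.append(f"missing field: {field}")
    # Pass 2: option sanity
    options = q.get("options", [])
    if len(options) != 4:
        errors.append("expected exactly 4 options")
    if len(set(options)) != len(options):
        errors.append("duplicate options")
    # Pass 3: answer key must index a real option
    answer = q.get("answer")
    if not isinstance(answer, int) or not 0 <= answer < len(options):
        errors.append("answer index out of range")
    return errors
```

Structural passes like these catch malformed generations cheaply; the harder judgment, whether a question tests understanding rather than recall, still needs a separate review pass.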

JobClass

Labor market analysis with twenty lessons learned and architecture discussions

Built to understand how government job data is organized and what patterns emerge when you process it through a real pipeline.

An analysis of government job classification data supported by extraction and processing pipelines, warehouse build steps, and release pipeline output. The live site includes common practices, lessons learned, and technology explanations.

Pipelines Analytics Data Warehouse ETL

Updated March 2026

Architecture

Multi-stage pipeline: extraction from public data sources, processing and normalization, warehouse build, and static site publishing. Each stage is independently runnable and inspectable.
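The "independently runnable stages" pattern can be sketched as plain functions, each taking the previous stage's output, so any stage can be rerun and inspected on its own. The stage names follow the description above; the bodies are placeholders, not the project's real code.

```python
def extract():
    """Placeholder for extraction from public data sources."""
    return [{"title": " Data Analyst ", "grade": "GS-11"}]

def process(raw):
    """Normalize records (here: strip stray whitespace)."""
    return [{**r, "title": r["title"].strip()} for r in raw]

def build_warehouse(rows):
    """Group titles by grade for downstream queries."""
    warehouse = {}
    for r in rows:
        warehouse.setdefault(r["grade"], []).append(r["title"])
    return warehouse

def publish(warehouse):
    """Render a trivial stand-in for the static site output."""
    return "\n".join(f"{g}: {', '.join(t)}"
                     for g, t in sorted(warehouse.items()))

site = publish(build_warehouse(process(extract())))
```

In the real pipeline each stage would read and write files rather than in-memory values, which is what makes the intermediate state inspectable.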

Tech Stack

Python extraction scripts, JSON/CSV data layers, HTML/CSS/JS presentation, pipeline orchestration, GitHub Pages publishing.

What I Learned

Government data is messy in predictable ways. The site's twenty-lessons-learned section captures the practical patterns: what breaks, what scales, and what to normalize early.

Epstein DOJ Disclosures

AI-powered search over 4,000+ publicly released DOJ documents

Built to see if RAG and local AI infrastructure could make a large public document release genuinely searchable — not just downloadable, but answerable.

A Retrieval-Augmented Generation system that collects, processes, and indexes 4,049 PDF documents from the DOJ's Epstein disclosure releases. Combines vector-based semantic search with local LLM inference to enable natural-language queries across the full corpus.

RAG Vector Search Docker Local AI Pipelines Document Processing

Updated March 2026

Architecture

Three-container Docker stack: FastAPI application, Qdrant vector database, and Ollama for local GPU-accelerated LLM inference. Data pipeline downloads 5 ZIP releases (~2.8 GB) from justice.gov, extracts and organizes PDFs by dataset, then processes them through text extraction, chunking, and embedding for vector storage.
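The chunking step in that pipeline can be sketched as a simple overlapping-window split. The size and overlap values here are illustrative, and the embedding step (sentence-transformers in the real stack) is omitted.

```python
def chunk_text(text, size=500, overlap=100):
    """Split extracted PDF text into overlapping character windows.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

Each chunk would then be embedded and stored in Qdrant alongside metadata (source PDF, dataset, page) so query results can be traced back to the original document.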

Tech Stack

Python, FastAPI, Qdrant, Ollama, sentence-transformers (all-MiniLM-L6-v2), pypdf, Docker Compose, SHA256 integrity verification, 56-test suite.

What I Learned

The hardest part of RAG at document scale isn't the AI — it's the pipeline before it. Getting 4,000+ government PDFs reliably downloaded, verified, extracted, chunked, and embedded requires serious data engineering discipline. The AI layer only works if every upstream stage is solid.

AI Usage Trust Paradox

Exploring the gap between AI adoption and trust

Built to make a compelling data point impossible to ignore — three-quarters of us use AI, but two-thirds don't trust it.

A focused presentation of the paradox that roughly three-quarters of people use AI while only about one-third trust it — designed to highlight that gap in a visually engaging and easy-to-understand format.

Data Visualization Presentation Design Research

Updated February 2026

Architecture

Single-page narrative presentation with animated data visualizations. Designed to walk a visitor through the paradox in a way that builds understanding progressively.

Tech Stack

HTML/CSS/JS, CSS animations, responsive data presentation, scroll-driven narrative.

What I Learned

Presenting data persuasively is a design problem, not a data problem. The same statistic can be forgettable or unforgettable depending on how the visitor encounters it.

Story Structure Explorer

Storyline structures with deep theory and the ability to save your work

Built to test whether agentic generation could produce constrained creative output — dialogue within defined story structures and guardrails.

A tool that began as a story arc evaluator and now supports exploration of archetypal beats, worldbuilding refinement, and agentic generation of dialogue between characters and narrator. Uses a bring-your-own API key model.

Agentic API Integration Client-Side Storage BYOK

Updated March 2026

Architecture

Client-side application with structured story models driving agentic generation. Story data persists in localStorage. API calls are made directly from the browser using the visitor's own key — no backend required.

Tech Stack

HTML/CSS/JS, LLM API integration (BYOK), localStorage persistence, agent swarm orchestration in the browser.

What I Learned

Agentic generation needs structure more than freedom. The best results came from tightly defined story models with clear guardrails, not from giving the model open-ended prompts.

More projects are in progress. This is a living collection — each project grows as I learn, and new ones get added as new questions come up.

How I Build

AI-Assisted Development

Claude is my primary coding platform. I use AI daily as part of a structured engineering process — design, prototyping, coding, testing, refinement, and content generation.

Agentic Workflows

Several of these projects use agent-based patterns — coordinated AI processes that work within defined structures and guardrails rather than open-ended generation.

Structured Pipelines

Data flows through clear stages: extraction, processing, transformation, presentation. Each stage is designed to be understandable and independently inspectable.

Local Infrastructure

I run local AI infrastructure alongside hosted services, including local models running on my own hardware, which gives me flexibility across the stack.

What Ties These Together

Every project here follows the same pattern: start with real data, build a visible pipeline, and produce something usable at the other end.

I don't hide complexity behind abstractions — I make it understandable. Pipelines, processes, and logic are exposed so you can see how things work, not just what they produce. The goal is always clarity: accessible without being dumbed down.

These aren't demo projects or isolated experiments. They're working systems I use myself, built with the same engineering discipline I'd apply to production work. The fact that they're also useful to other people is the point — that's how I know they actually work.

About

I'm John Boen — a senior data and AI systems engineer with 30 years of experience in database engineering, data engineering, analytics, and production data systems.

Over the past two-plus years, I've focused heavily on applied AI — turning a long foundation in data pipelines, system architecture, and reliability engineering into practical AI systems work. That means agentic workflows, AI-assisted development, local and hosted model integration, tool-connected systems, and data-driven applications.

My engineering philosophy is straightforward: build structured systems with visible pipelines that produce usable outputs. Use AI as a daily tool, not an occasional experiment. Make the hard parts accessible without hiding them.

These projects are where that philosophy meets real questions I wanted to answer.