John Boen

Data & AI Systems Engineer

These projects begin as tools I build for my own use: to learn, to test ideas, and to solve real problems in a practical way. When they become useful, I publish them so others can use them too. What you see here is not a collection of demos or portfolio exercises, but a public view of an active engineering practice.

Featured Projects

I2I — Idea to Implementation

A structured, prompt-driven SDLC workflow from idea through phased delivery

Built to codify the development process I actually use — a repeatable sequence of prompts that moves a software idea through design, PDR, planning, and phased implementation.

A static site that explains and demonstrates a prompt-driven software development lifecycle. Covers eight commands (draft-user, draft-pdr, draft-plan, gen-pdr, gen-plan, finalize, expand, implement) that move a project from user requirements through product design review, phased release plan, and per-phase execution. Serves as both workflow documentation and a portfolio piece for AI-assisted engineering practice.

SDLC AI Workflow Astro Process Design

Last pushed May 21, 2026

Architecture

Astro-based static site with structured content pages for each workflow stage and command. Includes getting-started guides, artifact examples from generated documents, and case studies showing the workflow applied to real projects.

Tech Stack

Astro 5.8, @astrojs/sitemap, sharp for image optimization, responsive design, GitHub Pages with custom domain.

What I Learned

The most valuable thing about a structured SDLC isn't the documents it produces — it's the forcing function it creates. Each stage surfaces questions you'd otherwise discover too late. Codifying the process made it teachable, which made it better.

Live Site ↗ Repository ↗

SOC 2 Readiness

An educational guide to SOC 2 readiness for technology company leaders

Built to make SOC 2 preparation understandable for founders and CTOs who need compliance to close enterprise deals — without hiding behind jargon or selling fear.

A static site that educates technology company leaders about SOC 2 readiness. Covers 12 control domains with evidence examples, readiness process timelines, service packages, and AI & data advisory modules. Designed for founders, CTOs, and security leads at SaaS, AI, data, and developer-tooling companies navigating their first SOC 2 engagement.

Compliance Security Astro Education

Last pushed May 22, 2026

Architecture

Astro-based static site with structured content pages for each of the 12 control domains (logical access, MFA, change management, logging, vulnerability management, incident response, vendor management, asset inventory, encryption, backup, data handling, privacy). Includes a RACI matrix, service boundary disclaimers, and tool evaluations.

Tech Stack

Astro 5.8, @astrojs/sitemap, sharp, dark-slate design system, Inter font, GitHub Pages with custom domain.

What I Learned

Compliance education works best when it's structured like engineering documentation — clear scope, concrete examples, and explicit boundaries about what's included and what isn't. The hardest part was being precise about where readiness consulting ends and audit work begins.

Live Site ↗ Repository ↗

HAx — Human Advantage Experiments

Actionable experiments extracted from TED Talks, organized by evidence and persona

Built to see if curated talk content could be transformed into a structured, searchable collection of practical experiments — with evidence levels, not just inspiration.

A discovery site that curates TED Talks into four clusters (Body, Cognition, Environment, Social), extracts actionable experiments, and tags each with evidence levels from High to Narrative/Conceptual. Features persona-guided navigation, Pagefind search, localStorage experiment saving, printable cards, and shareable permalinks. Non-commercial — uses TED embeds and outbound links, not rehosted media.

Curation Search Astro Evidence-Based Accessibility

Last pushed May 22, 2026

Architecture

Astro site with Preact interactive components and Zod schema validation at build time. Content organized by clusters, talks, and experiments with cross-referencing. Pagefind provides static search indexing. Five personas guide navigation paths for different user types (skeptical knowledge worker, self-optimizing student, team lead, coach/educator, wellness-curious generalist).

Tech Stack

Astro 6.3, Preact, TypeScript, Zod, Pagefind, Playwright + axe-core for E2E and accessibility testing, Cloudflare Pages, WCAG 2.1 AA compliance.

What I Learned

Evidence tagging is the hardest editorial decision in the pipeline. The difference between "High" and "Moderate" evidence changes whether an experiment feels credible or aspirational — and getting it wrong undermines the whole premise of the site. Persona-based navigation also turned out to be more useful than category browsing alone.

Live Site ↗ Repository ↗

Lessons Hub

A searchable, AI-powered library harvesting lessons learned from across all projects

Built to consolidate scattered lessons-learned documents from multiple repos into one place — searchable, browsable, and queryable by AI.

An Astro-based static site and Python pipeline that harvests markdown lessons from multiple GitHub repositories into a unified library. Features full-text search via Pagefind, a RAG-powered chatbot with citations, browser-native text-to-speech with paragraph highlighting, and automatic gap detection. Currently indexes 132 lessons from 4 repositories across 74 tags.

RAG Search Astro Multi-Repo Text-to-Speech

Last pushed May 11, 2026

Architecture

Python harvester clones source repositories and extracts markdown lessons from docs/lessons/ directories. Astro generates the static site with Pagefind for client-side full-text search. RAG chatbot uses FastAPI with configurable LLM backends (AWS Bedrock, Azure OpenAI, Vertex AI) to answer questions with citations to specific lessons.
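
The harvesting step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual harvester: it assumes the source repositories are already cloned side by side under one local directory, and the function name `harvest_lessons` and the record fields are hypothetical.

```python
from pathlib import Path


def harvest_lessons(clones_root):
    """Collect markdown lessons from docs/lessons/ in each cloned repo.

    Assumes every immediate subdirectory of clones_root is one cloned
    repository (a hypothetical layout chosen for this sketch).
    """
    lessons = []
    for repo_dir in sorted(Path(clones_root).iterdir()):
        lessons_dir = repo_dir / "docs" / "lessons"
        if not lessons_dir.is_dir():
            continue  # this repo has no lessons directory
        for md_file in sorted(lessons_dir.glob("*.md")):
            text = md_file.read_text(encoding="utf-8")
            # Use the first markdown heading as the title, else the file name.
            title = next(
                (line.lstrip("# ").strip()
                 for line in text.splitlines() if line.startswith("#")),
                md_file.stem,
            )
            lessons.append({
                "repo": repo_dir.name,
                "path": md_file.relative_to(repo_dir).as_posix(),
                "title": title,
                "body": text,
            })
    return lessons
```

Keeping the repository name and relative path on every record is what would let a search result or chatbot citation point back to a specific lesson.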

Tech Stack

Python, Astro, Pagefind, FastAPI, Playwright (E2E tests), Pytest, Ruff, Web Speech API, GitHub Actions CI/CD, GitHub Pages with custom domain.

What I Learned

The value of lessons learned multiplies when they're aggregated. Individual project lessons are useful in context, but cross-project search reveals patterns you'd never notice in isolation. Gap detection — finding what the corpus can't answer — turned out to be as valuable as the search itself.

Live Site ↗ Repository ↗

Artemis

Calendar optimization from 12,000 space mission photos using ML and voter preferences

Built to see if statistical modeling and ML could solve a real collection optimization problem — selecting 13 images from 12,000 that balance preferences, diversity, and coverage.

A data science platform that ingests ~12,000 NASA Artemis II photos, extracts CLIP embeddings, clusters for visual diversity, scores via Elo and Borda voting, and uses Hungarian algorithm optimization to produce the best 13-image calendar. Includes an interactive web dashboard for browsing, comparison, and cluster exploration.

Machine Learning Optimization Data Science CLIP DuckDB

Last pushed May 10, 2026

Architecture

Layered data warehouse (raw, staging, core, feature store, modeling, optimization, marts, reports) in DuckDB. CLIP-based multimodal embeddings feed k-means clustering for visual diversity. Beta-Binomial posteriors, Elo ratings, and Borda scoring drive preference modeling. Hungarian algorithm assigns optimal images to calendar months.
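
Two of the preference signals named above, Elo ratings and Borda scoring, can be sketched in a few lines. This is an illustrative sketch only: the K-factor of 32 and the 400-point scale are the conventional Elo defaults and are assumed here, and the function names and image identifiers are hypothetical.

```python
def elo_update(rating_winner, rating_loser, k=32.0):
    """Return updated (winner, loser) ratings after one pairwise vote."""
    # Probability the eventual winner was expected to win before the vote.
    expected_win = 1.0 / (1.0 + 10 ** ((rating_loser - rating_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return rating_winner + delta, rating_loser - delta


def borda_scores(rankings):
    """Sum Borda points over ranked ballots (best candidate first).

    With n candidates on a ballot, first place earns n - 1 points and
    last place earns 0.
    """
    scores = {}
    for ballot in rankings:
        n = len(ballot)
        for position, candidate in enumerate(ballot):
            scores[candidate] = scores.get(candidate, 0) + (n - 1 - position)
    return scores


# Two equally rated images: the winner gains exactly what the loser drops.
print(elo_update(1500.0, 1500.0))
print(borda_scores([["img_a", "img_b", "img_c"], ["img_b", "img_a", "img_c"]]))
```

Elo consumes head-to-head votes while Borda consumes full rankings, so the two can come from different voting interfaces and still be combined downstream.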

Tech Stack

Python, DuckDB, CLIP (sentence-transformers), scikit-learn, FastAPI backend, vanilla JavaScript SPA frontend, Pillow for image processing and PDF rendering.

What I Learned

Collection optimization is harder than ranking. Selecting 13 images that are individually strong and collectively diverse requires treating it as a constrained optimization problem, not a sorting problem. The combination of voter preferences, visual embeddings, and temporal suitability made this a genuine data science challenge.
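
The gap between sorting and selecting can be shown with a toy example. The sketch below uses a simple greedy heuristic with a per-cluster cap, which is a stand-in chosen for brevity and not the Hungarian-based optimization the project uses. The scores, cluster ids, and function names are made up for illustration.

```python
def pick_top_k(images, k):
    """Plain ranking: take the k highest-scoring images."""
    return sorted(images, key=lambda im: im["score"], reverse=True)[:k]


def pick_diverse(images, k, max_per_cluster=1):
    """Greedy selection: highest score first, capped per visual cluster."""
    chosen, per_cluster = [], {}
    for im in sorted(images, key=lambda im: im["score"], reverse=True):
        used = per_cluster.get(im["cluster"], 0)
        if used >= max_per_cluster:
            continue  # this cluster is already represented enough
        chosen.append(im)
        per_cluster[im["cluster"]] = used + 1
        if len(chosen) == k:
            break
    return chosen


# Hypothetical sample: the two best-scoring images are near-duplicates.
images = [
    {"id": "a", "score": 0.95, "cluster": 0},
    {"id": "b", "score": 0.93, "cluster": 0},
    {"id": "c", "score": 0.80, "cluster": 1},
    {"id": "d", "score": 0.70, "cluster": 2},
]
print([im["id"] for im in pick_top_k(images, 2)])
print([im["id"] for im in pick_diverse(images, 2)])
```

Sorting returns the two look-alike images from cluster 0, while the capped selection trades a little score for a second cluster. A greedy pass like this gives no optimality guarantee, which is one reason to treat the real problem as constrained optimization.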

Live Site ↗ Repository ↗

JobClass

Labor market analysis with twenty-one lessons learned, a DuckDB warehouse, and pipeline visualization

Built to understand how government job data is organized and what patterns emerge when you process it through a real pipeline.

A data pipeline that ingests SOC, OEWS, O*NET, and BLS Projections into a 57+ table DuckDB warehouse across four layers (raw, staging, core, marts). The live site includes 21 lessons learned, architecture discussions, a Pipeline Explorer visualization, and CPI domain analysis. Backed by 840+ tests.

Pipelines Analytics Data Warehouse ETL DuckDB

Last pushed May 9, 2026

Architecture

Four-tier DuckDB warehouse (raw, staging, core, marts) with 57+ tables fed by 9 pipeline workflows. Includes a canvas-based Pipeline Explorer for visualizing data flow across all stages. FastAPI web app with search, hierarchy, trends, and CPI analysis.
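
A four-layer flow of this kind can be sketched as below. To stay runnable with only the standard library, the sketch uses Python's built-in sqlite3 as a stand-in for DuckDB. The table names, columns, sample rows, and the '*' marker for a suppressed value are all assumptions made for illustration, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a DuckDB database
conn.executescript("""
    -- raw: data exactly as received, everything stored as text
    CREATE TABLE raw_wages (soc_code TEXT, title TEXT, annual_wage TEXT);
    INSERT INTO raw_wages VALUES
        ('00-0001', ' Example Occupation A ', '100000'),
        ('00-0001', ' Example Occupation A ', '100000'),
        ('00-0002', 'Example Occupation B', '*');

    -- staging: trimmed, typed, de-duplicated; suppressed values become NULL
    CREATE TABLE staging_wages AS
    SELECT DISTINCT soc_code,
           TRIM(title) AS title,
           CASE WHEN annual_wage = '*' THEN NULL
                ELSE CAST(annual_wage AS INTEGER) END AS annual_wage
    FROM raw_wages;

    -- core: conformed entities keyed by occupation code
    CREATE TABLE core_occupation AS
    SELECT soc_code, title FROM staging_wages;

    -- marts: query-ready output for the presentation layer
    CREATE TABLE mart_wage_by_occupation AS
    SELECT o.soc_code, o.title, s.annual_wage
    FROM core_occupation AS o
    JOIN staging_wages AS s ON o.soc_code = s.soc_code
    WHERE s.annual_wage IS NOT NULL;
""")
rows = conn.execute("SELECT * FROM mart_wage_by_occupation").fetchall()
print(rows)
```

Because each layer is a real table, any stage can be queried on its own when a number in the marts looks wrong.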

Tech Stack

Python, DuckDB, FastAPI, HTML/CSS/JS presentation, pipeline orchestration, 840+ tests, GitHub Pages publishing with custom domain.

What I Learned

Government data is messy in predictable ways. The twenty-one lessons learned on the site capture the practical patterns: what breaks, what scales, and what to normalize early.

Live Site ↗ Repository ↗

Certification

Scenario-based certification study across providers and exams

Built because I needed a structured study system I could actually use — not flashcards, but scenario-driven practice at exam depth.

A study site featuring 2,500 scenario-based questions covering 50 exams across ten providers (AWS, Azure, GCP, Anthropic, GitHub, Databricks, NVIDIA, Cisco, CompTIA, ISC2). Recently migrated to a JSON data layer with schema validation, redesigned with the Atlas design system, and expanded with a 28-lesson lessons section. Azure exams carry lifecycle badges as vendors retire and replace them.

Content Systems Structured Data AI-Generated Content JSON

Last pushed May 8, 2026

Architecture

JSON-backed question banks with schema validation, organized by provider, exam, and domain. A shared quiz engine serves all providers with progress tracking via localStorage. Atlas design system provides consistent styling across all pages.
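
Schema validation for a question record might look something like the sketch below. The field names and rules are hypothetical, chosen only to illustrate the idea, and the check is hand-rolled rather than based on any particular schema library.

```python
import json

# Hypothetical required fields and their expected Python types.
REQUIRED_FIELDS = {
    "id": str,
    "provider": str,
    "exam": str,
    "domain": str,
    "scenario": str,
    "options": list,
    "answer_index": int,
}


def validate_question(record):
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    if not errors:
        # Cross-field checks only make sense once the basic shape is right.
        if len(record["options"]) < 2:
            errors.append("need at least two options")
        elif not 0 <= record["answer_index"] < len(record["options"]):
            errors.append("answer_index out of range")
    return errors


sample = json.loads("""{
    "id": "q-001", "provider": "example", "exam": "EX-100",
    "domain": "networking", "scenario": "A team needs ...",
    "options": ["A", "B", "C", "D"], "answer_index": 2
}""")
print(validate_question(sample))
```

Returning every problem at once, instead of raising on the first, suits a large generated question bank where many records may need fixing in one pass.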

Tech Stack

HTML/CSS/JS, JSON data layer with schema validation, localStorage persistence, AI-assisted content generation pipeline, Atlas design system.

What I Learned

Generating high-quality exam scenarios at scale requires structured prompts and multi-pass validation. The hardest part is ensuring questions test understanding, not memorization.

Live Site ↗ Repository ↗

MD Reader

A browser-based markdown reader with text-to-speech, editing, and playlist navigation

Built to streamline a daily workflow — read and edit AI responses while the next one generates, review written work, and save clean output without switching tools.

A lightweight markdown viewer with integrated text-to-speech, live editing with preview, folder-based playlist navigation, Paste & Play, and one-click download. Pure static site: no frameworks, no build step, no backend, just a link to a static website published via GitHub.

Productivity Text-to-Speech Markdown Browser Tool

Last pushed April 16, 2026

Architecture

Browser-based with Marked.js for full GitHub Flavored Markdown rendering, Highlight.js for syntax-highlighted code blocks, and Web Speech API with smart chunking to handle Chrome's TTS limitations. Modular IIFE architecture in a single namespace with load-order-dependent modules.

Tech Stack

HTML/CSS/JS, Marked.js, Highlight.js, Web Speech API, File System Access API, CDN-loaded libraries only — no build tools or bundlers.

What I Learned

The Web Speech API is surprisingly capable but full of browser-specific edge cases: Chrome's 15-second silence cutoff, voice availability differences, and rate persistence all needed workarounds. Eventually I settled on something easier and made my own tool to edit while listening. It became more useful than I expected, and I now use it several times a day.

Live Site ↗ Repository ↗

Epstein DOJ Disclosures

AI-powered search over 4,000+ publicly released DOJ documents

Built to see if RAG and local AI infrastructure could make a large public document release genuinely searchable — not just downloadable, but answerable.

A Retrieval-Augmented Generation system that collects, processes, and indexes 4,049 PDF documents from the DOJ's Epstein disclosure releases. Combines vector-based semantic search with local LLM inference to enable natural-language queries across the full corpus.

RAG Vector Search Docker Local AI Pipelines Document Processing

Last pushed March 27, 2026

Architecture

Three-container Docker stack: FastAPI application, Qdrant vector database, and Ollama for local GPU-accelerated LLM inference. Data pipeline downloads 5 ZIP releases (~2.8 GB) from justice.gov, extracts and organizes PDFs by dataset, then processes them through text extraction, chunking, and embedding for vector storage.
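
The chunking stage can be illustrated with a simple overlapping window. This is a sketch under assumptions: the chunk size, the overlap, and the choice to split on characters (as opposed to tokens or sentences) are illustrative defaults, not the project's settings.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means a sentence that straddles a boundary appears in both neighboring chunks, so a query matching it can still retrieve usable context.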

Tech Stack

Python, FastAPI, Qdrant, Ollama, sentence-transformers (all-MiniLM-L6-v2), pypdf, Docker Compose, SHA256 integrity verification, 56-test suite.

What I Learned

The hardest part of RAG at document scale isn't the AI — it's the pipeline before it. Getting 4,000+ government PDFs reliably downloaded, verified, extracted, chunked, and embedded requires serious data engineering discipline. The AI layer only works if every upstream stage is solid.
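
The verification step can be sketched with the standard library's hashlib. The function names are hypothetical, and the sketch assumes an expected SHA-256 hex digest is already recorded for each downloaded file.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Hash a file in blocks so large archives never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True when the file's digest matches the recorded one."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Failing fast here keeps a truncated or corrupted archive from flowing silently into extraction and embedding.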

Doc Link ↗ Repository ↗

AI Usage Trust Paradox

Exploring the gap between AI adoption and trust

Built to make a compelling data point impossible to ignore — three-quarters of us use AI, but two-thirds don't trust it.

A focused presentation of the paradox that roughly three-quarters of people use AI while only about one-third trust it — designed to highlight that gap in a visually engaging and easy-to-understand format.

Data Visualization Presentation Design Research

Last pushed March 27, 2026

Architecture

Single-page narrative presentation with animated data visualizations. Designed to walk a visitor through the paradox in a way that builds understanding progressively.

Tech Stack

HTML/CSS/JS, CSS animations, responsive data presentation, scroll-driven narrative.

What I Learned

Presenting data persuasively is a design problem, not a data problem. The same statistic can be forgettable or unforgettable depending on how the visitor encounters it.

Live Site ↗ Repository ↗

Story Structure Explorer

Storyline structures with deep theory and the ability to save your work

Built to test whether agentic generation could produce constrained creative output — dialogue within defined story structures and guardrails.

A tool that began as a story arc evaluator and now supports exploration of archetypal beats, worldbuilding refinement, and agentic generation of dialogue between characters and narrator. Uses a bring-your-own API key model.

Agentic API Integration Client-Side Storage BYOK

Last pushed March 14, 2026

Architecture

Client-side application with structured story models driving agentic generation. Story data persists in localStorage. API calls are made directly from the browser using the visitor's own key — no backend required.

Tech Stack

HTML/CSS/JS, LLM API integration (BYOK), localStorage persistence, agent swarm orchestration in the browser.

What I Learned

Agentic generation needs structure more than freedom. The best results came from tightly defined story models with clear guardrails, not from giving the model open-ended prompts.

Live Site ↗ Repository ↗

More projects are in progress. This is a living collection — each project grows as I learn, and new ones get added as new questions come up.

Recent Lessons

The latest lessons learned across all projects — harvested from Lessons Hub.

View All 132 Lessons ↗

How I Build

AI-Assisted Development

Claude is my primary coding platform. I use AI daily as part of a structured engineering process — design, prototyping, coding, testing, refinement, and content generation.

Agentic Workflows

Several of these projects use agent-based patterns — coordinated AI processes that work within defined structures and guardrails rather than open-ended generation.

Structured Pipelines

Data flows through clear stages: extraction, processing, transformation, presentation. Each stage is designed to be understandable and independently inspectable.
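
One way to make stages independently inspectable is to keep every intermediate result. The sketch below is illustrative only: the runner name `run_pipeline` and the toy stage functions are hypothetical, and only the four stage names come from the description above.

```python
def run_pipeline(source, stages):
    """Run named stages in order, keeping every intermediate result."""
    trace = {}
    data = source
    for name, stage in stages:
        data = stage(data)
        trace[name] = data  # inspectable snapshot of this stage's output
    return data, trace


# Toy stages standing in for real extraction and transformation logic.
stages = [
    ("extraction", lambda raw: raw.splitlines()),
    ("processing", lambda lines: [ln.strip() for ln in lines if ln.strip()]),
    ("transformation", lambda rows: [row.upper() for row in rows]),
    ("presentation", lambda rows: "\n".join(f"- {row}" for row in rows)),
]
result, trace = run_pipeline(" alpha \n\n beta ", stages)
print(result)
print(trace["processing"])
```

With the trace in hand, a surprising final output can be traced back to the first stage whose snapshot looks wrong.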

Local Infrastructure

I run local AI infrastructure alongside hosted services, including local models running on my own hardware, which gives me flexibility across the stack.

What Ties These Together

Every project here follows the same pattern: start with real data, build a visible pipeline, and produce something usable at the other end.

I don't hide complexity behind abstractions — I make it understandable. Pipelines, processes, and logic are exposed so you can see how things work, not just what they produce. The goal is always clarity: accessible without being dumbed down.

These aren't demo projects or isolated experiments. They're working systems I use myself, built with the same engineering discipline I'd apply to production work. The fact that they're also useful to other people is the point — that's how I know they actually work.

About

I'm John Boen — a senior data and AI systems engineer with 30 years of experience in database engineering, data engineering, analytics, and production data systems.

Over the past two-plus years, I've focused heavily on applied AI — turning a long foundation in data pipelines, system architecture, and reliability engineering into practical AI systems work. That means agentic workflows, AI-assisted development, local and hosted model integration, tool-connected systems, and data-driven applications.

My engineering philosophy is straightforward: build structured systems with visible pipelines that produce usable outputs. Use AI as a daily tool, not an occasional experiment. Make the hard parts accessible without hiding them.

These projects are where that philosophy meets real questions I wanted to answer.

LinkedIn GitHub