Applied AI Builder · Quantitative Analyst · San Francisco Bay Area

I build AI-powered tools, agents, and workflow automations — validated with the same rigor I bring to risk models.

I work at the intersection of quantitative analytics, model validation, and applied AI — turning repetitive analytical and business workflows into structured, reproducible, human-reviewed systems.

View selected work · Résumé · GitHub · LinkedIn

Track record: CoStar Risk Analytics · Guidehouse Federal Analytics · Federal Reserve Board · Penn MAS-CS

Selected work

Applied AI and quantitative systems, built end to end

A sample of the analytical and applied-AI work I build — retrieval-grounded evaluation agents, generative content pipelines, and quantitative research tooling — each one reproducible, tested, and reviewed by a human before it ships.

AI Evaluation

Agentic AI Evaluation Platform

Reviews monitoring anomalies, retrieves supporting evidence, and routes uncertain findings to a human.

Screenshot: Streamlit Case Review page showing the evidence available to the agent, the agent's structured finding, the deterministic baseline and validation, and a human-review escalation status for one monitoring case.

An analyst-facing review agent that retrieves evidence, produces a structured finding, and cites exactly what supports it. A distinct reviewer agent checks the work, deterministic validation catches what the model might miss, and an explicit escalation policy decides when a person needs to look.
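To make the review flow above concrete, here is a minimal Python sketch of how deterministic validation and an escalation policy might fit together. It is an illustration, not the platform's code: the `Finding` fields, the `validate` and `needs_human_review` helpers, and the 0.8 confidence threshold are all hypothetical, and a plain dataclass stands in for whatever Pydantic models the project defines.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical structured finding produced by the review agent."""
    case_id: str
    verdict: str                      # e.g. "anomaly" or "benign" (assumed labels)
    confidence: float                 # model-reported confidence in [0, 1]
    cited_evidence_ids: list[str] = field(default_factory=list)


def validate(finding: Finding, available_evidence_ids: set[str]) -> list[str]:
    """Deterministic checks that do not depend on the model being right."""
    problems = []
    if not 0.0 <= finding.confidence <= 1.0:
        problems.append("confidence outside [0, 1]")
    if not finding.cited_evidence_ids:
        problems.append("no evidence cited")
    # Every citation must point at evidence the agent was actually shown.
    unknown = [e for e in finding.cited_evidence_ids
               if e not in available_evidence_ids]
    if unknown:
        problems.append(f"cites unknown evidence: {unknown}")
    return problems


def needs_human_review(finding: Finding, problems: list[str],
                       baseline_verdict: str,
                       min_confidence: float = 0.8) -> bool:
    """Escalate on any validation problem, a disagreement with the
    deterministic baseline, or low confidence (threshold is an assumption)."""
    return (bool(problems)
            or finding.verdict != baseline_verdict
            or finding.confidence < min_confidence)
```

The design choice in this sketch is that escalation is decided by explicit rules outside the model, so a confident but poorly cited finding still reaches a person.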

Python · Pydantic · Anthropic SDK · Streamlit

View Code · View Output · Read Case Study

Content Generation

CardNews AI

Turns a topic into structured, schema-validated slides — rendered and ready for review.

Image: 5-by-2 thumbnail grid of ten rendered card-news slides on the topic "Microplastics Are Everywhere", in an editorial layout with serif headlines and numeral watermarks.

Given a topic, Claude drafts ten slides of structured content, which is validated against a strict schema before anything renders. A deterministic pipeline then turns the validated JSON into a reviewable PNG deck, and nothing publishes without a human editing pass.
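The validate-before-render step can be sketched as follows. The project itself is built on Node.js; this Python version is only an illustration, and the field names (`slides`, `headline`, `body`), the headline length cap, and the `validate_deck` helper are assumptions rather than the project's real schema. Only the ten-slide count comes from the description above.

```python
import json

REQUIRED_SLIDES = 10        # the description above specifies ten slides
MAX_HEADLINE_CHARS = 60     # assumed limit, for illustration only


def validate_deck(raw_json: str) -> list[str]:
    """Return a list of schema errors; an empty list means the deck may render."""
    try:
        deck = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    slides = deck.get("slides") if isinstance(deck, dict) else None
    if not isinstance(slides, list):
        return ["'slides' must be a list"]

    errors = []
    if len(slides) != REQUIRED_SLIDES:
        errors.append(f"expected {REQUIRED_SLIDES} slides, got {len(slides)}")

    for i, slide in enumerate(slides, start=1):
        if not isinstance(slide, dict):
            errors.append(f"slide {i}: must be an object")
            continue
        # Each slide needs non-empty text in both required fields.
        for key in ("headline", "body"):
            value = slide.get(key)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"slide {i}: missing or empty '{key}'")
        headline = slide.get("headline")
        if isinstance(headline, str) and len(headline) > MAX_HEADLINE_CHARS:
            errors.append(f"slide {i}: headline exceeds {MAX_HEADLINE_CHARS} chars")
    return errors
```

In a pipeline like the one described, the renderer would run only when this function returns an empty list; otherwise the errors could be fed back for a redraft.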

Node.js · Claude API · Puppeteer

View Code · View Output · Read Case Study

Also building: model validation, stress testing, and experimentation tooling — selected quantitative work below ↓

What I do

Three angles on one question: can this output be trusted?

The same validation discipline runs through everything I work on — whether the output comes from a credit model, an experiment, or an LLM.

AI & LLM Evaluation

Measuring whether LLM and agent workflows stay grounded, calibrated, and safe enough for a person to act on.

RAG evaluation
LLM / agent eval
confidence calibration
grounding checks
human-in-the-loop
failure-mode analysis

Risk & Model Validation

Testing model outputs against scenarios and stress conditions before they inform a decision.

model validation
stress testing
scenario analysis
sensitivity analysis
model monitoring
credit-risk analytics

Experimentation & Metrics

Designing experiments and metrics that separate real effects from noise and flag guardrail risk.

A/B testing
power / MDE
CUPED
SRM checks
metric design
segmentation

The path

From policy analytics to model risk to AI evaluation

Each role added a piece of the same skill: making analytical outputs traceable, testable, and safe to decide on.

2023 · Foundation

Federal Reserve Board

High-stakes policy & macro-financial analytics

NLP / doc analysis · Python automation · research tooling

Learned how analytical work holds up when it feeds policy — where being wrong is expensive and evidence has to be traceable.

2024–25 · Applied AI

Guidehouse

Federal analytics, RAG & LLM workflows

RAG evaluation · LangChain / LangGraph · failure-mode docs

Started evaluating AI systems directly — measuring retrieval quality and documenting where LLM workflows break.

2025–now · Model risk

CoStar Group

Credit-risk model validation & monitoring

stress testing · model QA · release readiness

Own the discipline of validation — QA, monitoring, and stress-testing model outputs before they ship.

Direction

AI Evaluation & Risk

Decision reliability for AI systems

AI evaluation · model risk · trust & safety

The same validation mindset, applied to whether AI systems are reliable enough to trust.

Selected quantitative & applied AI work

Projects, framed by what they evaluate

Each one is a small, reproducible study of whether a system does what it claims — with the methods and the verdict made explicit. The applied-AI work above is built on the same discipline.

Experimentation

Product Experimentation & Metrics Analysis

Whether a simulated feed-ranking change improves 7-day retention without moving guardrail metrics the wrong way.

100K users
SRM check
power / MDE
CUPED
segmentation
launch call

Why it matters: Sound product decisions need both a statistical result and judgment about which metrics actually matter.
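Two of the checks named in this project, the SRM check and CUPED, are compact enough to sketch with the standard library. This is a generic textbook formulation under stated assumptions, not the repository's code: the function names, the 50/50 expected split, and the 0.001 alert threshold are all illustrative choices.

```python
import math


def srm_check(n_control: int, n_treatment: int,
              expected_treatment_share: float = 0.5,
              alpha: float = 0.001) -> tuple[float, float, bool]:
    """Sample-ratio-mismatch check: chi-square goodness of fit, 1 degree of freedom.

    Returns (chi2, p_value, flagged). The alpha default is an assumed convention.
    """
    total = n_control + n_treatment
    if total <= 0 or not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("need a positive sample and a share strictly between 0 and 1")
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha


def cuped_adjust(metric: list[float], covariate: list[float]) -> list[float]:
    """CUPED: subtract the part of the metric explained by a pre-experiment covariate.

    adjusted_i = y_i - theta * (x_i - mean(x)), with theta = cov(x, y) / var(x).
    The mean of the metric is preserved; its variance can only shrink.
    """
    n = len(metric)
    mean_x = sum(covariate) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    theta = cov_xy / var_x if var_x else 0.0
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
```

In practice the SRM check would typically run first, since a flagged allocation imbalance undermines any downstream effect estimate, CUPED-adjusted or not.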

View on GitHub

Risk Analytics

CRE Stress Testing Workflow

How macro and market stress scenarios move commercial-real-estate risk indicators, using only public data.

scenario analysis
stress testing
forecasting
model monitoring
reproducible pipeline

Why it matters: A risk model is only useful if its outputs can be tested across changing economic conditions.

View on GitHub

Macro Forecasting

R Macro Trade & Commodity Forecast

Macro, trade, and commodity indicators built from FRED data in a reproducible R pipeline.

FRED
ARIMA
distributed-lag regression
Quarto
GitHub Actions

Why it matters: Connects the macro-policy side of my background to versioned, reproducible analytical tooling.

View on GitHub

Workflow & Documentation

LLM Research Workflow Assistant

Reusable prompt templates and human-in-the-loop checklists that package recurring research-support tasks — data QA, code review, brief review, documentation — into a consistent, reviewable workflow.

prompt templates
worked examples
human-in-the-loop
responsible-use docs

Why it matters: Shows the same building discipline on a lighter workflow: packaging AI assistance so it stays consistent and reviewable instead of ad hoc.

View on GitHub

Federal Reserve Board

Policy environment

Macro and regulatory analysis where conclusions inform policy, so every step has to be traceable back to its evidence.

Guidehouse

Federal analytics

Multi-source financial data and early LLM / RAG evaluation, under compliance constraints and formal documentation.

CoStar Group

Private-sector risk

Credit-risk models under stress testing, monitoring, and release review before their outputs reach a decision.

Toolbox

Methods and tools I work with

Languages

Python
SQL
R
JavaScript / Node.js

AI / LLM Evaluation

Claude API
RAG evaluation
LLM evaluation
agent workflows
prompt experimentation
human-in-the-loop review
retrieval-quality analysis

Analytics & Experimentation

A/B testing
CUPED
SRM checks
power / MDE
metric design
forecasting

Risk & Model Validation

stress testing
sensitivity analysis
model monitoring
credit-risk analytics

Engineering

Git
GitHub Actions
Streamlit
Pydantic
Puppeteer
Quarto
pytest
SQLAlchemy

Visualization

Tableau
Power BI
Streamlit dashboards
Quarto

Get in touch

Focused on Bay Area roles in applied AI, workflow automation, and model risk

I'm relocating to the San Francisco Bay Area and actively looking for teams building practical AI tools and automations — with the same rigor for validation, reproducibility, and human review that I bring to risk models. If that's you, let's talk.

Email me · GitHub · LinkedIn · Résumé (PDF)

Based in Boston · relocating to the San Francisco Bay Area