AI Security — ObfusLabs

Date: Sep 15, 2025 • Foundations finished; security experiments ongoing

Neural network diagram with secured paths (locks)

Summary

We completed CS50’s Introduction to Artificial Intelligence with Python and are applying that foundation to practical AI security: prompt injection & jailbreaks, data poisoning, adversarial examples, and safe deployment patterns. Our stance is simple: treat models like production software with an attack surface — then try to break them.

Immediate surface (now → 2–5 years)

Prompt injection & jailbreaks: untrusted content steering model behavior; indirect prompt attacks via tools/RAG.
Data leakage via embeddings & fine-tuning: sensitive info exfiltrated from vector stores or tuned models.
Data poisoning & model supply chain: tampered training sets, malicious weights, dependency risks.
Adversarial examples: tiny perturbations that flip labels; physical attacks (stickers/patches) against vision models.
Operational controls: auth to model endpoints, rate-limits, logging/review, sandboxed tool use, kill-switches.

What we test: can untrusted inputs hijack the model or tools? can we exfiltrate secrets from embeddings? can small perturbations flip a decision?

Near-term autonomy risk (5–15 years)

The “killer robot” concern isn’t sci-fi — the software logic is trivial; hardware cost is the bottleneck. As autonomous drones/UGVs become cheaper, the insider-threat/misuse problem turns into a policy and security emergency. Our focus: control — authorization, geofencing, remote shutdown, and preventing model-driven escalation.

Longer-term (and why we still care)

General intelligence and recursive self-improvement are hard to scope operationally today, but our day-job defenses (verification, monitoring, containment, least-privilege, red-teaming) are the same muscles we’ll need if capabilities bend faster.

What we finished

CS50AI core topics: search, logic, probability, optimization, neural networks, reinforcement learning.
All programming assignments and projects are public.

Repo: github.com/exitvillain/harvard-artificial-intelligence

What we’re building next

Minimal prompt-injection test harness for RAG/tool-using agents (attack strings + expected defenses).
Tiny adversarial-example demo (image classifier flip; physical printable patch optional).
Embedding-leak lab: measure semantic drift and secret retrieval risks from vector stores.
Ops hardening checklist for ML endpoints (authN/Z, quotas, content filters, review gates, kill-switches).

A tiny promise

We’ll work to harden deployed models — and if robots start acting less friendly than they should, we plan to know which switch to try first.

← Back to Current Research