Summary
We completed CS50’s Introduction to Artificial Intelligence with Python and are applying that foundation to practical AI security: prompt injection & jailbreaks, data poisoning, adversarial examples, and safe deployment patterns. Our stance is simple: treat models like production software with an attack surface — then try to break them.
Immediate surface (now → 2–5 years)
- Prompt injection & jailbreaks: untrusted content steering model behavior; indirect prompt attacks via tools/RAG.
- Data leakage via embeddings & fine-tuning: sensitive info exfiltrated from vector stores or tuned models.
- Data poisoning & model supply chain: tampered training sets, malicious weights, dependency risks.
- Adversarial examples: tiny perturbations that flip labels; physical attacks (stickers/patches) against vision models.
- Operational controls: auth to model endpoints, rate-limits, logging/review, sandboxed tool use, kill-switches.
Near-term autonomy risk (5–15 years)
The “killer robot” concern isn’t sci-fi — the software logic is trivial; hardware cost is the bottleneck. As autonomous drones/UGVs become cheaper, the insider-threat/misuse problem turns into a policy and security emergency. Our focus: control — authorization, geofencing, remote shutdown, and preventing model-driven escalation.
Longer-term (and why we still care)
General intelligence and recursive self-improvement are hard to scope operationally today, but our day-job defenses (verification, monitoring, containment, least-privilege, red-teaming) are the same muscles we’ll need if capabilities bend faster.
What we finished
- CS50AI core topics: search, logic, probability, optimization, neural networks, reinforcement learning.
- All programming assignments and projects are public.
Repo: github.com/exitvillain/harvard-artificial-intelligence
What we’re building next
- Minimal prompt-injection test harness for RAG/tool-using agents (attack strings + expected defenses).
- Tiny adversarial-example demo (image classifier flip; physical printable patch optional).
- Embedding-leak lab: measure semantic drift and secret retrieval risks from vector stores.
- Ops hardening checklist for ML endpoints (authN/Z, quotas, content filters, review gates, kill-switches).
A tiny promise
We’ll work to harden deployed models — and if robots start acting less friendly than they should, we plan to know which switch to try first.