Ariel N. Lee
AI Researcher & Data Engineer
About Me

I’m an AI researcher working on multimodal foundation models, with applied experience in large-scale multimedia dataset collection, filtering, and post-training. My focus is on efficient model refinement, use-case-dependent benchmarks and validation, data quality, and domain-specific models in both vision and NLP.
TL;DR

Data Provenance Initiative (DPI)
Co-Lead
Large-scale audits of the multimodal datasets powering SOTA AI models


M.Sc., Boston University
Electrical & Computer Engineering
Machine Learning, Data Analytics

B.Sc., University of California, Los Angeles
Microbiology, Immunology, & Molecular Genetics
Download CV
Recent News
January 2025
Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
November 2024
Research presentation for Women in AI & Robotics
October 2024
Research presentation for AI Tinkerers x Human Feedback Foundation
September 2024
The Rapid Decline of the AI Data Commons accepted to NeurIPS 2024, and Improving Diffusion Model Control and Quality accepted to the NeurIPS 2024 Workshop on Compositional Learning
July 2024
Our paper on the decline of the AI data commons featured in the New York Times, 404, Vox, Yahoo! Finance, and Variety
November 2023
Platypus accepted to the NeurIPS 2023 Workshop on Instruction Tuning & Following, and the models collectively surpass 1M downloads on HuggingFace
October 2023
Guest Lecturer @ HKUST, LLMOps with Prof. Sung Kim
Selected Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, … (23 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara (2025)
ICLR 2025
A large-scale audit of data provenance across text, speech, and video datasets, identifying documentation gaps across modalities and proposing practices to bridge them.

Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, … (44 authors), Sara Hooker, Jad Kabbara, Sandy Pentland (2024)
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand how access restrictions on AI training data are evolving.

Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Ariel N. Lee, Cole Hunter, Nataniel Ruiz (aka garage-bAInd)
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) through targeted data refinement; the models were the leading post-trained models at the time of release.

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
Created two new datasets and a data augmentation method for CNNs that simulates ViT patch selectivity, improving robustness to occlusions.
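A minimal, illustrative sketch of the patch-mixing idea (not the paper's exact implementation): replace a random subset of non-overlapping patches in one image with the corresponding patches from another. The patch size and mixing ratio below are illustrative assumptions.

import torch

def patch_mix(img_a: torch.Tensor, img_b: torch.Tensor,
              patch: int = 16, ratio: float = 0.3) -> torch.Tensor:
    # img_a, img_b: (C, H, W) tensors whose H and W are divisible by `patch`.
    c, h, w = img_a.shape
    gh, gw = h // patch, w // patch
    mixed = img_a.clone()
    n_swap = int(ratio * gh * gw)                   # number of patches to replace
    for i in torch.randperm(gh * gw)[:n_swap].tolist():
        row, col = divmod(i, gw)
        ys, xs = row * patch, col * patch
        mixed[:, ys:ys + patch, xs:xs + patch] = img_b[:, ys:ys + patch, xs:xs + patch]
    return mixed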
Projects & Competitions

Meta AI Video Similarity Competition
8th overall (196 participants) | 1st in AI graduate course challenge (42 participants)
Used a pretrained Self-Supervised Descriptor for Copy Detection to find manipulated content among 40,000+ videos.
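A schematic sketch of descriptor-based copy detection under simplifying assumptions: embed sampled frames, average them into a video-level descriptor, and rank reference videos by cosine similarity. A generic ResNet-50 embedding stands in for the actual SSCD descriptor, and the frame sampling and matching logic of the real solution are omitted.

import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

# Stand-in embedder; the real system used a pretrained SSCD descriptor model.
weights = ResNet50_Weights.DEFAULT
backbone = resnet50(weights=weights)
backbone.fc = torch.nn.Identity()                # 2048-d features instead of logits
backbone.eval()
preprocess = weights.transforms()

def video_descriptor(frames) -> torch.Tensor:
    # frames: iterable of PIL images sampled from one video.
    with torch.no_grad():
        embs = torch.stack([backbone(preprocess(f).unsqueeze(0)).squeeze(0) for f in frames])
    return F.normalize(embs.mean(dim=0), dim=0)   # single L2-normalized descriptor

def top_matches(query: torch.Tensor, references: torch.Tensor, k: int = 5):
    # references: (N, D) matrix of reference-video descriptors.
    sims = references @ query                     # cosine similarity for unit vectors
    return torch.topk(sims, k)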

Leveraging Fine-tuned Models for Prompt Prediction
Kaggle competition entry and accompanying research on predicting the text prompts behind generated images using an ensemble of CLIP, BLIP, and ViT models.
Built a custom dataset of 100k+ generated images, curated to reduce semantic overlap.
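As an illustration of one such ensemble member (a sketch under assumptions, not the competition code): map a CLIP image embedding into the embedding space of a sentence encoder so predicted prompts can be scored by cosine similarity. The model names and the regression head below are illustrative, and the head is left untrained here.

import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical head mapping the 512-d CLIP image embedding into a 384-d
# sentence-embedding space; in practice it would be trained on (image, prompt) pairs.
head = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 384))

def predict_prompt_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        img_emb = clip.get_image_features(**inputs)   # (1, 512)
    pred = head(img_emb)                              # (1, 384)
    return nn.functional.normalize(pred, dim=-1)      # ready for cosine scoring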

BU Wheelock Educational Policy Center: Analyzing Classroom Time
MLOps Development Team | Data & Process Engineer
Partnered with TeachForward & Wheelock EPC to build a feature-extraction pipeline analyzing teacher time usage from 10k+ classroom videos. Developed a simple UI via Gradio & HuggingFace Spaces.
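A minimal sketch of the kind of Gradio interface used for the demo; the analysis function below is a hypothetical stub, not the actual pipeline.

import gradio as gr

def analyze_video(video_path: str) -> dict:
    # Placeholder: the real pipeline extracted features from each classroom
    # video and summarized how teacher time was spent.
    return {"video": video_path, "note": "analysis stubbed out in this sketch"}

demo = gr.Interface(
    fn=analyze_video,
    inputs=gr.Video(label="Classroom video"),
    outputs=gr.JSON(label="Time-usage summary"),
    title="Classroom Time Analysis (demo sketch)",
)

if __name__ == "__main__":
    demo.launch()   # the actual demo was hosted on HuggingFace Spaces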