Ariel N. Lee – Portfolio Website

Ariel N. Lee

AI Researcher & Data Engineer

About Me

I’m an AI researcher working on multimodal foundation models, with applied experience in large-scale multimedia dataset collection, filtering, and post-training. My focus is on efficient model refinement, use-case-dependent benchmarks and validation, data quality, and domain-specific models in both vision and NLP.

TL;DR

Raive

Founding Research Scientist

Generative multimedia foundation models with IP attribution

Data Provenance Initiative (DPI)

Co-Lead

Large-scale audits of the multimodal datasets powering SOTA AI models

garage-bAInd

Co-Lead & OSS Researcher

Platypus LLMs and dataset (1M+ downloads)

M.Sc., Boston University

Electrical & Computer Engineering

Machine Learning, Data Analytics

B.Sc., University of California, Los Angeles

Microbiology, Immunology, & Molecular Genetics


Recent News

January 2025

Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025

November 2024

Research presentation for Women in AI & Robotics

October 2024

Research presentation for AI Tinkerers x Human Feedback Foundation

September 2024

The Rapid Decline of the AI Data Commons accepted to NeurIPS 2024; Improving Diffusion Model Control and Quality accepted to NeurIPS 2024 Workshop on Compositional Learning

July 2024

Our paper on the decline of the AI data commons featured in New York Times, 404, Vox, Yahoo! Finance, & Variety

November 2023

Platypus accepted to NeurIPS 2023 Workshop on Instruction Tuning & Following, and the models collectively surpass 1M downloads on Hugging Face

October 2023

Guest Lecturer @ HKUST, LLMOps with Prof. Sung Kim

Selected Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video

Shayne Longpre, … (23 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara (2025)

ICLR 2025

Addresses data provenance challenges across text, speech, and video, and proposes solutions to bridge existing gaps.

Consent in Crisis: The Rapid Decline of the AI Data Commons

Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, … (44 authors), Sara Hooker, Jad Kabbara, Sandy Pentland (2024)

NeurIPS 2024 Datasets and Benchmarks Track

Analysis of 14,000+ web domains to understand evolving access restrictions in AI.

Platypus: Quick, Cheap, and Powerful Refinement of LLMs

Ariel N. Lee, Cole Hunter, Nataniel Ruiz (aka garage-bAInd)

NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following

Developed open-source LLMs (1M+ downloads) via data refinement; they were the leading post-trained models at release time.

Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing

Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz

Created 2 new datasets + a data augmentation method for CNNs to simulate ViT patch selectivity, improving robustness to occlusions.

Projects & Competitions

Meta AI Video Similarity Competition

8th overall (196 participants) | 1st in AI graduate course challenge (42 participants)

Used a pretrained Self-Supervised Descriptor for Copy Detection to find manipulated content among 40,000+ videos.

Leveraging Fine-tuned Models for Prompt Prediction

Kaggle competition & AI research for predicting text prompts from generated images using an ensemble (CLIP, BLIP, ViT).

Built a custom dataset of 100k+ generated images, curated to reduce semantic overlap.

BU Wheelock Educational Policy Center: Analyzing Classroom Time

MLOps Development Team | Data & Process Engineer

Partnered with TeachForward & Wheelock EPC to build a feature-extraction pipeline analyzing teacher time usage from 10k+ classroom videos. Developed a simple UI via Gradio & Hugging Face Spaces.