Ariel N. Lee

Research & Data Engineer

I lead data capabilities for byte-native foundation model research at an early-stage startup. I have hands-on experience ranging from multimodal post-training to designing and scaling universal processing layers and efficient multi-TB curation pipelines. I am currently learning how to deploy and scale model serving.
TL;DR
Current
Head of Data · Sciforium
Third engineering team hire, building multimodal models and high-throughput data systems
Lead · Data Provenance Initiative
Large-scale audits of datasets powering SOTA AI models
Previous
Founding Research Scientist · Raive (acquired 2024)
Generative media models with IP attribution
Co-lead & OSS Researcher · garage-bAInd
Platypus LLMs and dataset (1M+ downloads)
Education
M.Sc., Boston University
Electrical & Computer Engineering
B.Sc., UCLA
Microbiology, Immunology, & Molecular Genetics

Recent News

  • Feb 2025
    Joined Sciforium as Head of Data, working on multimodal models and data operations
  • Jan 2025
    Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
  • Nov 2024
    Raive acquired; research presentation for Women in AI & Robotics
  • Oct 2024
    Research presentation for AI Tinkerers × Human Feedback Foundation
  • Sept 2024
    Two papers accepted to NeurIPS 2024: The Rapid Decline of the AI Data Commons & Improving Diffusion Model Control and Quality (Workshop on Compositional Learning)
  • July 2024
    DPI paper featured in The New York Times, 404 Media, Vox, Yahoo! Finance, and Variety
  • Nov 2023
    Platypus accepted to the NeurIPS 2023 Workshop on Instruction Tuning & Following; models surpass 1M downloads on Hugging Face
  • Oct 2023
    Guest lecturer at HKUST in Prof. Sung Kim's LLMOps course
  • Sept 2023
    Joined Raive as Founding Research Scientist to work on generative media models with IP attribution

Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, Nikhil Singh, … (22 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
Audits data provenance across text, speech, and video datasets and proposes solutions to bridge documentation gaps between modalities.
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, … (44 authors), Sara Hooker, Jad Kabbara, Sandy Pentland
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand evolving access restrictions in AI.
From Text to Pose to Image: Improving Diffusion Model Control and Quality
Clément Bonnet, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru
NeurIPS 2024 Workshop on Compositional Learning
Novel approach for improving control and quality in diffusion models through intermediate pose representations.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) via dataset refinement; the best model led open-source post-trained LLMs at release.
Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
arXiv preprint 2023
Created 2 new datasets and a data augmentation method for CNNs to simulate ViT patch selectivity, improving robustness to occlusions.

Open Source Contributions

Naturalistic AI Project
Data Provenance Initiative
[Code & Paper Release Coming Soon] Scalable LLM framework for conversation analysis with an async processing architecture, streaming pipelines, and intelligent caching. Features a 4-level annotation system, 6 specialized analyzers, a type-safe Pydantic/Instructor implementation, and dual interfaces: an interactive dashboard for exploration and a Python API for programmatic research workflows.
Platypus LLMs & Dataset
garage-bAInd
Open-source models and dataset with 1M+ downloads on Hugging Face. Our best model, fine-tuned from the Llama architecture, led post-trained open-source LLMs globally at release and for two months after. Researched low-cost, efficient ways to refine domain-specific LLMs using LoRA and curated datasets.

Selected Projects & Competitions

Meta AI Video Similarity Challenge
8th of 196 overall · 1st of 42 in AI graduate course
Used a pretrained Self-Supervised Descriptor for Copy Detection (SSCD, ResNeXt101) to retrieve and match manipulated content across 40,000+ videos.
Leveraging Fine-tuned Models for Prompt Prediction
Kaggle Competition 2023
Ensembled fine-tuned CLIP/ViT models on 105k image-prompt pairs, outperforming captioning baselines. Built a custom dataset with reduced semantic overlap to improve training.
BU Wheelock Educational Policy Center
MLOps Development Team · Data & Process Engineer
Developed a feature-extraction pipeline analyzing teacher time use across 10,000+ classroom videos. Built a user interface via Gradio and Hugging Face Spaces for video analysis with object and activity detection.