Ariel N. Lee

Senior Research Engineer

I love to build! I have engineering experience ranging from multimodal post-training and evaluation to designing and scaling universal processing layers and efficient multi-TB curation pipelines.
TL;DR
Current
Senior Research Engineer · Code Metal
Building at the edge
Lead · Data Provenance Initiative
Large-scale audits of datasets powering SOTA AI models
Previous
Head of Data, Founding Engineer · Sciforium
Byte-native multimodal foundation models + high-performance data processing, curation, and evaluation systems at scale
Founding Research Scientist · Raive (acquired 2024)
Generative media models with IP attribution
Co-lead & OSS Researcher · garage-bAInd
Platypus LLMs and dataset (1M+ downloads)
Education
M.Sc., Boston University
Electrical & Computer Engineering
B.Sc., UCLA
Microbiology, Immunology, & Molecular Genetics

Recent News

  • Jan 2025
    Joined Code Metal as Senior Research Engineer to build at the edge!
  • Feb 2025
    Joined Sciforium as Head of Data (Founding Engineer) to work on byte-native multimodal models
  • Jan 2025
    Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
  • Nov 2024
    Raive acquired; research presentation for Women in AI & Robotics
  • Oct 2024
    Research presentation for AI Tinkerers × Human Feedback Foundation
  • Sept 2024
    Two papers accepted to NeurIPS 2024: The Rapid Decline of the AI Data Commons & Improving Diffusion Model Control and Quality (Workshop on Compositional Learning)
  • July 2024
    DPI paper featured in The New York Times, 404 Media, Vox, Yahoo! Finance, and Variety
  • Nov 2023
    Platypus accepted to NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following; models surpass 1M downloads on Hugging Face
  • Oct 2023
    Guest Lecturer @ HKUST, LLMOps with Prof. Sung Kim
  • Sept 2023
    Joined Raive as Founding Research Scientist to work on generative media models with IP attribution

Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, Nikhil Singh, … (22 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
Addressing data provenance challenges across different modalities, including text, speech, and video, proposing solutions to bridge existing gaps.
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, … (44 authors), Sara Hooker, Jad Kabbara, Sandy Pentland
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand evolving access restrictions in AI.
From Text to Pose to Image: Improving Diffusion Model Control and Quality
Clément Bonnet, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru
NeurIPS 2024 Workshop on Compositional Learning
Novel approach for improving control and quality in diffusion models through intermediate pose representations.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) via data refinement; they were the leading post-trained models at release time.
Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
arXiv preprint 2023
Created 2 new datasets and a data augmentation method for CNNs to simulate ViT patch selectivity, improving robustness to occlusions.

Selected Projects & Competitions

Meta AI Video Similarity Challenge
8th of 196 overall · 1st of 42 in AI graduate course
Used pretrained Self-Supervised Descriptor for Copy Detection (ResNeXt101) to retrieve and match manipulated content across 40,000+ videos.
Leveraging Fine-tuned Models for Prompt Prediction
Kaggle Competition 2023
Ensembled fine-tuned CLIP/ViT on 105k image-prompt pairs, outperforming captioning baselines. Built custom dataset with reduced semantic overlap for improved model training.
BU Wheelock Educational Policy Center
MLOps Development Team · Data & Process Engineer
Developed a feature extraction pipeline analyzing teacher time usage from 10,000+ classroom videos. Created a user interface via Gradio & Hugging Face Spaces for video analysis with object and activity detection.