Ariel N. Lee

Research & Data Engineer

I lead data capabilities for byte-native foundation model research at an early-stage startup. I have hands-on experience ranging from multimodal post-training to designing and scaling universal processing layers and efficient multi-TB curation pipelines. I am currently learning how to deploy and scale model serving.
TL;DR
Current
Head of Data · Sciforium
Third engineering team hire, building multimodal models and high-throughput data systems
Lead · Data Provenance Initiative
Large-scale audits of datasets powering SOTA AI models
Previous
Founding Research Scientist · Raive (acquired 2024)
Generative media models with IP attribution
Co-lead & OSS Researcher · garage-bAInd
Platypus LLMs and dataset (1M+ downloads)
Education
M.Sc., Boston University
Electrical & Computer Engineering
B.Sc., UCLA
Microbiology, Immunology, & Molecular Genetics

Recent News

  • Feb 2025
    Joined Sciforium as Head of Data, working on multimodal models and data operations
  • Jan 2025
    Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
  • Nov 2024
    Raive acquired; research presentation for Women in AI & Robotics
  • Oct 2024
    Research presentation for AI Tinkerers × Human Feedback Foundation
  • Sept 2024
    Two papers accepted to NeurIPS 2024: The Rapid Decline of the AI Data Commons & Improving Diffusion Model Control and Quality (Workshop on Compositional Learning)
  • July 2024
    DPI paper featured in The New York Times, 404 Media, Vox, Yahoo! Finance, and Variety
  • Nov 2023
    Platypus accepted to the NeurIPS 2023 Workshop on Instruction Tuning & Following; models surpass 1M downloads on Hugging Face
  • Oct 2023
    Guest lecturer at HKUST in Prof. Sung Kim's LLMOps course
  • Sept 2023
    Joined Raive as Founding Research Scientist to work on generative media models with IP attribution

Publications

Bridging the Data Provenance Gap Across Text, Speech, and Video
Shayne Longpre, Nikhil Singh, … (22 authors), Ariel N. Lee, … (15 authors), Stella Biderman, Alex Pentland, Sara Hooker, Jad Kabbara
ICLR 2025
Audits data provenance across text, speech, and video datasets and proposes solutions to bridge documentation gaps between modalities.
Consent in Crisis: The Rapid Decline of the AI Data Commons
Shayne Longpre, Robert Mahari, Ariel N. Lee, Campbell Lund, … (44 authors), Sara Hooker, Jad Kabbara, Sandy Pentland
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand evolving access restrictions in AI.
From Text to Pose to Image: Improving Diffusion Model Control and Quality
Clément Bonnet, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru
NeurIPS 2024 Workshop on Compositional Learning
Novel approach for improving control and quality in diffusion models through intermediate pose representations.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) via dataset refinement; the best model led open-source post-trained LLMs at release.
Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing
Ariel N. Lee, Sarah Adel Bargal, Janavi Kasera, Stan Sclaroff, Kate Saenko, Nataniel Ruiz
arXiv preprint 2023
Created 2 new datasets and a data augmentation method for CNNs to simulate ViT patch selectivity, improving robustness to occlusions.

Open Source Contributions

Naturalistic AI Project
Data Provenance Initiative
[Code & Paper Release Coming Soon] Scalable LLM framework for conversation analysis with an async processing architecture, streaming pipelines, and intelligent caching. Features a 4-level annotation system, 6 specialized analyzers, a type-safe Pydantic/Instructor implementation, and dual interfaces: an interactive dashboard for exploration and a Python API for programmatic research workflows.
Platypus LLMs & Dataset
garage-bAInd
Open-source models and dataset with 1M+ downloads on Hugging Face. Our best model, fine-tuned from the Llama architecture, led post-trained open-source LLMs globally at release and for two months after. Researched low-cost, efficient ways to refine domain-specific LLMs using LoRA and curated datasets.

Selected Projects & Competitions

Meta AI Video Similarity Challenge
8th of 196 overall · 1st of 42 in AI graduate course
Used a pretrained Self-Supervised Descriptor for Copy Detection (SSCD, ResNeXt101) to retrieve and match manipulated content across 40,000+ videos.
Leveraging Fine-tuned Models for Prompt Prediction
Kaggle Competition 2023
Ensembled fine-tuned CLIP/ViT models on 105k image-prompt pairs, outperforming captioning baselines. Built a custom dataset with reduced semantic overlap to improve training.
BU Wheelock Educational Policy Center
MLOps Development Team · Data & Process Engineer
Developed a feature-extraction pipeline analyzing teacher time use across 10,000+ classroom videos. Built a user interface via Gradio and Hugging Face Spaces for video analysis with object and activity detection.