I love to build! I have engineering experience ranging from multimodal post-training and evaluation to designing and scaling universal processing layers and efficient multi-TB curation pipelines.
TL;DR
Current
Head of Data, Founding Engineer · Sciforium
Byte-native multimodal foundation models + high-performance data processing, curation, and evaluation systems at scale
Education
M.Sc., Boston University
Electrical & Computer Engineering
B.Sc., UCLA
Microbiology, Immunology, & Molecular Genetics
Recent News
- Jan 2025: Joined Code Metal as Senior Research Engineer to build at the edge!
- Feb 2025: Joined Sciforium as Head of Data (Founding Engineer) to work on byte-native multimodal models
- Jan 2025: Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
- Nov 2024: Raive acquired & Research presentation for Women in AI & Robotics
- Oct 2024: Research presentation for AI Tinkerers × Human Feedback Foundation
- Sept 2024: Two papers accepted to NeurIPS 2024: The Rapid Decline of the AI Data Commons & Improving Diffusion Model Control and Quality (Workshop on Compositional Learning)
- July 2024
- Nov 2023: Platypus accepted to NeurIPS 2023 Workshop on Instruction Tuning & Following; models surpass 1M+ downloads on HuggingFace
- Oct 2023: Guest Lecturer @ HKUST, LLMOps with Prof. Sung Kim
- Sept 2023: Joined Raive as Founding Research Scientist to work on generative media models with IP attribution
Publications
Bridging the Data Provenance Gap Across Text, Speech, and Video
ICLR 2025
Examines data provenance challenges across text, speech, and video, and proposes ways to bridge the existing gaps.
Consent in Crisis: The Rapid Decline of the AI Data Commons
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand evolving access restrictions in AI.
From Text to Pose to Image: Improving Diffusion Model Control and Quality
NeurIPS 2024 Workshop on Compositional Learning
Novel approach for improving control and quality in diffusion models through intermediate pose representations.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) via data refinement; the models led among post-trained models at release time.
Selected Projects & Competitions
Meta AI Video Similarity Challenge
Used a pretrained Self-Supervised Descriptor for Copy Detection (ResNeXt101) to retrieve and match manipulated content across 40,000+ videos.
Leveraging Fine-tuned Models for Prompt Prediction
Ensembled fine-tuned CLIP/ViT on 105k image-prompt pairs, outperforming captioning baselines.
Built custom dataset with reduced semantic overlap for improved model training.
BU Wheelock Educational Policy Center
Developed feature extraction pipeline analyzing teacher time usage from 10,000+ classroom videos.
Created user interface via Gradio & HuggingFace Spaces for video analysis with object and activity detection.