I lead data capabilities for byte-native foundation model research at an early-stage startup. My hands-on experience ranges from multimodal post-training to designing and scaling universal processing layers and efficient multi-TB curation pipelines. I am currently learning how to deploy and scale model serving.
TL;DR
Current
Head of Data · Sciforium
Third engineering team hire, building multimodal models and high-throughput data systems
Previous
Education
M.Sc., Boston University
Electrical & Computer Engineering
B.Sc., UCLA
Microbiology, Immunology, & Molecular Genetics
Recent News
- Feb 2025: Joined Sciforium as Head of Data, working on multimodal models and data operations
- Jan 2025: Bridging the Data Provenance Gap Across Text, Speech, and Video accepted to ICLR 2025
- Nov 2024: Raive acquired & Research presentation for Women in AI & Robotics
- Oct 2024: Research presentation for AI Tinkerers × Human Feedback Foundation
- Sept 2024: Two papers accepted to NeurIPS 2024: The Rapid Decline of the AI Data Commons & Improving Diffusion Model Control and Quality (Workshop on Compositional Learning)
- July 2024
- Nov 2023: Platypus accepted to NeurIPS 2023 Workshop on Instruction Tuning & Following; models surpass 1M+ downloads on HuggingFace
- Oct 2023: Guest Lecturer @ HKUST, LLMOps with Prof. Sung Kim
- Sept 2023: Joined Raive as Founding Research Scientist to work on generative media models with IP attribution
Publications
Bridging the Data Provenance Gap Across Text, Speech, and Video
ICLR 2025
Examines data provenance challenges across text, speech, and video, and proposes solutions to bridge existing gaps.
Consent in Crisis: The Rapid Decline of the AI Data Commons
NeurIPS 2024 Datasets and Benchmarks Track
Analysis of 14,000+ web domains to understand evolving access restrictions in AI.
From Text to Pose to Image: Improving Diffusion Model Control and Quality
NeurIPS 2024 Workshop on Compositional Learning
Novel approach for improving control and quality in diffusion models through intermediate pose representations.
Platypus: Quick, Cheap, and Powerful Refinement of LLMs
NeurIPS 2023 Workshop on Instruction Tuning & Instruction Following
Developed open-source LLMs (1M+ downloads) via data refinement; the models led other post-trained models at release time.
Open Source Contributions
Naturalistic AI Project
[Code & Paper Release Coming Soon] Scaled LLM framework for conversation analysis with an asynchronous
processing architecture, streaming pipelines, and intelligent caching. Features a 4-level annotation
system, 6 specialized analyzers, a type-safe Pydantic/Instructor implementation, and two
interfaces: an interactive dashboard for exploration and a Python API for programmatic research
workflows.
Platypus LLMs & Dataset
Selected Projects & Competitions
Meta AI Video Similarity Challenge
Used a pretrained Self-Supervised Descriptor for Copy Detection model (ResNeXt101) to retrieve and match
manipulated content across 40,000+ videos.
Leveraging Fine-tuned Models for Prompt Prediction
Ensembled CLIP/ViT models fine-tuned on 105k image-prompt pairs, outperforming captioning baselines.
Built a custom dataset with reduced semantic overlap to improve model training.
BU Wheelock Educational Policy Center
Developed a feature extraction pipeline to analyze teacher time usage in 10,000+ classroom videos.
Created a user interface with Gradio & HuggingFace Spaces for video analysis with object and activity detection.