Jaehyun Park
I am a PhD candidate in the Data Mining Lab at KAIST, advised by Prof. Jae-Gil Lee. My research focuses on data-centric AI for LLM post-training, studied along two axes: how we train on supervision so that weak signals become useful for learning, and where the supervision comes from (human, AI, or both).
Post-training data is naturally weak supervision.
We often describe LLMs as generative models, but in practice they predict the next token. For any prompt, infinitely many continuations may be valid, so any finite dataset shows only a subset of what counts as “correct.” This makes post-training data a form of weak supervision. My interest is in strengthening that weak signal by using what the model already knows.
Where humans fit in a world of synthetic data.
As synthetic data becomes more common, a key question remains: can we train LLMs without human input, and if not, where are humans still necessary? I’m interested in the differences between AI-generated and human supervision: the roles each plays, the mistakes each tends to make, and how each shapes model learning.
news
| Date | News |
|---|---|
| Feb 21, 2026 | Our paper See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis is accepted at CVPR 2026! 🎉🥳 |
| Jan 21, 2026 | Our paper Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs is accepted at ICLR 2026 Workshop ICBINB! 🎉🥳 |
| Mar 01, 2025 | I am starting my Ph.D. program at the Data Mining Lab @ KAIST. |
| Feb 26, 2025 | I am starting my research scientist internship at Krafton AI. |
| Jan 22, 2025 | Our paper Active Learning for Continual Learning: Keeping the Past Alive in the Present is accepted at ICLR 2025! 🎉🥳 |
publications
- CVPR: In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026
- ICLRw: In The International Conference on Learning Representations Workshop, 2026
- ICLR: In The International Conference on Learning Representations, 2025
- NeurIPS: In Advances in Neural Information Processing Systems, 2024