Texas A&M University
CSCE 689 Special Topics in Vision Foundation Models
Fall 2024
Description: This graduate-level course focuses on the latest research on Vision Foundation Models. More specifically, we will study the theory and foundations of computer vision, deep learning techniques (e.g., ConvNets, Transformers, diffusion models), and vision foundation models (e.g., CLIP, SAM, DINO, Stable Diffusion, multimodal LLMs). We will study how these techniques can be applied to various topics in visual computing, including but not limited to large-scale recognition, image and video generation, text-to-image generation, 3D generation and reconstruction, and human modeling for extended reality. The course will explore relevant topics and open questions, with a strong emphasis on bridging the gaps between different fields of AI. The goal is for students to gain both a high-level understanding of important problems and possible solutions and a low-level understanding of the technical details. The class format will be a mix of lectures and student presentations of research papers.
Note: This is not an introductory computer vision, machine learning, or deep learning course. Instead, it focuses on recent techniques, and the goal is to equip students with the essential knowledge and ability to conduct research in computer vision and machine learning.
chzhang at tamu dot edu
Office hours: W 4-5 PM, F 2-3 PM
Office: Peterson 321
** class schedule is subject to change **
Date | Topics | Speaker | Course Materials | Note | |
Week 1: Introduction and machine learning review | |||||
Mon 8/19 | Logistics and course overview | Cheng | Lecture 1 (TAMU only) | ||
Wed 8/21 | Machine learning review: supervised/unsupervised, classification/regression, linear/non-linear, over-fitting/under-fitting, MLPs, SGD, backprop and autodiff | Cheng | Lecture 2 | Course sign up due: 8/23 |
Week 2: Computer vision basics 1 | |||||
Mon 8/26 | Theories and history: a data-centric perspective for computer vision | Cheng | Lecture 3. Readings: Lana Lazebnik's course; Bill Freeman's talk; CVPRW 2023: Big Models; CVPRW 2024: CV 20/20 | |
Wed 8/28 | Basic deep learning blocks for FMs: CNNs and Transformers; tokens, attention, positional codes; Vision Transformers (ViT) | Cheng | Lecture 4. Readings: Attention Is All You Need; RNN and Transformer (Chapter); The Annotated Transformer; Neural Machine Translation; Vision Transformers | |
Week 3: Computer vision basics 2 | |||||
Mon 9/2 | No class (Labor Day) | | | |
Wed 9/4 | Representative CV tasks and baselines: visual understanding (detection, semantic segmentation, instance segmentation); vision+language; geometry (depth, shape) | Cheng | Lecture 5. Readings: CVPR 2019 Tutorial; CVPR 2020 Tutorial | Paper presentation sign up: 9/3 |
Week 4: Vision Foundation Models basics 1 | |||||
Mon 9/9 | Pre-training and representation learning: unsupervised learning (K-means, PCA, autoencoder); self-supervised learning (contrastive learning, predictive learning) | Cheng | Lecture 6. Readings: Representation learning (Chapter); Review paper; Lil's blog: contrastive learning; DINOv2, CLIP, MAE | |
Wed 9/11 | Post-training and adaptation: supervised fine-tuning (SFT), parameter-efficient transfer learning (PETL), (visual) instruction tuning, re-parameterization, RLHF | Cheng | Lecture 7. Readings: Llama 3; Sebastian's blog; Visual prompt tuning; Visual prompt; Prompt via inpainting; OpenAI RLHF, o1-preview CoT | |
Week 5: Vision Foundation Models basics 2 | |||||
Mon 9/16 | Generative model zoo | Cheng | Lecture 8. Readings: Generative models (Chapter); Generative modeling meets representation learning (Chapter); Conditional generation (Chapter) | |
Wed 9/18 | Diffusion models | Cheng | Lecture 9. Readings: VAE; Lil's blog: diffusion models; Yang Song's blog | |
Week 6: Advances in visual perception and understanding (topic 1) | |||||
Mon 9/23 | Scene Understanding in the Era of Vision Foundation Models | Dr. Lei Ke (CMU) | | Guest lecture |
Wed 9/25 | Depth, point tracking, segmentation | Chih-Chuan Zubair | Readings: OmniMotion; Marigold; Depth Anything; Medical segmentation | |
Week 7: Advances in generative visual models (topic 2) | |||||
Mon 9/30 | Text-to-image, personalized generation, conditional generation | Christian Shima | Readings: Stable Diffusion; DreamBooth, Textual Inversion; ControlNet | |
Wed 10/2 | 3D generation, video generation, new architecture | Susav Zhitong | Readings: DreamFusion; Diffusion Transformer (DiT); Lil's blog: video generation; Autoregressive Generation | |
Week 8: Final project proposal presentation | |||||
Mon 10/7 | No class (Fall Break) | | | |
Wed 10/9 | Final project proposal presentation | Students | Proposal slide upload due: 10/8 | |
Week 9: Multimodal foundation models (topic 3-I) | |||||
Mon 10/14 | Pre-training and post-training | Amirmohammad Jarin | Readings: CLIP; BLIPx | |
Wed 10/16 | Multimodal LLMs | Atharva Messal | Readings: Flamingo; Mini-GPTx; LLaVA; LLaVA-OneVision | |
Week 10: Multimodal foundation models (topic 3-II) | |||||
Mon 10/21 | Diagnosis, data, understanding | Ishaan Haopeng | Readings: BRAVE; Concept bottleneck; Platonic representation hypothesis | |
Wed 10/23 | New architectures, applications | Kexin Cheng | Lecture 10. Readings: Transfusion; Show-o; SpatialRGPT | |
Week 11: Responsible AI (topic 4) | |||||
Mon 10/28 | Dataset, bias, long-tail, diagnosis, explainability, alignment | Yasir Sunyoung | Readings: Bias mitigation; Dataset distillation | |
Week 12: 3D-aware synthesis (topic 5) | |||||
Wed 10/30 | Efficient and Affordable 3D Human Digitization | Youngjoong Kwon (Stanford) | Guest lecture | |
Mon 11/4 | 3D representations (NeRFs, Gaussian splatting), 3D humans | Cheng Neil | Lecture 11. Readings: Human Gaussian Splats; SMPL and SMPL-X; Michael Black's talk: 3D Human Foundation Agent | |
Wed 11/6 | Scaling up, sparse-view, dynamic | Fengzhi Junsuk | Readings: CityGaussian; Uncertainty sources; pixelSplat; Dynamic 3DGS | |
Week 13: Foundation models for robotics (topic 6) | |||||
Mon 11/11 | Task specification | Cheng Stephen | Lecture 12. Readings: Survey FMs for robot learning; SayCan; RT-1; 3D Diffuser Actor | |
Wed 11/13 | Vision-language-action models | Jeremy Debajoy | Readings: Distilled Feature Fields; CLIPort | |
Week 14: Final project presentation | |||||
Mon 11/18 | World Models | Cheng | Lecture 13 | |
Wed 11/20 | Final project presentation | Students | | |
Week 15: Final project presentation | |||||
Mon 11/25 | Final project presentation | Students | | |
Wed 11/27 | Reading Day | Students | Final project report due: 12/6 | |