Texas A&M University

CSCE 689 Special Topics in Vision Foundation Models

Fall 2024

[ Home | Logistics | Grading | Schedule | Canvas ]

Course Overview

Description: This graduate-level course focuses on the latest research on Vision Foundation Models. More specifically, we will study the theory and foundations of computer vision, deep learning techniques (e.g., ConvNets, Transformers, diffusion models), and vision foundation models (e.g., CLIP, SAM, DINO, Stable Diffusion, multimodal LLMs). We will examine how these techniques can be applied to various topics in visual computing, including but not limited to large-scale recognition, image and video generation, text-to-image generation, 3D generation and reconstruction, and human modeling for extended reality. The course will explore relevant topics and open questions, with a strong emphasis on bridging the gap between different fields of AI. The goal is for students to gain both a high-level understanding of important problems and possible solutions and a low-level understanding of technical details. The format of the class will be a mix of lectures and student research paper presentations.

Note: This is not an introductory computer vision, machine learning, or deep learning course. Instead, it focuses on recent techniques, with the goal of preparing students with the essential knowledge and skills to conduct research in computer vision and machine learning.




Course Information

Instructor: Cheng Zhang

chzhang at tamu dot edu

Office hours: W 4-5 PM, F 2-3 PM

Office: Peterson 321

Logistics

Grading Policy

  • Participation (5%)
  • Paper presentation (20%)
  • Literature review report (20%)
  • Final team project (55%)
Textbooks

    This course does not require a textbook. The lecture slides/videos and other materials provided by the instructor will be sufficient and will serve as the primary reference. In addition, students are encouraged to refer to the following textbooks and materials:
  • Foundations of Computer Vision, Antonio Torralba, Phillip Isola, William T. Freeman, 2024
  • Computer Vision: Algorithms and Applications (2nd Edition), Richard Szeliski, 2022
  • Understanding Deep Learning, Simon J. D. Prince, 2023
  • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016



    Class Schedule


    ** class schedule is subject to change **

    Date | Topics | Speaker | Course Materials | Note
    Week 1: Introduction and machine learning review
    Mon 8/19 Logistics and course overview Cheng Lecture 1 (TAMU only)
    Wed 8/21 Machine learning review
    Details: Supervised/unsupervised, classification/regression, linear/non-linear, overfitting/underfitting, MLPs, SGD, backprop and autodiff
    Cheng Lecture 2 Course sign up due: 8/23
    Week 2: Computer vision basics 1
    Mon 8/26 Theories and history
    Details: A data-centric perspective on computer vision
    Cheng Lecture 3

    Readings:
    Svetlana Lazebnik's course
    Bill Freeman's talk
    CVPRW 2023: Big Models
    CVPRW 2024: CV 20/20
    Wed 8/28 Basic deep learning blocks for FMs
    Details: CNNs and Transformers. Tokens, attention, positional encodings. Vision Transformers (ViT)
    Cheng Lecture 4

    Readings:
    Attention Is All You Need
    RNN and Transformer (Chapter)
    The Annotated Transformer
    Neural Machine Translation
    Vision Transformers
    Week 3: Computer vision basics 2
    Mon 9/2 No class (Labor Day)
    Wed 9/4 Representative CV tasks and baselines
    Details: Visual understanding: detection, semantic segmentation, instance segmentation. Vision+language. Geometry: depth, shape.
    Cheng Lecture 5

    Readings:
    CVPR 2019 Tutorial
    CVPR 2020 Tutorial
    Paper presentation sign up: 9/3
    Week 4: Vision Foundation Models basics 1
    Mon 9/9 Pre-training and representation learning
    Details: Unsupervised learning: K-means, PCA, autoencoders. Self-supervised learning: contrastive learning, predictive learning.
    Cheng Lecture 6

    Readings:
    Representation learning (Chapter)
    Review paper
    Lil'Log: contrastive learning
    DINOv2, CLIP, MAE
    Wed 9/11 Post-training and adaptation
    Details: Supervised fine-tuning (SFT), parameter-efficient transfer learning (PETL), (visual) instruction tuning, re-parameterization, RLHF
    Cheng Lecture 7

    Readings:
    Llama 3
    Sebastian's blog
    Visual prompt tuning
    Visual prompt
    Prompt via inpainting
    OpenAI RLHF, o1-preview CoT
    Week 5: Vision Foundation Models basics 2
    Mon 9/16 Generative model zoo
    Cheng Lecture 8

    Readings:
    Generative models (Chapter)
    Generative modeling meets representation learning (Chapter)
    Conditional generation (Chapter)
    Wed 9/18 Diffusion models
    Cheng Lecture 9

    Readings:
    VAE
    Lil'Log: diffusion models
    Yang Song's blog
    Week 6: Advances in visual perception and understanding (topic 1)
    Mon 9/23 Scene Understanding in the Era of Vision Foundation Models
    Dr. Lei Ke (CMU)
    Guest lecture
    Wed 9/25 Depth, point tracking, segmentation
    Chih-Chuan
    Zubair
    Readings:
    OmniMotion
    Marigold
    Depth Anything
    Medical segmentation
    Week 7: Advances in generative visual models (topic 2)
    Mon 9/30 Text-to-image, personalized generation, conditional generation
    Christian
    Shima
    Readings:
    Stable Diffusion
    DreamBooth, Textual Inversion
    ControlNet
    Wed 10/2 3D generation, video generation, new architecture
    Susav
    Zhitong
    Readings:
    DreamFusion
    Diffusion Transformer (DiT)
    Lil'Log: video generation
    Autoregressive Generation
    Week 8: Final project proposal presentation
    Mon 10/7 No class (Fall Break)
    Wed 10/9 Final project proposal presentation
    Students Proposal slide upload due: 10/8
    Week 9: Multimodal foundation models (topic 3-I)
    Mon 10/14 Pre-training and post-training
    Amirmohammad
    Jarin
    Readings:
    CLIP
    BLIPx
    Wed 10/16 Multimodal LLMs
    Atharva
    Messal
    Readings:
    Flamingo
    Mini-GPTx
    LLaVA
    LLaVA-OneVision
    Week 10: Multimodal foundation models (topic 3-II)
    Mon 10/21 Diagnosis, data, understanding
    Ishaan
    Haopeng
    Readings:
    BRAVE
    Concept bottleneck
    Platonic representation hypothesis
    Wed 10/23 New architectures, applications
    Kexin
    Cheng
    Readings:
    Transfusion
    Show-o
    SpatialRGPT
    Week 11: Responsible AI (topic 4)
    Mon 10/28 Dataset, bias, long-tail, diagnosis, explainability, alignment
    Yasir
    Sunyoung
    Week 12: 3D-aware synthesis (topic 5)
    Wed 10/30 3D Human Digitization
    Youngjoong Kwon (Stanford) Guest lecture
    Mon 11/4 3D representations (NeRFs, Gaussian splatting), sparse view, 3D humans
    Cheng
    Neil
    Wed 11/6 Manipulation, editing, dynamics
    Fengzhi
    Junsuk
    Week 13: Foundation models for robotics (topic 6)
    Mon 11/11 Task specification
    Cheng
    Stephen
    Wed 11/13 Vision-language-action models
    Jeremy
    Debajoy
    Week 14: Final project presentation
    Mon 11/18 Trending topics (if time allows)
    Cheng
    Wed 11/20 Final project presentation
    Students
    Week 15: Final project presentation
    Mon 11/25 Final project presentation
    Students
    Wed 11/27 Reading Day
    Students Final project report due: 12/6


    Acknowledgements: The materials for this course have been pieced together from many different places. Special thanks to the following colleagues for sharing their course materials or making them available online (in alphabetical order): Ehsan Adeli, Sara Beery, Wei-Lun (Harry) Chao, Alyosha Efros, Stefano Ermon, Bill Freeman, Boqing Gong, Kaiming He, Phillip Isola, Justin Johnson, Angjoo Kanazawa, Andrej Karpathy, Bernhard Kerbl, Fei-Fei Li, Andrew Owens, Deepak Pathak, Chen Sun, Vincent Sitzmann, Yang Song, Antonio Torralba, Fernando De la Torre, Shubham Tulsiani, Kilian Weinberger, Jiajun Wu, Jun-Yan Zhu. The course website is adapted from here.