Texas A&M University

CSCE 689 Special Topics in Vision Foundation Models

Fall 2024

[ Home | Logistics | Grading | Schedule | Canvas ]

Course Overview

Description: This graduate-level course focuses on the latest research on Vision Foundation Models. More specifically, we will study the theory and foundations of computer vision, deep learning techniques (e.g., ConvNets, Transformers, diffusion models), and vision foundation models (e.g., CLIP, SAM, DINO, Stable Diffusion, multimodal LLMs). We will examine how these techniques can be applied to various topics in visual computing, including but not limited to large-scale recognition, image and video generation, text-to-image generation, 3D generation and reconstruction, and human modeling for extended reality. The course will explore relevant topics and open questions, with a strong emphasis on bridging the gap between different fields of AI. The goal is for students to gain both a high-level understanding of important problems and possible solutions and a low-level understanding of technical details. The format of the class will be a mix of lectures and student research paper presentations.

Note: This is not an introductory computer vision, machine learning, or deep learning course. Instead, it focuses on recent techniques, with the goal of preparing students with the essential knowledge and skills to conduct research in computer vision and machine learning.




Course Information

Instructor: Cheng Zhang

chzhang at tamu dot edu

Office hours: W 4-5 PM, F 2-3 PM

Office: Peterson 321

Logistics

Grading Policy

  • Participation (5%)
  • Paper presentation (20%)
  • Literature review report (20%)
  • Final team project (55%)
Textbooks

    This course does not require a textbook. The lecture slides/videos and other materials provided by the instructor will be sufficient and will serve as the primary reference. In addition, students are encouraged to refer to the following textbooks and materials:
  • Foundations of Computer Vision, Antonio Torralba, Phillip Isola, William T. Freeman, 2024
  • Computer Vision: Algorithms and Applications (2nd Edition), Richard Szeliski, 2022
  • Understanding Deep Learning, Simon J. D. Prince, 2023
  • Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016



    Class Schedule


    ** class schedule is subject to change **

    Date | Topics | Speaker | Course Materials | Note
    Week 1: Introduction and machine learning review
    Mon 8/19 Logistics and course overview Cheng Lecture 1 (TAMU only)
    Wed 8/21 Machine learning review
    Details: Supervised/unsupervised, classification/regression, linear/non-linear, overfitting/underfitting, MLPs, SGD, backprop and autodiff
    Cheng Lecture 2 Course sign up due: 8/23
    Week 2: Computer vision basics 1
    Mon 8/26 Theories and history
    Details: A data-centric perspective on computer vision
    Cheng Lecture 3

    Readings:
    Svetlana Lazebnik's course
    Bill Freeman's talk
    CVPRW 2023: Big Models
    CVPRW 2024: CV 20/20
    Wed 8/28 Basic deep learning blocks for FMs
    Details: CNNs and Transformers. Tokens, attention, positional encodings. Vision Transformers (ViT)
    Cheng Lecture 4

    Readings:
    Attention Is All You Need
    RNN and Transformer (Chapter)
    The Annotated Transformer
    Neural Machine Translation
    Vision Transformers
    Week 3: Computer vision basics 2
    Mon 9/2 No class (Labor Day)
    Wed 9/4 Representative CV tasks and baselines
    Details: Visual understanding: detection, semantic segmentation, instance segmentation. Vision+language. Geometry: depth, shape.
    Cheng Lecture 5

    Readings:
    CVPR 2019 Tutorial
    CVPR 2020 Tutorial
    Paper presentation sign up: 9/3
    Week 4: Vision Foundation Models basics 1
    Mon 9/9 Pre-training and representation learning
    Details: Unsupervised learning: K-means, PCA, autoencoders. Self-supervised learning: contrastive learning, predictive learning.
    Cheng Lecture 6

    Readings:
    Representation learning (Chapter)
    Review paper
    Lil'Log: contrastive learning
    DINOv2, CLIP, MAE
    Wed 9/11 Post-training and adaptation
    Details: Supervised fine-tuning (SFT), parameter-efficient transfer learning (PETL), (visual) instruction tuning, re-parameterization, RLHF
    Cheng Lecture 7

    Readings:
    Llama 3
    Sebastian's blog
    Visual prompt tuning
    Visual prompt
    Prompt via inpainting
    OpenAI RLHF, o1-preview CoT
    Week 5: Vision Foundation Models basics 2
    Mon 9/16 Generative model zoo
    Cheng Lecture 8

    Readings:
    Generative models (Chapter)
    Generative modeling meets representation learning (Chapter)
    Conditional generation (Chapter)
    Wed 9/18 Diffusion models
    Cheng Lecture 9

    Readings:
    VAE
    Lil'Log: diffusion models
    Yang Song's blog
    Week 6: Advances in visual perception and understanding (topic 1)
    Mon 9/23 Scene Understanding in the Era of Vision Foundation Models
    Dr. Lei Ke (CMU)
    Guest lecture
    Wed 9/25 Depth, point tracking, segmentation
    Chih-Chuan
    Zubair
    Readings:
    OmniMotion
    Marigold
    Depth Anything
    Medical segmentation
    Week 7: Advances in generative visual models (topic 2)
    Mon 9/30 Text-to-image, personalized generation, conditional generation
    Christian
    Shima
    Readings:
    Stable Diffusion
    DreamBooth, Textual Inversion
    ControlNet
    Wed 10/2 3D generation, video generation, new architecture
    Susav
    Zhitong
    Readings:
    DreamFusion
    Diffusion Transformer (DiT)
    Lil'Log: video generation
    Autoregressive Generation
    Week 8: Final project proposal presentation
    Mon 10/7 No class (Fall Break)
    Wed 10/9 Final project proposal presentation
    Students Proposal slide upload due: 10/8
    Week 9: Multimodal foundation models (topic 3-I)
    Mon 10/14 Pre-training and post-training
    Amirmohammad
    Jarin
    Readings:
    CLIP
    BLIPx
    Wed 10/16 Multimodal LLMs
    Atharva
    Messal
    Readings:
    Flamingo
    Mini-GPTx
    LLaVA
    LLaVA-OneVision
    Week 10: Multimodal foundation models (topic 3-II)
    Mon 10/21 Diagnosis, data, understanding
    Ishaan
    Haopeng
    Readings:
    BRAVE
    Concept bottleneck
    Platonic representation hypothesis
    Wed 10/23 New architectures, applications
    Kexin
    Cheng
    Readings:
    Transfusion
    Show-o
    SpatialRGPT
    Week 11: Responsible AI (topic 4)
    Mon 10/28 Dataset, bias, long-tail, diagnosis, explainability, alignment
    Yasir
    Sunyoung
    Week 12: 3D-aware synthesis (topic 5)
    Wed 10/30 3D Human Digitization
    Youngjoong Kwon (Stanford) Guest lecture
    Mon 11/4 3D representations (NeRFs, Gaussian splatting), sparse view, 3D humans
    Cheng
    Neil
    Wed 11/6 Manipulation, editing, dynamics
    Fengzhi
    Junsuk
    Week 13: Foundation models for robotics (topic 6)
    Mon 11/11 Task specification
    Cheng
    Stephen
    Wed 11/13 Vision-language-action models
    Jeremy
    Debajoy
    Week 14: Final project presentation
    Mon 11/18 Trending topics (if time allows)
    Cheng
    Wed 11/20 Final project presentation
    Students
    Week 15: Final project presentation
    Mon 11/25 Final project presentation
    Students
    Wed 11/27 Reading Day
    Students Final project report due: 12/6


    Acknowledgements: The materials for this course have been pieced together from many different places. Special thanks to the following colleagues for sharing their course materials or making them available online (in alphabetical order): Ehsan Adeli, Sara Beery, Wei-Lun (Harry) Chao, Alyosha Efros, Stefano Ermon, Bill Freeman, Boqing Gong, Kaiming He, Phillip Isola, Justin Johnson, Angjoo Kanazawa, Andrej Karpathy, Bernhard Kerbl, Fei-Fei Li, Andrew Owens, Deepak Pathak, Chen Sun, Vincent Sitzmann, Yang Song, Antonio Torralba, Fernando De la Torre, Shubham Tulsiani, Kilian Weinberger, Jiajun Wu, Jun-Yan Zhu. The course website is adapted from here.