My experience at CVPR 2026

I attended CVPR 2026 in Denver, CO – my second machine learning conference after ICML in 2023, and my second conference of the year after JMM 2026 in Washington, DC (where I was an invited speaker). I wanted to write about that experience. That ship has sailed, but I thought I’d write about this one!

With ICCV and ECCV, CVPR is considered one of the top three venues for computer vision research. This year there were 16,000+ paper submissions, 4089 accepted papers, and 10,000+ attendees, so there’s a lot going on at these conferences. My original research background is not in machine learning and I do not claim expertise in frontier AI/ML research, as excited as I get being surrounded by this stuff (especially by the mathematics of it). So this is just my perspective, but I’m always happy to receive feedback and corrections if something is misstated or misrepresented (email me!).

During the course of the conference I took notes scattered across Obsidian, Signal notes-to-self, and pictures of slides and posters. It’s obviously hard to capture everything I experienced in a blog post, but I’m going to try writing down what I found most interesting or memorable (and not necessarily just going for what’s trendy).

Conference Structure

As I mentioned, there’s a lot that happens here, so I’ll start with a bird’s-eye view (you can skip this if you already know how these work). CVPR ran from Wednesday, June 3 to Sunday, June 7. The first two days are for workshops and tutorials, and the remaining three days are for the main conference. Workshops each focus on specific topics and are organized by a small group of researchers. Each has its own submission process where the organizers decide which submissions are accepted for oral presentation. Workshop proposals must themselves be submitted to and accepted by CVPR for the workshops to be held. Depending on the physical venue (I find that most convention centers are the same though), workshops happen in breakout-style rooms with attendance of maybe 200-500 people at a time. To me, they feel much more community-oriented than oral presentations at the main conference. I didn’t attend any tutorials so I’m not going to comment on those.

The Denver Convention Center. Yes that is a blue bear outside the convention center.

Then there are the main conference days. These mainly consist of oral presentations, poster sessions, and keynotes (though there are a few other activities that happen as well). At this conference there were about two poster sessions and two sessions of oral presentations, happening at non-overlapping times. Each oral session consists of four concurrently running tracks, each track with a particular theme (for example, “multimodal vision” or “generative diffusion modeling”). The poster sessions happen in a big, open space in an exhibit hall, with a few hundred posters out at a time. I really like these sessions because as an attendee you’re free to walk around, pick out what interests you, and actually engage and have a conversation with the author(s) about their work (and maybe even connect afterwards!). By contrast, oral sessions happen in auditorium-sized rooms with capacity in the 1000s. Each talk is 10-15 minutes in length and there’s time for about one question after each. It makes sense that this is the format, since oral submissions are much more selective. But it does make the format feel performative.

There are about 4-8 keynote talks that happen during the main conference. These talks are typically targeted towards a broad audience and don’t get into too much technical detail (and they’re not necessarily related to computer vision), so they tend to be interesting. For example, this year there was a talk on the state of quantum computing.

These days, CVPR and similar-tier conferences use mobile apps to help their attendees network and connect with one another. This year CVPR used Cvent. Their people search was good, but I didn’t find their user experience to otherwise be very good. A colleague of mine who also attended pointed me to this webapp that someone built (source code). It let you keyword search over workshops and tutorials. I forked it and used Claude Code to add a semantic search feature.

To keep it brief, I’ll mainly go through some of the workshops I attended, since I had more energy to take notes on them as they were earlier in the conference. For the main conference I focused more on the poster sessions (for reasons mentioned above).

Workshops

One of the first workshops I attended was on embedded vision, which is all about running vision and image processing algorithms on embedded devices that have low size, weight, or power constraints. I listened in on most of the session, but these were a few presentations that I noted down:

EventGuard: Sparsity-Aware In-Sensor Denoising for Frame-Based Event Vision Sensors. From this talk I learned about event cameras, which differ from everyday cameras in that each pixel records data independently of the others, and asynchronously, only when a change is detected. These cameras can be sensitive to background noise. I also learned of spiking neural networks (I would need to read more about these to explain any further) and binary neural networks (where weights and activations can only be \( \pm 1 \)). One thing that stuck with me during the presentation is that their neural networks reduce mostly to XNOR and popcount operations.
BlankSkip: Early-exit Object Detection onboard Nano-drones: Seemed interesting because it tries to accelerate inference by using an early-exit architecture, where “early exit” means only doing a partial inference if nothing of interest is found in frame. Their architecture is some hybrid of a MobileNet and SSD. YOLO architectures are mentioned in a related work section (also, in reading about SSD, or Single Shot Detectors, their architecture is named “SSD”, but the paper also refers to YOLO as an example of a Single Shot Detector, which is very confusing terminology). As mentioned in the paper, apparently YOLO architectures generally trade latency for accuracy compared with MobileNet.

Early-exit with MobileNet extractor and SSD head.

TinyDEVO: Deep Event-based Visual Odometry on Ultra-low-power Multi-core Microcontrollers: Visual odometry is all about determining the position and orientation of an object (in this case, applied to drones) from camera images. This one proposed an architecture for doing so on (low-power) microcontrollers. Average trajectory error is mentioned as an evaluation metric.
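
The XNOR-and-popcount remark from the EventGuard talk can be made concrete with a small sketch. It is not taken from the paper: it only shows the general trick for binary neural networks, where a dot product of two \( \pm 1 \) vectors becomes bit operations once each vector is packed into an integer. The helper names `pack_bits` and `binary_dot` are made up for illustration.

```python
def pack_bits(signs):
    """Pack a list of +1/-1 values into an int, one bit per value (+1 -> 1, -1 -> 0)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +/-1 vectors, computed from their packed bits."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask      # a bit is 1 wherever the two signs agree
    agreements = bin(xnor).count("1")     # popcount
    return 2 * agreements - n             # (#agreements) - (#disagreements)


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]

reference = sum(x * y for x, y in zip(a, b))           # ordinary multiply-accumulate
fast = binary_dot(pack_bits(a), pack_bits(b), len(a))  # XNOR + popcount
print(reference, fast)
```

Both values agree because each matching pair of signs contributes +1 to the dot product and each mismatched pair contributes -1, which is why hardware with cheap XNOR and popcount instructions suits these networks.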

I sat in the XAI4CV (Explainable AI for Computer Vision) workshop and listened to:

FaCT: Faithful Concept Traces for Explaining Neural Networks. I don’t know a whole lot about explainable AI, so from this talk I tried to understand what, concretely or mathematically, explainability means. I’m sure there are many different definitions of this, but one key thing I learned of was the B-cos transform, which is meant to replace linear layers with an “explainable” equivalent. The below excerpt from this paper explains it well:

They use these together with sparse encoders to propose a new model called FaCT, which they claim performs better than SoTA on interpretability metrics for certain image classification tasks.
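
For a rough feel of what a B-cos unit computes, here is a minimal sketch based on the formulation in the original B-cos networks paper (not on FaCT itself): the usual dot product with a unit-norm weight vector, scaled by \( |\cos(x, w)|^{B-1} \), so inputs that are poorly aligned with the weight get suppressed. The function name `b_cos` and the toy vectors are illustrative assumptions.

```python
import math


def b_cos(x, w, b=2.0):
    """Single B-cos unit: |cos(x, w)|^(b-1) * (w_hat . x), with w_hat = w / ||w||."""
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    w_hat = [wi / w_norm for wi in w]
    linear = sum(wi * xi for wi, xi in zip(w_hat, x))  # plain linear response
    if x_norm == 0.0:
        return 0.0
    cos = linear / x_norm                              # alignment of x with w
    return abs(cos) ** (b - 1.0) * linear


w = [3.0, 4.0]
aligned = [6.0, 8.0]    # same direction as w, so cos = 1
oblique = [10.0, 0.0]   # same length as `aligned`, but cos = 0.6

print(b_cos(aligned, w), b_cos(oblique, w), b_cos(oblique, w, b=1.0))
```

With `b=1.0` the unit is just a linear layer with normalized weights; raising `b` increases the pressure on the weights to align with the inputs they respond to, which is where the interpretability claim comes from.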

I spent quite a bit of time at the Humans of Generative AI workshop, the content of which I found to be quite different from the rest of the conference. It originally caught my attention because of this post. Some of it was related to computer vision, some not, but it drew a lot of interest from security and privacy researchers.

Juliana Castro Varón of the New York Times’ AI Initiatives gave a talk (using all hand-drawn slides) on how NYT uses AI to improve user experience in searching for and ranking news articles. Surprisingly relevant to some large-scale document search problems that I myself have worked on.
Caught in a Mafia Romance: How Users Explore Intimate Narratives with Chatbots: Analyzes usage of character.ai personas and related discussion on relevant subreddits.
Structured Listening: Codifying Human-Meaningful Voice Signals to Ground Generative AI Reasoning: This one doesn’t have a link (or a corresponding paper online, as far as I can tell), but it was about one of Modulate AI’s models. It was interesting to listen to because I found out later that it was presented by the CTO of Modulate AI, my academic sibling’s employer. They’re building in-house ensemble models for advanced audio processing, taking paralinguistic aspects into consideration, and building products using these models.

The Synthetic Data for Computer Vision workshop had a lot of interesting stuff.

There was a talk given by Jia Deng of Princeton’s Vision and Learning Lab titled “Can We Train AI (from scratch) without Collecting Any Data?”. Roughly, the talk was about how one can get quite far with just procedurally-generated data, specifically for 3D vision anyways. He mentioned the Infinigen product built by their lab (open source) as well as their ProcFunc Python library that “transpiles” Blender node workflows into Python code.

Infinigen render from their hello world example outside the convention center.

Georgia Gkioxari of Caltech gave a talk on SAM 3D, Meta’s new 3D segmentation model capable of reconstructing 3D objects from 2D images (she was also on the panel for Women in Computer Vision, which I attended). She began the talk by explaining how one of Meta’s older models had struggled with the relative depth of different objects within a scene. This talk got theoretical very quickly, which I do like, but I sometimes don’t have the background to follow. It mentioned the term “flow matching” (which I’ll discuss later).

There was a Maritime Computer Vision Workshop that I briefly sat in but didn’t get much from.

Of the oral sessions, the most memorable one I’d attended was on visual security, which had several talks on watermarking. Of those, this one was the most memorable since it seemed like it assumed a pretty aggressive threat model.

Things I want to learn more about

There was a lot of information to take in during this conference, including lots of exciting math. These are just the first few things that came to mind, and the list is definitely non-exhaustive:

Flow matching

Flow matching was mentioned on several posters. This conference was the first time I’d heard of it. I had asked one of the poster presenters to explain the concept to me, which I feel like I got something from. I did a little reading about it afterwards and discovered that the technique was introduced in this paper by Meta AI.

As far as I understand, flow matching is a new generative AI technique that lets one sample from complicated distributions by sampling from noise and transforming the result in a continuous manner. There are some blog posts out there that explain it. It borrows the concept of a flow used in geometry. The first thing I thought of when reading about these was homotopies. If you restrict to a finite time interval, flows are technically homotopies, but they obviously carry much more structure than just topology.
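
The description above can be turned into a toy one-dimensional sketch, assuming the simplest straight-line variant of flow matching: pair a noise sample `x0` with a data sample `x1`, and regress a velocity model onto `x1 - x0` at the interpolated point. No network is trained here. Instead, `velocity` is the closed-form field for carrying a standard Gaussian to another Gaussian, standing in for what a trained model would approximate. All names and parameters are illustrative assumptions.

```python
import random


def training_example(x0, x1, t):
    """One regression pair for the straight-line path from noise x0 to data x1."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the path at time t
    target_velocity = x1 - x0       # what a velocity model would regress onto at (x_t, t)
    return x_t, target_velocity


def velocity(x, t, mean, std):
    """Closed-form velocity field carrying N(0, 1) at t=0 to N(mean, std^2) at t=1.

    Stand-in for a trained network: it equals the conditional expectation of
    (x1 - x0) given x_t = x when x0 and x1 are independent Gaussians.
    """
    var_t = (1.0 - t) ** 2 + (t * std) ** 2
    return mean + (t * std ** 2 - (1.0 - t)) / var_t * (x - t * mean)


def sample(x0, mean, std, steps=1000):
    """Transform a noise sample continuously by Euler-integrating the flow from t=0 to t=1."""
    x, dt = x0, 1.0 / steps
    for k in range(steps):
        x += velocity(x, k * dt, mean, std) * dt
    return x


random.seed(0)
samples = [sample(random.gauss(0.0, 1.0), mean=3.0, std=0.5) for _ in range(500)]
print(sum(samples) / len(samples))   # should land near the target mean of 3.0
```

In this Gaussian case the flow simply sends a noise value `x0` to roughly `mean + std * x0`. The point of the technique is that the same regress-then-integrate recipe still applies when the target distribution is far too complicated for a closed form.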

Quantum computing (in general)

This one was a wildcard, but there was a great keynote talk by IBM’s CTO for quantum computing on the general state of the field. I was chatting with Claude while sitting in this talk trying to understand the basic concepts. This is what I’d gotten while doing this:

A qubit is a unit vector in \( \mathbb{C}^2 \)
\( n \) qubits are represented by an element from a tensor product of \( n \) copies of \( \mathbb{C}^2 \), or \( \mathbb{C}^2 \otimes \cdots \otimes \mathbb{C}^2 = \mathbb{C}^{2^n}\)
Like in traditional computing, we use gates to implement quantum circuits. Except for some reason these circuits must be reversible (or invertible). Mathematically, they must be represented by unitary matrices.
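Like in traditional computing, where NAND and NOR are universal, there are universal gates here too. The Toffoli gate is universal for reversible classical computation (as far as I can tell, it needs to be paired with something like the Hadamard gate to be universal for quantum computation). And as with traditional computing, where you wouldn’t build everything out of NANDs or NORs in practice, you wouldn’t use only Toffoli gates in practice either. Finding efficient quantum circuits is a challenge.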
Qiskit is a Python library developed by IBM for building and simulating quantum circuits (and for running them on real quantum hardware).
Quantum compilers turn high-level code like Qiskit into circuits.
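
The first few points in this list can be checked by hand with a tiny state-vector sketch. It deliberately avoids Qiskit and uses plain Python lists, so every helper name here (`kron`, `apply`, `is_unitary`) is an illustrative assumption rather than a real library API. Basis states are ordered with the first qubit as the most significant bit.

```python
import math


def kron(a, b):
    """Tensor (Kronecker) product of two state vectors given as lists of amplitudes."""
    return [x * y for x in a for y in b]


def apply(gate, state):
    """Apply a gate (square matrix as nested lists) to a state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


def is_unitary(gate, tol=1e-12):
    """Check U^dagger U = I, i.e. the columns of the gate are orthonormal."""
    n = len(gate)
    for i in range(n):
        for j in range(n):
            inner = sum(complex(gate[k][i]).conjugate() * gate[k][j] for k in range(n))
            if abs(inner - (1.0 if i == j else 0.0)) > tol:
                return False
    return True


ZERO = [1.0, 0.0]   # |0>, a unit vector in C^2
ONE = [0.0, 1.0]    # |1>

h = 1.0 / math.sqrt(2.0)
HADAMARD = [[h, h], [h, -h]]

# Toffoli (controlled-controlled-NOT): the identity on the 8 basis states,
# except that it swaps |110> and |111>.
TOFFOLI = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
TOFFOLI[6], TOFFOLI[7] = TOFFOLI[7], TOFFOLI[6]

state_110 = kron(kron(ONE, ONE), ZERO)   # three qubits live in C^8
state_111 = kron(kron(ONE, ONE), ONE)

print(len(state_110))                             # dimension 2^3
print(is_unitary(HADAMARD), is_unitary(TOFFOLI))  # both gates are reversible
print(apply(TOFFOLI, state_110) == state_111)     # target flips when both controls are 1
```

Starting the target qubit at 1 instead of 0 makes the Toffoli gate compute NAND of the two control bits, which is the link back to NAND being universal in traditional computing.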

This talk reminded me of my peer’s attempt at understanding quantum computing. I was also reminded of the fact that the startup Quantinuum IPO’d recently.

Contrastive loss and learning

Now this one’s been around for a long time. It is by no means new, and I hadn’t had a good reason to learn exactly what it is, but I saw it around CVPR enough to want to understand it. “Contrastive” is literally the C in CLIP, a very popular image/text embedding model released by OpenAI. Contrastive learning is the process of training a model by pushing similar samples’ representations together in a vector space, and pushing dissimilar ones apart (samples are contrasted with one another). Encord has a good blog post on contrastive learning.
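
The push-together, push-apart idea can be written down as a small loss function. This is a minimal sketch of an InfoNCE-style contrastive loss over a batch of paired embeddings, in one direction only (image to text), with hand-made 2D vectors standing in for real encoder outputs. The function names, the temperature value, and the toy data are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss, image-to-text direction only.

    image_embs[i] and text_embs[i] are a matching pair; every other text
    in the batch serves as a negative example for image i.
    """
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        total += log_denominator - logits[i]   # cross-entropy with correct class i
    return total / len(image_embs)


images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]      # each text near its image
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

print(contrastive_loss(images, aligned_texts) < contrastive_loss(images, shuffled_texts))
```

The loss is small when each image is closest to its own caption and large when the pairs are mismatched, so minimizing it pulls matching pairs together and pushes the rest apart. CLIP's training objective averages this loss with the same loss computed in the text-to-image direction.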

VLAs and World Models

I’m lumping these two together, even though they are different, since they both help us (broadly speaking) comprehend the 3D world around us. I’ve seen both of these kinds of models mentioned by the self-driving car companies (the biggest sponsors of CVPR this year).

VLA stands for vision-language-action model. Like VLMs, VLAs consume images and text as input. But instead of outputting text, a VLA outputs a predicted sequence of actions for, say, a robot or autonomous vehicle to take. There are different ways of modeling actions in a world. GM recently published a blog post on how they’re using VLAs for making sense of vehicle trajectories in autonomous driving, for example.

Simple block diagram from the Wikipedia page on VLAs.

There’s been a lot of buzz around world models recently (e.g., I tried to go to a workshop on world models, but the room was packed and overflowing, so I didn’t bother). Yann LeCun has been a major proponent of world models lately (apparently giving a lecture at Brown recently). When he was at Meta, his lab introduced variants of the JEPA architecture. Fei-Fei Li (founder of World Labs) wrote a blog post recently disambiguating the ways in which the term “world model” is used.

Fun things

That’s enough writing – this blog post is getting pretty long and I don’t want to deliberate too much longer. The cool thing about going to conferences is that you get to travel.

This was the second time I’d been to Denver (first time in 2019, just for fun, between grad school and starting my first job). There’s a lot to do in Denver. I went to the Museum of Illusions, which was fun, but short and gimmicky. I also went to Meow Wolf’s Convergence Station. I don’t really know what to call it other than an immersive exhibit. You should read about its history: it was started by an anarchist artist collective but is now incorporated and has multiple locations, including in Santa Fe and Las Vegas.

If you didn’t know, Denver is also close to the mountains. I’d already been up Pikes Peak in the Rocky Mountains during my last trip. This time I took a short trip to Boulder. It’s really easy to get there from Denver by bus, and both the station and buses are pretty nice. The bus takes you very close to CU Boulder, and from there, Chautauqua Park is within walking distance. There are hiking trails of varying intensity levels that all start there.

Ending thoughts

After my trip to ICML in 2023, I didn’t think the opportunity to go to a premier machine learning conference would come around again, but I was wrong! I’m grateful to my employer for funding me and a few of my colleagues to attend and learn. My attendance was also another reminder that these conferences are big and this world is small – more than once, I ran into former peers of mine while wandering around the conference venue.

If you get the chance to attend one of these conferences, you should jump at it. Conference proceedings are always made available afterwards, but it’s hard to replace the experience of directly being able to engage with researchers at the frontier of machine learning.