My experience at CVPR 2026


I attended CVPR 2026 in Denver, CO – my second machine learning conference after ICML in 2023, and my second conference of the year after JMM 2026 in Washington, DC (where I was an invited speaker). I had wanted to write about that experience, but that ship has sailed, so I thought I’d write about this one instead!

Along with ICCV and ECCV, CVPR is considered one of the top three venues for computer vision research. This year there were 16,000+ paper submissions, 4,089 accepted papers, and 10,000+ attendees, so there’s a lot going on at these conferences. My original research background is not in machine learning and I do not claim expertise in frontier AI/ML research, as excited as I get being surrounded by this stuff (especially by the mathematics of it). So this is just my perspective, but I’m always happy to receive feedback and corrections if something is misstated or misrepresented (email me!).

During the course of the conference I took notes scattered across Obsidian, Signal notes-to-self, and pictures of slides and posters. It’s obviously hard to capture everything I experienced in a blog post, but I’m going to try writing down what I found most interesting or memorable (and not necessarily just going for what’s trendy).

Conference Structure

As I mentioned, there’s a lot that happens here, so I’ll start with a bird’s-eye view (you can skip this if you already know how these work). CVPR ran from Wednesday, June 3 to Sunday, June 7. The first two days are for workshops and tutorials, and the remaining three days are for the main conference. Each workshop focuses on a specific topic and is organized by a small group of researchers. Each has its own submission process, in which the organizers decide which submissions are accepted for oral presentation. Workshop proposals must themselves be submitted to and accepted by CVPR in order to be held. Depending on the physical venue (though I find that most convention centers are the same), workshops happen in breakout-style rooms with attendance of maybe 200-500 people at a time. To me, they feel much more community-oriented than oral presentations at the main conference. I didn’t attend any tutorials, so I’m not going to comment on those.

The Denver Convention Center. Yes that is a blue bear outside the convention center.

Then there are the main conference days. These mainly consist of oral presentations, poster sessions, and keynotes (though a few other activities happen as well). At this conference there were about two poster sessions and two oral sessions, happening at non-overlapping times. Each oral session consists of four concurrently running tracks, each with a particular theme (for example, “multimodal vision” or “generative diffusion modeling”). The poster sessions happen in a big, open space in an exhibit hall, with a few hundred posters out at a time. I really like these sessions because as an attendee you’re free to walk around, pick out what interests you, and actually have a conversation with the author(s) about their work (and maybe even connect afterwards!). By contrast, oral sessions happen in auditorium-sized rooms with capacity in the thousands. Each talk is 10-15 minutes long and there’s time for about one question after each. It makes sense that this is the format, since acceptance for oral presentation is much more selective. But it does make the format feel performative.

There are about 4-8 keynote talks that happen during the main conference. These talks are typically targeted towards a broad audience and don’t get into too much technical detail (and they’re not necessarily related to computer vision), so they tend to be interesting. For example, this year there was a talk on the state of quantum computing.

These days, CVPR and conferences of a similar tier use mobile apps to help their attendees network and connect with one another. This year CVPR used Cvent. Its people search was good, but I didn’t find the rest of the user experience to be very good. A colleague of mine who also attended pointed me to this webapp that someone built (source code). It let you run keyword searches over workshops and tutorials. I forked it and used Claude Code to add a semantic search feature.

To keep it brief, I’ll mainly go through some of the workshops I attended, since I had more energy to take notes on them as they were earlier in the conference. For the main conference I focused more on the poster sessions (for reasons mentioned above).

Workshops

One of the first workshops I attended was on embedded vision, which is all about running vision and image processing algorithms on embedded devices that have tight size, weight, or power constraints. I listened in on most of the session, but these were a few presentations that I noted down:

Early-exit with MobileNet extractor and SSD head.

I sat in on the XAI4CV (Explainable AI for Computer Vision) workshop and listened to:

The B-cos transform explained

They use these together with sparse encoders to propose a new model called FaCT, which they claim performs better than SoTA on interpretability metrics for certain image classification tasks.

I spent quite a bit of time at the Humans of Generative AI workshop, the content of which I found to be distinct from everything else at this conference. It originally caught my attention because of this post. Some of it was related to computer vision, some not, but it drew a lot of interest from security and privacy researchers.

The Synthetic Data for Computer Vision workshop had a lot of interesting stuff.

Infinigen render from their hello world example outside the convention center.

There was a Maritime Computer Vision Workshop that I briefly sat in on but didn’t get much from.

Of the oral sessions, the most memorable one I attended was on visual security, which had several talks on watermarking. Of those, this one stood out the most, since it seemed to assume a pretty aggressive threat model.

Things I want to learn more about

There was a lot of information to take in during this conference, including lots of exciting math. These are just the first few things that came to mind, and the list is definitely non-exhaustive:

Flow matching

Flow matching was mentioned on several posters. This conference was the first time I’d heard of it. I asked one of the poster presenters to explain the concept to me, and I feel like I got something from the explanation. I did a little reading about it afterwards and discovered that the technique was introduced in this paper by Meta AI.

As far as I understand, flow matching is a new generative AI technique that lets one sample from complicated distributions by sampling from noise and transforming the result in a continuous manner. There are some blog posts out there that explain it. It borrows the concept of a flow used in geometry. The first thing I thought of when reading about flows was homotopies. When restricted to a finite interval, flows are technically homotopies, but they obviously carry much more structure than just topology.
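Here is a minimal one-dimensional sketch of the idea, under strong simplifying assumptions that are mine and not the paper's: the target distribution is a plain Gaussian, the path between a noise sample and a data sample is a straight line, and the "model" is a separate least-squares line fit at each time step rather than a neural network. All names here (fit_velocity, train, sample, TARGET_MEAN, TARGET_STD) are hypothetical. Training regresses a velocity onto the difference between the data sample and the noise sample; sampling starts from fresh noise and follows the learned velocity with Euler steps.

```python
import random
import statistics

def fit_velocity(pairs, t):
    """Least-squares fit of v(x) = a + b*x to the targets (x1 - x0) at time t."""
    xs = [(1 - t) * x0 + t * x1 for x0, x1 in pairs]  # points on the straight-line path
    us = [x1 - x0 for x0, x1 in pairs]                # velocity of each straight-line path
    mean_x, mean_u = statistics.fmean(xs), statistics.fmean(us)
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (u - mean_u) for x, u in zip(xs, us))
    b = cov / var
    a = mean_u - b * mean_x
    return a, b

def train(pairs, steps):
    """Fit one (a, b) velocity model per time step t = 0, 1/steps, ..., (steps-1)/steps."""
    return [fit_velocity(pairs, k / steps) for k in range(steps)]

def sample(field, rng, n):
    """Draw noise and push it along the learned velocity field with Euler steps."""
    h = 1.0 / len(field)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for a, b in field:
        xs = [x + h * (a + b * x) for x in xs]
    return xs

rng = random.Random(0)
TARGET_MEAN, TARGET_STD = 3.0, 0.5  # assumed toy "data" distribution

# Training pairs: (noise sample, data sample), drawn independently.
pairs = [(rng.gauss(0.0, 1.0), rng.gauss(TARGET_MEAN, TARGET_STD)) for _ in range(4000)]
field = train(pairs, steps=50)

generated = sample(field, rng, 2000)
print("mean:", statistics.fmean(generated), "std:", statistics.stdev(generated))
```

The generated samples should land close to the target mean and standard deviation, up to sampling noise and Euler discretization error. A line fit is only adequate here because both distributions are Gaussian; for images the velocity field would need a far more expressive model.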

Quantum computing (in general)

This one was a wildcard, but there was a great keynote talk by IBM’s CTO for quantum computing on the general state of the field. I was chatting with Claude while sitting in this talk trying to understand the basic concepts. This is what I’d gotten while doing this:

This talk reminded me of my peer’s attempt at understanding quantum computing. I was also reminded of the fact that the startup Quantinuum IPO’d recently.

Contrastive loss and learning

Now this one’s been around for a long time. It is by no means new, and I haven’t had a good reason to learn exactly what it is, but I saw it around CVPR enough to want to understand it. “Contrastive” is literally the C in CLIP, a very popular image/text embedding model released by OpenAI. Contrastive learning is the process of training a model by pulling the representations of similar samples together in a vector space and pushing the representations of dissimilar ones apart (samples are contrasted with one another). Encord has a good blog post on contrastive learning.
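As a rough illustration, here is a small pure-Python sketch of one common contrastive objective, often called InfoNCE. The toy two-dimensional "embeddings" and the names cosine, info_nce, images, and texts_aligned are made up for this example, and the loss here only runs in one direction (anchor to positive), whereas CLIP uses a symmetric version over images and text. Within a batch, the i-th anchor is supposed to match the i-th positive, and every other positive in the batch acts as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of picking positives[i] for anchors[i] among all positives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy embeddings: each "text" vector points roughly the same way as its "image".
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same texts, but paired with the wrong images.
texts_shuffled = [texts_aligned[1], texts_aligned[2], texts_aligned[0]]

aligned_loss = info_nce(images, texts_aligned)
shuffled_loss = info_nce(images, texts_shuffled)
print("aligned:", aligned_loss, "shuffled:", shuffled_loss)
```

The loss is low when matching pairs already sit close together and high when they don't, so minimizing it during training is what pulls similar samples together and pushes dissimilar ones apart.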

VLAs and World Models

I’m lumping these two together, even though they are different, since they both help us (broadly speaking) comprehend the 3D world around us. I’ve seen both of these kinds of models mentioned by the self-driving car companies (the biggest sponsors of CVPR this year).

VLA stands for vision-language-action model. Like VLMs, VLAs consume images and text as input. But instead of outputting text, a VLA outputs a predicted sequence of actions for, say, a robot or autonomous vehicle to take. There are different ways of modeling actions in a world. GM recently published a blog post on how they’re using VLAs for making sense of vehicle trajectories in autonomous driving, for example.

Simple block diagram from the Wikipedia page on VLAs.

There’s been a lot of buzz around world models recently (e.g., I tried to go to a workshop on world models, but the room was packed and overflowing, so I didn’t bother). Yann LeCun has been a major proponent of world models lately (apparently giving a lecture at Brown recently). While he was at Meta, his lab introduced variants of the JEPA architecture. Fei-Fei Li (founder of World Labs) wrote a blog post recently disambiguating the ways in which the term “world model” is used.

Fun things

That’s enough writing – this blog post is getting pretty long and I don’t want to deliberate too much longer. The cool thing about going to conferences is that you get to travel.

This was the second time I’d been to Denver (first time in 2019, just for fun, between grad school and starting my first job). There’s a lot to do in Denver. I went to the Museum of Illusions, which was fun, but short and gimmicky. I also went to Meow Wolf’s Convergence Station. I don’t really know what to call it other than an immersive exhibit. You should read about its history: it was started by an anarchist artist collective but is now incorporated and has multiple locations, including in Santa Fe and Las Vegas.

If you didn’t know, Denver is also close to the mountains. I’d already been up Pikes Peak in the Rocky Mountains during my last trip. This time I took a short trip to Boulder. It’s really easy to get there from Denver by bus, and both the station and buses are pretty nice. The bus takes you very close to CU Boulder, and from there, Chautauqua Park is within walking distance. There are hiking trails of varying intensity levels that all start there.

Ending thoughts

After my trip to ICML in 2023, I didn’t think the opportunity to go to a premier machine learning conference would come around again, but I was wrong! I’m grateful to my employer for funding me and a few of my colleagues to attend and learn. My attendance was also another reminder that these conferences are big and this world is small – more than once, I ran into former peers of mine while wandering around the conference venue.

If you get the chance to attend one of these conferences, you should jump at it. Conference proceedings are always made available afterwards, but it’s hard to replace the experience of directly being able to engage with researchers at the frontier of machine learning.