Vision Language Model Analysis of 2024 Political Ads

May 2025 · Independent Research Contribution · Wesleyan Media Project


Introduction

My lab partner, Alex, and I were both interested in exploring how Vision Language Models (VLMs) could be used to track aspects of video ads that have been hard to examine efficiently at a large scale without expensive measures like human coding. I set out to use VLMs to monitor how national candidates are portrayed in their opponents’ advertisements. This initial exploration focuses on depictions of Trump and Harris, using VLMs to measure variables like facial expression, background color, and the color warmth of the ads. I also wanted to compare performance across different models and explore how the configuration of these models may affect their reliability. My work used the Kosmos and Gemini APIs.

Alex was interested in using a model to identify billionaires, specifically Elon Musk, and track how often they were featured in political ads. In the 2024 campaign, political ads seemed more important than ever: the race changed dramatically over the summer, and with a practically new campaign on a very compressed timeframe, ads became an even more prominent tool. We felt it was important to examine how the candidates were being portrayed, as well as who else was being featured alongside them.

This article is abridged for Rowan’s website and is edited to include only his contributions to the project.

Data

The two datasets used were sourced from Meta’s Ad Library and compiled by the WMP into a research-usable format hosted in Google Cloud and accessible via BigQuery. Rowan’s dataset contained Trump and Harris video ads that featured the opposing candidate. A limitation of these datasets is their scope: because they cover only social media ads from Meta, we cannot guarantee comprehensive coverage of all ads featuring Trump, Harris, or Musk across television or other social media platforms.
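For reference, pulling one of these tables into Python looks roughly like the sketch below; the project, dataset, table, and column names are placeholders rather than the actual WMP schema.

```python
# Minimal sketch: querying a WMP ad table from BigQuery into pandas.
# The project, dataset, table, and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-wmp-project")  # hypothetical project ID

query = """
    SELECT ad_id, sponsor, video_uri, spend
    FROM `my-wmp-project.meta_ads.video_ads`  -- hypothetical table
    WHERE sponsor IN ('Trump', 'Harris')
"""
ads = client.query(query).to_dataframe()
print(f"Pulled {len(ads)} video ads")
```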

Methods

We both employed a two-part strategy to analyze our ads. First, since most of the ads were videos and the VLMs could only analyze individual image files, we needed to turn those videos into collections of frames. Using cv2 and Google Drive, the script created a folder for each new video, then used cv2 to capture each frame and save it to that folder in a loop that terminated once every frame in the video had been read. The whole script looped until no videos were left in the dataset.
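A minimal sketch of that extraction loop, assuming the videos have already been downloaded locally (or synced through a mounted Google Drive); the file paths are illustrative.

```python
import os
import cv2

def extract_frames(video_path: str, out_dir: str) -> int:
    """Save every frame of one video as a JPEG in its own folder."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    count = 0
    while True:
        ok, frame = cap.read()   # grab the next frame
        if not ok:               # terminate once every frame has been read
            break
        cv2.imwrite(os.path.join(out_dir, f"frame_{count:05d}.jpg"), frame)
        count += 1
    cap.release()
    return count

# Outer loop: one folder of frames per video in the dataset.
for video_file in os.listdir("videos"):
    name, _ = os.path.splitext(video_file)
    extract_frames(os.path.join("videos", video_file), os.path.join("frames", name))
```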

However, we quickly ran into issues when trying to process all of these frames: at the default 24 frames per second, a minute of footage becomes 1,440 frame-processing requests. When everything was run together, the estimated completion time was unfeasible. Figuring we wanted to complete this project before the next presidential election, we switched our approach to scene-based capture rather than frame-by-frame extraction. Initially, we used Gemini’s shot detection to detect when the scenes changed and captured the first frame of each scene. However, when browsing the results, we discovered that Gemini would consider the transition as part of the scene: if an ad used a “flash to white” transition between scenes, for example, we would get a white frame as the first frame of the scene. PySceneDetect was able to detect transitions and capture more representative frames.
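A sketch of the scene-based capture is below. PySceneDetect’s ContentDetector and the choice of the middle frame of each scene are assumptions about one reasonable setup, not the exact parameters of our pipeline.

```python
import os
import cv2
from scenedetect import detect, ContentDetector

def extract_scene_frames(video_path: str, out_dir: str) -> int:
    """Save one representative frame per detected scene."""
    os.makedirs(out_dir, exist_ok=True)
    scenes = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    for i, (start, end) in enumerate(scenes):
        # Take the middle frame so a transition (e.g. a flash to white)
        # at the scene boundary is not saved as the scene's representative.
        middle = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, middle)
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(os.path.join(out_dir, f"scene_{i:03d}.jpg"), frame)
    cap.release()
    return len(scenes)
```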

Our methods for answering our individual questions were logistically similar but technically different. We gave Gemini a very structured prompt to ensure the data was uniform and output in a way that could be analyzed more objectively. We created five variables for the model to return as a list formatted [contains_person, brightness, red_blue, warmth, expression]. The first variable was a yes/no binary asking whether Harris or Trump appeared in the advertisement (changing depending on who was sponsoring the advertisement). We had originally planned to have the model skip all other variables when this was 0, but it became confused at times, failing to output the 0 or returning N/A values even when the person was identified. We instead ran two queries: one asking whether the person was in the image, and a second covering brightness, background color, warmth, and facial expression, plus an added variable for whether Gemini believed the person had been photoshopped onto a different background. (This was more reliable, but also more expensive.) All variables were binary except expression, which prompted the model to return a specific string from a given list: happy, sad, serious, angry, or neutral. When running over its isolations of Trump and Harris, Gemini would occasionally correct itself and say that there was no picture of Harris/Trump in the advertisement; these responses could not be parsed and were therefore automatically removed. Gemini also had to be told to ignore all text in the image.
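The second-pass query looked something like the sketch below, written against the google-generativeai client. The model name and exact prompt wording are illustrative; the output format mirrors the variable list described above, and unparseable responses are dropped.

```python
import ast
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

PROMPT = (
    "Ignore all text in the image. For the depiction of Kamala Harris, return "
    "ONLY a Python list: [brightness, red_blue, warmth, edited_background, expression]. "
    "The first four values are 0 or 1; expression must be one of "
    "'happy', 'sad', 'serious', 'angry', 'neutral'."
)

def label_frame(frame_path: str):
    """Run the structured prompt on one frame and parse the returned list."""
    image = PIL.Image.open(frame_path)
    response = model.generate_content([PROMPT, image])
    try:
        return ast.literal_eval(response.text.strip())  # e.g. [1, 0, 1, 0, 'serious']
    except (ValueError, SyntaxError):
        return None  # unparseable responses (e.g. "Harris is not pictured") are removed
```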

Results

Data for the Gemini analysis was divided into three separate categories: Harris ads showing Trump, Trump ads showing Harris, and any national ad showing Elon Musk. Each category was then analyzed for brightness, warmth, background color, and expression.

We first wanted to test the general reliability of the AI. Table 1 shows the results of running Gemini three times on video ad frames containing Harris at its default settings, temperature 1.0 and Top-K 50 (Gemini usually defaults Top-K to around 40). We then adjusted the settings so the model considered only the most likely token (Top-K = 1) with the temperature lowered to 0.3, and ran the analysis three more times. Each cell reports the percentage of responses for that variable that differed across the three runs. A lower Top-K and temperature led to far more homogeneity in responses.

Table 1: How Temp + Top-K Adjustments Affected Inconsistencies in Variable Responses
Variable     | Temp: 0.3, Top-K: 1 | Temp: 1.0, Top-K: 50
Brightness   | 0.5%                | 5.5%
Edited BG    | 1.0%                | 5.0%
Red/Blue BG  | 1.0%                | 10.6%
Warmth       | 9.5%                | 25.6%
Expression   | 4.5%                | 14.1%
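The two settings compared in Table 1 correspond to the generation parameters sent with each request. A minimal sketch with the google-generativeai client, assuming that client and an illustrative model and prompt:

```python
import PIL.Image
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

strict = genai.GenerationConfig(temperature=0.3, top_k=1)   # low-variance settings
loose = genai.GenerationConfig(temperature=1.0, top_k=50)   # default-like settings

image = PIL.Image.open("scene_003.jpg")  # placeholder frame
for config in (strict, loose):
    response = model.generate_content(
        ["In one word, what is the facial expression of the person pictured?", image],
        generation_config=config,
    )
    print(config.temperature, config.top_k, response.text.strip())
```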

Table 2 shows that both the number of inconsistent responses and the degree to which they were inconsistent varied greatly with the Top-K and temperature adjustment.

Table 2: Number of Variables Changed When the Analysis Was Run 3 Times
Variables changed | Temp: 0.3, Top-K: 1 | Temp: 1.0, Top-K: 50
No changes        | 168                 | 100
1 change          | 29                  | 81
2 changes         | 2                   | 14
3 changes         | 0                   | 4

In the responses to Harris’ video depictions with Top-K and temperature reduced, neutral and serious were confused 14 times, neutral and unsure twice, and happy and neutral twice. With the more variable Top-K and temperature, neutral and serious were confused 52 times, and happy was confused with neutral and with unsure twice each. It is possible (but unconfirmed) that running the tests all at once in the same query allows the model to carry context forward and stay more consistent. When rerunning the same queries multiple times on testing subsets of the data, the results remained consistent.
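Table 2 was built by tallying, for each frame, how many of the five variables disagreed across the three runs; a sketch of that tally is below, with the pandas layout and column names as assumptions.

```python
import pandas as pd

VARIABLES = ["brightness", "edited_bg", "red_blue", "warmth", "expression"]

def count_changes(run_a: pd.DataFrame, run_b: pd.DataFrame, run_c: pd.DataFrame) -> pd.Series:
    """For each frame (row), count the variables that are not identical across all three runs."""
    changed = pd.DataFrame(index=run_a.index)
    for var in VARIABLES:
        changed[var] = ~((run_a[var] == run_b[var]) & (run_b[var] == run_c[var]))
    return changed.sum(axis=1)

# count_changes(...).value_counts() gives the "no changes / 1 change / ..." rows of Table 2.
```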

The ads for Trump and Harris generally focused on neutral or serious expressions. Harris was far more likely than Trump to be depicted as happy, and Trump was more likely to be shown as angry. This tracks with their messaging: Trump attempted to label Harris as “Cackling Kamala,” and Harris wanted to make her opponent seem vengeful and bitter. Note that there were no instances of “sad.”

Chart 3: Each Candidate’s Depicted Emotions as a Percentage of Their Opponent’s Overall Spend
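Chart 3 (and the later spend-based charts) weights each labeled frame by the spend on the ad it came from. A rough sketch of that weighting is below, with the column names and the per-frame spend allocation as assumptions.

```python
import pandas as pd

def emotion_share_by_spend(frames: pd.DataFrame) -> pd.DataFrame:
    """frames: one row per labeled frame with 'sponsor', 'expression', and 'spend' columns.

    Returns each expression's share of the sponsor's total ad spend.
    """
    spend_by_emotion = frames.groupby(["sponsor", "expression"])["spend"].sum()
    total_by_sponsor = frames.groupby("sponsor")["spend"].sum()
    return spend_by_emotion.div(total_by_sponsor, level="sponsor").unstack(fill_value=0)
```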

It should also be noted that happiness was weakly associated with brighter rather than darker framing (Chart 4). Both candidates were more likely to be seen in brighter environments in video ads, but Harris was rarely ever depicted in darkness. Many of the clips of her came from interviews in well-lit environments.

Chart 4: Brightness Levels Compared to Happiness Across Candidates

Harris was far less likely than Trump to be edited out of her original environment in video ads, according to Gemini: under 1.5% of instances compared to Trump’s 16.9%. When she was edited out of her environment, Harris was more likely than Trump to be placed on blue backgrounds in video ads, while both were equally likely to be placed on red backgrounds. A popular Trump photo ad, aired far more than any of the edited Harris ads, depicted Harris in a swampified DC. The spending on that ad, along with other photo ads, was significant enough that Harris appeared photoshopped in 21.26% of the instances of Trump-aired photos of her overall.

Chart 5: Background Color of Photoshopped Images with Each Candidate

Trump was more likely to be depicted in colder colors, and angry depictions of Trump specifically were the coldest candidate-expression combination. The results here are the opposite of what we might typically expect: cold colors are supposed to be more subliminally relaxing, and warm colors more activating. Harris was much more likely to be warmly lit when she was serious or neutral, and Trump was more warmly lit when happy.

Chart 6: Warmth Levels in Trump and Harris Ads by Expression as Percent of Total Spend

Kosmos took hours to process the emotions for the frames in which Harris appeared, but the nature of its bounding boxes meant it had the potential to be better at assessing individual expressions: if two people were on screen (Trump would occasionally have a picture of himself at the bottom of the screen), it would not conflate the emotions of the two individuals. Unfortunately, it was poor at outputting results as a list or even a single word, and its outputs often contained no answer at all, making them impossible to parse accurately. If the model can be improved in this regard, it has the potential to make a meaningful contribution to this research area.
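For anyone wanting to try the grounding approach, the open Kosmos-2 checkpoint can be queried through Hugging Face transformers roughly as sketched below; this may differ from the API we accessed, and the prompt wording is illustrative.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224")
model = AutoModelForVision2Seq.from_pretrained("microsoft/kosmos-2-patch14-224")

image = Image.open("scene_003.jpg")  # placeholder frame
prompt = "<grounding> The facial expression of the woman is"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=64)
raw = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# post_process_generation splits the raw output into plain text plus grounded
# entities, each carrying normalized bounding boxes for the person described.
caption, entities = processor.post_process_generation(raw)
print(caption)
print(entities)  # e.g. [('the woman', (span_start, span_end), [(x1, y1, x2, y2)])]
```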

Elon Musk was depicted as far happier than Trump or Harris. Over 56% of the spending on Musk video and photo ads showed him expressing happiness, according to Gemini, and over half of those depictions were in darker light. Both of those findings are the opposite of what we found for the two presidential candidates. It could be theorized that Trump wanted to tout Musk as a loving, genius ally who would happily work toward the MAGA agenda, while Harris wanted to portray him as a self-serving billionaire who derived joy from doing harmful things to the country. Musk’s self-assigned description as “dark MAGA” likely furthered this depiction.

Conclusion

Our research highlights both the advancements and the limitations of modern VLMs when it comes to political advertising. Gemini was fairly competent at recognizing where Harris or Trump appeared in the advertisements, likely helped by the facial recognition framework Google has built up in Google Photos over the years. It was less consistent at identifying the emotion displayed and could become confused when it needed to distinguish subtle differences in expression, like stern versus indifferent. Lowering Gemini’s Top-K and temperature tightened the margin of error and made the results more reliable.

With more time and resources, future research could integrate facial recognition tools or analyze ads across multiple platforms, such as Google or television, to compare framing strategies. For now, our findings show that VLMs are useful for making observations about the more under-the-hood aspects of visual ads. They also highlight the importance of prompt engineering, model choice, and framing strategy when employing these models on real-world political content.

This project, the first of its kind, shows promising potential for large-scale messaging and framing analysis by VLMs. In the immediate term, this strategy can continue to inform how very public figures are depicted in media, such as how Trump and Harris depict themselves in their advertisements. Another next step would be to widen the scale of this analysis: seeing whether Google’s vision recognition could handle Senate or House races with less identifiable figures, or whether the model could be trained to perform this analysis on local races of any kind.


HTML formatting, including charts, tables, and scaling, by Rowan Cahill.