Rei Llazani
— Intro
Art is one of the most subjective of subjects.
The single, isolated question we explore in this short article is how well existing AI models can interpret art in a way that emulates a human's subjective interpretation.
The answer to this question is a stepping stone toward a more critical question: whether, and how, AI can "feel".
— The Method
Twenty qualified art experts, authorities, and connoisseurs were tasked with independently evaluating 75 paintings. We will call this group “Curators”.
Three well-known LLMs were fed the same 75 paintings.
We will call this group “AI”.
Instructions to both groups were identical. In short: score each painting on each of six parameters, on a scale from 1 to 10.
Importantly, all 75 paintings were original, non-public, previously inaccessible works by independent, non-famous, living artists. These conditions matter because the input data (the paintings) must never have been cited, reviewed, or discussed anywhere, online or offline, so that neither group could be influenced by what anyone else may have said or written about a piece or its creator. The Curators judged the pieces individually, without any outside help or influence.
— The Experiment & Results
There are three main steps to achieve the objective:
The first step is to standardize the Curators' results via Z-score normalization: each curator's scores are shifted by that curator's own mean and divided by their own standard deviation. This puts every curator on a common scale, so that one curator's version of a 7 out of 10 is comparable to another curator's version of a 7 out of 10, and lets us determine whether there is positive correlation and alignment among the curators themselves. The purpose of normalizing this portion of the data is to ascertain whether we can confidently assign the Curator group, without significant deviations, an agreed cumulative "answer". This is, more or less, the "human answer": humanity's response to the artworks.
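The per-curator Z-score normalization described above can be sketched as follows. This is a minimal illustration using randomly generated scores in place of the study's actual (unpublished) data; the array shapes match the setup of 20 curators rating 75 paintings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw scores: 20 curators x 75 paintings, each on a 1-10 scale.
raw = rng.integers(1, 11, size=(20, 75)).astype(float)

# Z-score normalization per curator: subtract that curator's mean and divide
# by that curator's standard deviation, so a "7" from a harsh grader and a
# "7" from a generous grader land on a comparable scale.
z = (raw - raw.mean(axis=1, keepdims=True)) / raw.std(axis=1, keepdims=True)

# After normalization, every curator has mean 0 and standard deviation 1.
print(np.allclose(z.mean(axis=1), 0.0), np.allclose(z.std(axis=1), 1.0))
```

Averaging the rows of `z` then yields a single normalized "consensus" score per painting, the cumulative human answer referred to above.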
The results show a moderate, positive level of agreement among the curators across the 75 paintings, with some paintings receiving more consistent scores than others. There were a few instances of high disagreement, but the majority of scores suggest a reasonable level of consensus.
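One simple way to quantify this kind of inter-rater agreement is the mean pairwise Pearson correlation between curators. The sketch below uses synthetic data (a shared "consensus" signal plus per-curator noise, not the study's actual scores) to show the computation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical normalized scores: 20 curators x 75 paintings, built as a
# shared consensus signal plus individual noise, so raters partly agree.
consensus = rng.normal(size=75)
scores = consensus + rng.normal(scale=1.0, size=(20, 75))

# Pairwise Pearson correlations between curators; the mean of the
# off-diagonal entries is a simple summary of inter-rater agreement.
corr = np.corrcoef(scores)
off_diag = corr[~np.eye(20, dtype=bool)]
print(f"mean pairwise correlation: {off_diag.mean():.2f}")
```

A mean pairwise correlation near 0 would indicate no shared taste, while values in the middle of the range correspond to the "moderate agreement" described above.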
The second step is to analyze the results of the AI group.
The analysis was replicated exactly as previously done with the human group. The AI models also show a generally moderate level of agreement among each other, similar to Group #1 (the Curators), with some artworks receiving more consistent evaluations than others.
To summarize, we have determined that both groups show solid internal consistency: within each group, the raters mostly agree with one another.
The third and final step is to compare the two groups.
The correlation matrix shows the relationship between the average Z-scores of the Curators Group and AI Group across different metrics. A higher positive correlation indicates that the two groups tend to evaluate the paintings similarly, while a negative correlation suggests the opposite.
In summary, some metrics show strong agreement between humans and AI (particularly technical skill and composition), while others show little to no agreement (particularly context/relevance).