Cortical Network Responses and Visual Semantics of Movie Fragments

Post by Stephanie Williams

What's the science?

Previous neuroscience research has rigorously investigated neural processing of so-called “low-level” visual features, such as moving lines and dot patterns. Recently, it has become possible to investigate more “naturalistic” stimuli of the kind humans encounter in daily life; in neuroimaging experiments, these stimuli might consist of real films or natural sounds. Higher-level concepts can be extracted from these more complex naturalistic stimuli, such as whether people are present or absent in a particular frame of a film. This week in Scientific Reports, Berezutskaya and colleagues develop a procedure for extracting high-level semantic concepts from a film and use a neural encoding model to predict cortical responses in an electrocorticography dataset.

How did they do it?

Patients with medication-resistant epilepsy who had electrodes implanted in their brains for clinical purposes were shown a short film, presented in 30-second chunks. The authors analyzed whether they could map the high-level information from the movie onto the participants’ neural responses. To extract the high-level semantic information, the authors developed a three-part procedure: 1) they applied a visual concept recognition neural network model to extract visual concepts, 2) they used a word embedding language model to extract semantic relationships, and 3) they used dimensionality reduction techniques to capture the components that represented the majority of the variance of the extracted concepts. To extract the visual concepts, the authors used a commercial computer vision model that processed the raw pixel information from the film and ranked the most likely concept labels by probability. The visual concepts included object names (e.g., camera, TV) as well as abstract concepts such as emotions and qualities. Next, to extract high-level semantic information, they applied an artificial neural network language model to learn word embeddings (mathematical representations of words). The output of the language model was a semantic vector for each frame that represented a set of linguistic and semantic ties between the words corresponding to the visual concepts. For example, the output might represent the presence or absence of characters in a frame of the movie, or motion versus stillness.
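To make the embedding step concrete, here is a minimal Python sketch of how per-frame concept labels could be turned into semantic vectors using pretrained word embeddings. The labels, the probability-weighted averaging, and the embedding file are illustrative assumptions, not the authors’ exact pipeline (they used a commercial vision model and their own language model).

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical per-frame output of a visual concept recognition model:
# each frame gets a list of (concept_label, probability) pairs.
frame_concepts = [
    [("person", 0.92), ("face", 0.81), ("indoors", 0.40)],
    [("camera", 0.75), ("tv", 0.60)],
]

# Pretrained word embeddings; the file name is a placeholder.
wv = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def frame_semantic_vector(concepts, wv):
    """Probability-weighted average of the concept-label embeddings for one frame."""
    vecs, weights = [], []
    for label, prob in concepts:
        if label in wv:  # skip labels missing from the embedding vocabulary
            vecs.append(wv[label])
            weights.append(prob)
    if not vecs:
        return np.zeros(wv.vector_size)
    return np.average(vecs, axis=0, weights=weights)

# One semantic vector per frame, shape (n_frames, embedding_dim)
semantic_vectors = np.stack([frame_semantic_vector(c, wv) for c in frame_concepts])
```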

The authors performed principal component analysis on the semantic vectors to reduce the dimensionality of the data, and then focused their analysis on the principal components that together explained the majority (70%) of the variance. They then sorted movie frames according to how strongly each frame expressed a particular principal component. Next, the authors used the high-level semantic information to model the neural responses of subjects to each frame of the movie. They fit an encoding model to predict neural responses in the high-frequency band (HFB, 60–120 Hz). To understand the delays in neural processing associated with high-level cognitive information, they tested a series of time shifts between the film and the neural response: 16 in total, 8 before and 8 after stimulus onset. They also created a cortical map of prediction accuracy to understand which regions showed the highest accuracy.
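A compact sketch of this encoding analysis might look like the following, with ridge regression standing in for the authors’ linear encoding model, random arrays standing in for the real semantic vectors and HFB responses, and an assumed lag step size and cross-validation scheme.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Toy stand-ins for the real data (shapes are illustrative).
rng = np.random.default_rng(0)
semantic_vectors = rng.normal(size=(1000, 300))  # (n_frames, embedding_dim)
hfb_power = rng.normal(size=(1000, 60))          # (n_frames, n_electrodes)

# Keep the principal components that together explain 70% of the variance.
X = PCA(n_components=0.70).fit_transform(semantic_vectors)

def shift(X, y, lag):
    """Align stimulus features with responses `lag` frames later (negative = earlier)."""
    if lag > 0:
        return X[:-lag], y[lag:]
    if lag < 0:
        return X[-lag:], y[:lag]
    return X, y

def prediction_accuracy(X, y):
    """Mean cross-validated correlation between predicted and measured responses."""
    pred = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
    return np.mean([np.corrcoef(pred[:, e], y[:, e])[0, 1] for e in range(y.shape[1])])

# 16 time shifts: 8 before and 8 after stimulus onset (step size is assumed).
lags = [lag for lag in range(-8, 9) if lag != 0]
accuracy_by_lag = {lag: prediction_accuracy(*shift(X, hfb_power, lag)) for lag in lags}
best_lag = max(accuracy_by_lag, key=accuracy_by_lag.get)  # analogous to the 320 ms peak
```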

The authors also investigated whether different regions showed a specialization for specific semantic concepts. For each electrode (location), they took the beta weights of their linear encoding model across the principal components and clustered these weights (sketched below). They analyzed whether the resulting clusters of electrodes were characterized by distinct cortical networks, and they extracted the top 5 semantic components for each cluster. To check that their results were driven by semantic concepts rather than by the processing of low-level visual features, the authors also attempted to predict neural responses using only the low-level features. Finally, the authors asked whether successive layers of a visual recognition neural network would show a hierarchical build-up, becoming increasingly similar to the extracted high-level semantic concepts and increasingly predictive of the neural data. They focused this analysis on the pooling layers of a publicly available model trained to recognize objects in images, and compared the neural prediction accuracy of the last pooling layer, which they expected to be sensitive to objects and general shapes, with the prediction accuracy of the semantic concepts.
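The electrode-clustering step could be sketched as follows, with k-means over the per-electrode beta weights as an assumed clustering method (the paper’s exact algorithm and number of clusters may differ) and a random beta matrix as a stand-in.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical beta weights from the linear encoding model:
# one row per electrode, one column per semantic principal component.
rng = np.random.default_rng(1)
betas = rng.normal(size=(60, 30))  # (n_electrodes, n_components)

# Group electrodes by the similarity of their semantic tuning profiles.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(betas)

# For each cluster, rank components by the magnitude of the mean beta weight.
for cluster_id in range(kmeans.n_clusters):
    mean_betas = betas[kmeans.labels_ == cluster_id].mean(axis=0)
    top5 = np.argsort(np.abs(mean_betas))[::-1][:5]
    print(f"cluster {cluster_id}: top semantic components {top5.tolist()}")
```

In the paper, the top components of each cluster were then interpreted semantically, as in the humans/human-faces tuning of the fusiform cluster described below.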

What did they find?

The authors found that naturalistic visual stimuli can be reduced to semantic-concept principal components that are easily interpretable, and that the extracted high-level semantic information captured fundamental distinctions in the film (see figure). When the authors analyzed prediction accuracy for the high-frequency band responses as a function of time shift relative to film onset, they found the highest accuracy at a time shift of 320 milliseconds after stimulus onset. The cortical map of high-frequency band response prediction accuracy showed that the best predictions occurred in occipitotemporal, parietal, and inferior frontal cortex. When the authors clustered electrodes by their beta weights from the linear encoding model, they found that some clusters mapped well onto specific cortical networks. For example, electrodes in cluster #1 were concentrated in a cortical region called the lateral fusiform gyrus, and the two semantic concepts that contributed most to neural activity in this cluster were the presence of humans and of human faces. The authors repeated this analysis for the other clusters, finding distinct semantic-concept specificity for each. These results show that high-level semantic concepts are associated with distinct functional cortical networks.


When the authors examined whether low-level features could be used to make similar neural predictions, they found that prediction accuracy was worse than for predictions made with the semantic information. This finding confirms that the authors’ results were indeed due to semantic processing rather than to low-level features of the film. When the authors analyzed how sequential layers of the visual object recognition model were related to the semantic concepts, they found a gradual increase in similarity from the first to the last intermediate layer of the model. Similarly, when they analyzed the relationship between sequential layers of the model and neural prediction accuracy, they found that the fit to the neural data gradually increased in accuracy. When the authors compared the fit of the last intermediate pooling layer with the fit from the semantic concepts, they found a difference in whole-brain prediction accuracy that favored the semantic components. Together, these results show a gradual emergence of semantic features from lower-level visual information in the visual recognition model.

What's the impact?

This work advances our understanding of how visual information from naturalistic stimuli is interpreted by the human brain. The authors also developed and verified a new method of extracting high-level semantic concepts by combining visual object processing and natural language processing.


Berezutskaya et al. Cortical Network Responses Map onto Data-driven Features that Capture Visual Semantics of Movie Fragments. Scientific Reports. (2020). Access the original scientific publication here.