Translating visual experience between brains
I wrote up a simple thought experiment, and some related questions, about recreating the visual experience of one macaque in another, inspired by face decoding experiments conducted in the Tsao Lab. I wrote an explanation of the Tsao Lab's work on object representation here — probably necessary reading to understand what follows.
Outline
Intro and motivation
Building a translator
How do you know if it worked?
Redundancy in face patch neurons
Is this tractable?
Representational drift
Revisiting assumptions
Intro and motivation
I was motivated to write this because I’m interested in methods to compare representations/experiences between biological brains and the potential limitations of these approaches.
The visual system is well studied, and a perk of picking the visual system for this thought experiment is that there's a pretty workable reference point for reasoning — we can compare multiple self-reports to an external representation (an image) and use that external representation to aid in decoding neural activity. Even though there are individual differences in visual perception, we all develop visual experience through interaction with environments governed by the same physical laws.1 In contrast, an experience evoked by a concept like Burma or Chanel may not be as tractable to decode and compare between subjects in a way that tells us something about the "nature" of that representation.
More so than the example I give below, I'm interested in what an explanation of the limits of comparison could look like, at least with the approach I use — the approach being that you learn which neurons code for your percept of interest, and how, in subject A, then find the correspondence in subject B and perform some transformation to recreate that percept. For example, I'm skeptical that you could recreate subject A's experience in B if B has no context for A's experience.2
Building a translator
Chang and Tsao 2017 decode faces a macaque is viewing. This implies we could (in principle) instantiate arbitrary faces in the macaque’s visual field through stimulating face patch neurons, given sufficiently advanced neurotech.
If we have the information we need to decode faces from neural activity, can we then use the neural activity encoding a face in one macaque and instantiate the same face in another through neurostimulation?
Assumptions:
The neurons we’re stimulating are causally relevant for face perception
Neurotech that can write specific firing rates to hundreds of individual neurons is available (in research land, where this type of neuromodulation actually happens, the search term would not be "neurotech" but something like "holographic optogenetics")
The experiment will involve 2 macaques (A and B), the presentation of face images, and neural recording and stimulation technology.
The easiest, and presumably impossible, case would be if macaques A and B had identical neuroanatomy and prior training data (they had been exposed to all the same faces with the same frequency). Then they may have functionally interchangeable STA axes and origins in their face spaces, meaning A and B’s neurons respond the same to faces.
In this case, you'd just need to stimulate the corresponding neurons in macaque B the same way you recorded them firing in macaque A after presenting it a face. This would also be functionally the same as just playing back A's activity to itself.
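As a minimal sketch of this degenerate case (the array names and the neuron correspondence below are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical "identical anatomy" case: B's stimulation targets are just A's
# recorded rates, routed through a known one-to-one neuron correspondence.
rates_A = np.array([42.0, 17.5, 63.2, 8.9])   # spikes/s recorded from A's face patch neurons
correspondence = np.array([2, 0, 3, 1])       # correspondence[i] = index of B's neuron matching A's neuron i

rates_B = np.empty_like(rates_A)
rates_B[correspondence] = rates_A             # write A's rate to B's matching neuron
# Stimulating B at rates_B is functionally the same as playing A's activity back to itself.
```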
For this thought experiment, we will treat the principal components as conserved, since they were in previous face patch work from the Tsao lab, and assume we have access to prior neural data from macaque B such that we know which features each neuron responds to. We'll also assume the macaques had the same training data, since they did in Chang and Tsao 2017.
But assuming A and B have different neuroanatomy, which I would, building a translator would require a few extra steps:
First, you’d determine the feature values of each facial feature that macaque A viewed.
The feature values would need to be translated to their corresponding spike rates for a neuron that would display ramp-shaped tuning in response to their variation.
Each neuron's activity represents a weighted average of the values of all the features it encodes. So the spike rates (from step 2) for the set of features a given neuron responds to would be averaged (with equal weights, in the simplest case) to get the spike rate needed to stimulate that neuron, i.e. (spike rate(feature 1) + … + spike rate(feature 6)) / 6.
Stimulate the corresponding neurons with their respective spike rates in macaque B.
To return to the image from the beginning, the “translator” part would be doing steps 1-3, and the methods to do so come from Chang and Tsao.
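To make steps 1-3 concrete, here's a minimal sketch in Python. The linear "ramp" tuning model, the per-neuron gains and baselines, and the feature mask are all assumptions for illustration; in practice they'd have to be fit from macaque B's own recordings, in the spirit of Chang and Tsao's axis model.

```python
import numpy as np

# Hypothetical sketch of steps 1-3. All names, shapes, and tuning parameters here
# are illustrative assumptions, not the actual Chang & Tsao pipeline.
rng = np.random.default_rng(0)

N_FEATURES = 50   # e.g. shape + appearance dimensions of the face space
N_NEURONS = 200   # face patch neurons we can record/stimulate

# Step 1: feature values of the face macaque A viewed (assumed already extracted).
face_features = rng.standard_normal(N_FEATURES)

# Step 2: ramp-shaped tuning -- each neuron's rate varies linearly with a feature's value.
# Per-neuron gains and baselines are assumed known from prior characterization of macaque B.
gain = rng.uniform(5.0, 15.0, size=(N_NEURONS, N_FEATURES))  # spikes/s per unit of feature value
baseline = rng.uniform(10.0, 30.0, size=N_NEURONS)           # spikes/s at a feature value of 0

# Which features each of B's neurons responds to (boolean mask, also assumed known).
responds_to = rng.random((N_NEURONS, N_FEATURES)) < 0.1

def target_rates(features, gain, baseline, responds_to):
    """Step 3: per-feature spike rates via linear tuning, averaged (equal weights)
    over the features each neuron encodes, giving that neuron's stimulation target."""
    per_feature_rate = baseline[:, None] + gain * features[None, :]  # shape: (neurons, features)
    n_encoded = responds_to.sum(axis=1).clip(min=1)                  # avoid dividing by zero
    return np.where(responds_to, per_feature_rate, 0.0).sum(axis=1) / n_encoded

rates_for_B = target_rates(face_features, gain, baseline, responds_to)
# Step 4 (not shown): write rates_for_B to the corresponding neurons in macaque B.
```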
How do you know if it worked?
Let's say you built your translator and are pretty confident that you can translate face responses between macaques. How do you know if it actually worked? Fwiw I'm pretty interested in the general form of this question — how do we study perception in cases without self-report?
You could use a behavioral reward system where macaque B is first trained to respond to the presentation of the face you're translating between macaques. Then, when you stimulate face patch neurons in macaque B, if it produces that response, you have some evidence that macaque B may be seeing that face. I don't love this solution because ideally macaque B has never seen the face being translated, but it could be the best option given the epistemic constraints.
Another option, which would be much, much harder, is that you could mimic the no-report paradigm from Hesse and Tsao 2020, where a macaque is trained to track a fixation spot that jumps around the image. This would require a more complex stimulation protocol that includes the visual experience of the moving fixation spot. But if you did observe the macaque moving its eyes in sync with the fixation spot, you'd at least have the same level of certainty as demonstrated in prior work that the macaque was viewing the face, or at the very least the fixation spot.
I’m not sure you’d ever actually know if it worked. Even if you translate the face image to yourself and check on your own, you’re just inverting the translation operation, so you wouldn’t know if it’s correct.
Redundancy in face patch neurons
Chang and Tsao record ~200 adjacent neurons across 3 patches to decode faces, yet this is a tiny fraction of total face patch neurons. To give a rough estimate of total face patch neurons: the ML face patch, one of the regions they recorded from, is estimated to be about 4 mm in diameter, so roughly 33.5 mm^3. The macaque neocortex averages about 160,000 neurons/mm^3. Chang and Tsao were recording from 3 face patches (out of 6 total in macaques), so 200 neurons from a region of interest with about 16,080,000 neurons (160,000 x 33.5 x 3) is about 0.001% of face patch neurons!3
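For transparency, here's the arithmetic behind that estimate (same numbers as above, treating the patch as a sphere):

```python
import math

# Back-of-envelope estimate of total face patch neurons vs. neurons recorded.
patch_diameter_mm = 4.0                                              # ML face patch, ~4 mm across
patch_volume_mm3 = (4 / 3) * math.pi * (patch_diameter_mm / 2) ** 3  # ~33.5 mm^3, modeled as a sphere
neurons_per_mm3 = 160_000                                            # macaque neocortex average
n_patches_recorded = 3                                               # of the 6 macaque face patches
recorded_neurons = 200

total_neurons = neurons_per_mm3 * patch_volume_mm3 * n_patches_recorded  # ~16 million
fraction = recorded_neurons / total_neurons
print(f"total ≈ {total_neurons:,.0f} neurons; recorded ≈ {fraction:.2e} of them ({fraction:.4%})")
# total ≈ 16,084,954 neurons; recorded ≈ 1.24e-05 of them (0.0012%)
```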
Is this tractable?
It seems unlikely that stimulating 200 neurons in a sea of millions involved in the representation of facial features would be causally relevant enough to dramatically change perception. Is it reasonable to think you'd need to stimulate at least half of the relevant neurons that code for facial features — potentially over 15 million neurons (based on the estimate of total face patch neurons above) to which you'd have to write very specific firing rates?
In stark contrast to my intuition, I've talked to a couple of neuroscience PIs who suspect that, because of dense recurrence in these areas, you may only need to stimulate <1,000 neurons, or even <100, to drive meaningful perceptual shifts. Spectacularly, it seems like neuroscientists have been able to alter visual perception in mice by stimulating as few as 2 neurons! These results were for generating percepts corresponding to horizontal or vertical gratings, and it's unclear how vivid or stable these percepts were. A potential counterpoint — Manley et al. 2024 record from nearly 1M neurons at cellular resolution in the mouse dorsal cortex, using light beads microscopy, and find an unbounded scaling of dimensionality.4
That said, for a working BCI, it may be difficult to place your stimulation tool exactly where these causally significant neurons are — assuming that, for different percepts, they're more widely distributed than the reach of your BCI. I'd assume that to recreate a face perception, for example, you'd need an optical BCI capable of single-neuron read/write, which would have a pretty limited reach (500-1,000 microns if you're using 2-photon).5
Representational drift
If we stick to the assumptions of this experiment, eventual representational drift would also make translation incredibly difficult. Even if you understand every neuron’s tuning function and could write to every relevant neuron to instantiate some percept, I wouldn’t expect those tuning functions to be stable.
What individual neurons respond to changes over time, a phenomenon known as representational drift. Memory engrams move around, and neurons that code for components of sensory percepts (like facial features in the IT cortex) change their tuning functions. To make single-neuron stimulation for translation work, you'd need to be continually reading from individual neurons and tracking how their tuning functions change over time.
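As a toy illustration of what that tracking could involve (hypothetical function names and a deliberately simple linear tuning model, not a real experimental protocol): re-fit each neuron's tuning from recent stimulus-response pairs and update the stimulation targets from the latest fit.

```python
import numpy as np

def refit_tuning(feature_values, observed_rates):
    """Re-estimate a neuron's ramp (linear) tuning from recent (feature, rate) pairs.

    feature_values: (n_trials,) values along the neuron's preferred feature axis
    observed_rates: (n_trials,) spike rates recorded on those trials
    Returns (gain, baseline) of the least-squares fit rate = gain * feature + baseline.
    """
    gain, baseline = np.polyfit(feature_values, observed_rates, deg=1)
    return gain, baseline

# Toy example: the same neuron's tuning drifts between two recording sessions.
rng = np.random.default_rng(1)
features = rng.standard_normal(100)
session_1 = 20 + 8.0 * features + rng.normal(0, 1, 100)   # gain ~8, baseline ~20
session_2 = 25 + 5.0 * features + rng.normal(0, 1, 100)   # tuning has drifted

print(refit_tuning(features, session_1))   # ~ (8, 20)
print(refit_tuning(features, session_2))   # ~ (5, 25): stimulation targets must be updated
```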
Revisiting assumptions
It’s possible that the assumptions we started this thought experiment with are incorrect — namely that we should be stimulating face patch neurons to change face perception. Perhaps there is a region that the IT cortex feeds into where significantly fewer neurons can be stimulated to drive specific visual perceptual changes.
Thanks to Janis Hesse, Hunter Ozawa Davis, Raffi Hotter, and Quintin Frerichs for helpful conversations and feedback.
A topological solution to object segmentation and tracking (Tsao and Tsao, 2022) feels like an interesting intuition for this. ↩︎
I also should eat my vegetables and better understand why enactivists would think this whole line of inquiry is foolish (to do: write a critique of this from the POV of a sensorimotor theory of vision and visual consciousness?). ↩︎
Face patch size may vary, and modeling a patch as a sphere may not be totally accurate, but I just wanted to generate a rough estimate. ↩︎
From Alipasha Vaziri’s website where he writes about the findings:
“Widespread application of dimensionality reduction to multi-neuron recordings implies that neural dynamics can be approximated by low-dimensional “latent” signals reflecting neural computations. However, what would be the biological utility of such a redundant and metabolically costly encoding scheme and what is the appropriate resolution and scale of neural recording to understand brain function?
Imaging the activity of one million neurons at cellular resolution and near-simultaneously across mouse cortex, we demonstrate an unbounded scaling of dimensionality with neuron number. While half of the neural variance lies within sixteen behavior-related dimensions, we find this unbounded scaling of dimensionality to correspond to an ever-increasing number of internal variables without immediate behavioral correlates. The activity patterns underlying these higher dimensions are fine-grained and cortex-wide, highlighting that large-scale recording is required to uncover the full neural substrates of internal and potentially cognitive processes." ↩︎

I was also considering this thought experiment in thinking through paradigms of neurotech. This may be a fake paradigm I've made up while contemplating what's possible with neurotech, but I was thinking of this paper and the way I describe the translation approach as a "top-down" paradigm, whereas I'm personally more interested in how something like the corpus callosum works, how people learn new skills, or how biological entities realize they're part of the same whole or share goals through some interactive process. ↩︎