Method
BrainDiVE requires only the training of a differentiable voxel-wise fMRI encoder, which maps RGB images to predicted voxel-wise brain activations (fMRI betas). The encoder is combined with a latent diffusion model (LDM) to generate naturalistic images that are predicted to activate a target set of voxels.
In our framework, we use OpenCLIP ViT-B/16 as the encoder backbone. The output of its last layer is scaled to unit norm, then passed through a linear probe with a voxel-wise bias to predict the fMRI activations. The diffusion model is Stable Diffusion v2-1-base, which generates 512×512 images using ε-prediction (noise prediction). Following the approach proposed by crowsonkb, at each time step we take the first-order DDIM (Euler) predicted output with residual noise, resize the image to 224×224, and pass it through the encoder. The predicted activations of the target voxels are averaged, and the gradient of this objective is backpropagated into the diffusion output.
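As a concrete reference, below is a minimal PyTorch sketch of this encoder, assuming the open_clip package; the pretrained-weights tag, class name, and feature dimension are illustrative assumptions rather than our exact training configuration.

```python
# Minimal sketch of the voxel-wise encoder: a frozen OpenCLIP ViT-B/16
# backbone followed by a per-voxel linear probe. The pretrained-weights
# tag and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import open_clip

class BrainEncoder(nn.Module):
    def __init__(self, num_voxels: int, clip_dim: int = 512):
        super().__init__()
        # Frozen CLIP image backbone; freezing weights still lets
        # gradients flow back to the input image during guidance.
        self.backbone, _, _ = open_clip.create_model_and_transforms(
            "ViT-B-16", pretrained="laion2b_s34b_b88k")
        self.backbone.requires_grad_(False)
        # Linear probe with a voxel-wise bias: one weight row and one
        # bias term per voxel.
        self.probe = nn.Linear(clip_dim, num_voxels, bias=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224), CLIP-normalized RGB.
        feats = self.backbone.encode_image(images)        # (B, clip_dim)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # unit norm
        return self.probe(feats)                          # (B, num_voxels)
```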
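One guided denoising step can then be sketched as follows, assuming the diffusers library and the public stabilityai/stable-diffusion-2-1-base checkpoint; `guided_step`, `voxel_idx`, and the guidance scale are illustrative assumptions, and CLIP mean/std input normalization is omitted for brevity.

```python
# Sketch of one guidance step: form the DDIM-predicted clean latent,
# decode it, score it with the brain encoder, and push the gradient of
# the mean target-voxel activation back into the current latent.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float32)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
scheduler = pipe.scheduler

def guided_step(latents, t, text_emb, encoder, voxel_idx, guidance_scale=1.0):
    latents = latents.detach().requires_grad_(True)
    # eps-prediction (noise prediction) from the UNet.
    eps = pipe.unet(latents, t, encoder_hidden_states=text_emb).sample
    # First-order DDIM / Euler predicted clean latent:
    #   x0_hat = (x_t - sqrt(1 - a_bar_t) * eps) / sqrt(a_bar_t)
    a_bar = scheduler.alphas_cumprod[t]
    x0_hat = (latents - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
    # Decode to pixels and resize to the encoder's 224x224 input
    # (CLIP mean/std normalization omitted for brevity).
    image = pipe.vae.decode(x0_hat / pipe.vae.config.scaling_factor).sample
    image = F.interpolate((image + 1) / 2, size=224, mode="bilinear")
    # Objective: mean predicted activation over the target voxel set.
    score = encoder(image)[:, voxel_idx].mean()
    grad = torch.autograd.grad(score, latents)[0]
    # The standard scheduler step keeps the residual noise; the gradient
    # nudges the trajectory toward higher predicted activation.
    latents = scheduler.step(eps, t, latents).prev_sample
    return (latents + guidance_scale * grad).detach()
```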
Similar to other image synthesis works such as DALL-E, the generated images can be reranked using the encoder itself. Following NeuroGen, we preserve the top 20% of images.
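A minimal sketch of this reranking is shown below; `rerank`, `voxel_idx`, and `keep_frac` are illustrative names rather than our exact implementation.

```python
# Score each generated image with the encoder and keep the top 20% by
# mean predicted activation of the target voxels, as in NeuroGen.
import torch

@torch.no_grad()
def rerank(images: torch.Tensor, encoder, voxel_idx, keep_frac: float = 0.2):
    # images: (N, 3, 224, 224), already preprocessed for the encoder.
    scores = encoder(images)[:, voxel_idx].mean(dim=1)  # (N,)
    k = max(1, int(keep_frac * len(images)))
    top = scores.topk(k).indices
    return images[top], scores[top]
```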