
Transformers output fixed-length sequences. For this transformer they chose a 256-pixel side length, i.e. 32 "image tokens" per side that each decode to an 8-by-8 pixel "patch".
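
Here is a minimal sketch of that token/pixel arithmetic. The 8-pixel patches and 32-token side come from the comment above; the function and variable names are just illustrative.

    def image_side_pixels(tokens_per_side: int, patch_size: int) -> int:
        """Each image token decodes to a patch_size x patch_size pixel patch."""
        return tokens_per_side * patch_size

    tokens_per_side = 32   # 32 image tokens along each side
    patch_size = 8         # each token decodes to an 8x8 pixel patch

    print(image_side_pixels(tokens_per_side, patch_size))  # 256 pixels per side
    print(tokens_per_side ** 2)                            # 1024 tokens in the fixed-length sequence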

You can technically increase or decrease this, or use a different aspect ratio, by using more or fewer image tokens, but the count is fixed once you start training. More image tokens also mean more "decodes" from the backbone VQGAN model (responsible for converting between pixels and image tokens), and thus slower inference.
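
A rough sketch of how the sequence length would change if you picked a different resolution or aspect ratio before training, assuming the same 8-pixel patches (names here are illustrative, not from any library):

    def token_count(width_px: int, height_px: int, patch_size: int = 8) -> int:
        """Number of image tokens the transformer must emit for one image."""
        assert width_px % patch_size == 0 and height_px % patch_size == 0
        return (width_px // patch_size) * (height_px // patch_size)

    print(token_count(256, 256))   # 1024 tokens (the square default)
    print(token_count(384, 256))   # 1536 tokens for a wider 3:2 image
    print(token_count(128, 128))   #  256 tokens for a smaller image
    # A longer token sequence means more autoregressive steps and more
    # VQGAN decoding work, so inference slows down roughly in proportion.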

CLIP-guided VQGAN can get around this by averaging the CLIP score over multiple "cutouts" of the whole image, which allows a broad range of resolutions and aspect ratios.
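
A rough PyTorch sketch of the cutout-averaging idea, assuming a CLIP-style model with encode_image/encode_text methods (as in the openai/CLIP package); the cutout count, sizes, and function name are illustrative, not anyone's exact implementation:

    import torch
    import torch.nn.functional as F

    def average_clip_score(image, text_features, clip_model, n_cuts=16, cut_size=224):
        """image: (1, 3, H, W) tensor decoded from the VQGAN, any resolution >= cut_size."""
        _, _, h, w = image.shape
        cutouts = []
        for _ in range(n_cuts):
            # pick a random square crop between CLIP's input size and the short side
            size = int(torch.randint(cut_size, min(h, w) + 1, ()).item())
            top = int(torch.randint(0, h - size + 1, ()).item())
            left = int(torch.randint(0, w - size + 1, ()).item())
            crop = image[:, :, top:top + size, left:left + size]
            # resize every cutout down to CLIP's fixed input resolution
            cutouts.append(F.interpolate(crop, size=(cut_size, cut_size),
                                         mode="bilinear", align_corners=False))
        batch = torch.cat(cutouts, dim=0)
        image_features = F.normalize(clip_model.encode_image(batch), dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # cosine similarity per cutout, averaged into one guidance score
        return (image_features @ text_features.T).mean()

Because CLIP only ever sees fixed-size cutouts, the image being optimized can be whatever resolution or aspect ratio you like.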


