My point is that AI inference is a memory trade. Batching helps until it doesn’t. Beyond a certain batch size, the KV cache takes over as the limiting factor: every extra user and every extra context token adds cache memory that must be re-read on every decode step. Memory bandwidth bot