What is the core mechanism that makes the Transformer architecture so effective for language tasks?