Why did Transformer models largely replace RNNs (recurrent neural networks) for language modelling?