Abstract: Transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), cannot process long sequences because their self-attention operation scales quadratically ...
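For context, the quadratic cost the abstract refers to arises from the attention score matrix; the following is a minimal sketch of standard scaled dot-product self-attention, with notation ($Q$, $K$, $V$, $n$, $d$) assumed here rather than taken from the abstract. For queries, keys, and values $Q, K, V \in \mathbb{R}^{n \times d}$ over a sequence of length $n$ with head dimension $d$,
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where the score matrix $Q K^{\top}$ has $n \times n$ entries, so time and memory grow as $O(n^2)$ in the sequence length.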