This model inherits from PreTrainedModel. Check the superclass documentation for the generic strategies the
working on byte-sized tokens, transformers scale poorly because every token must "attend to" every other token
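To make the scaling concern concrete, the following is a minimal sketch, not taken from any library, that counts how many token pairs full self-attention has to score for one sequence. The helper name `attention_pairs` and the sample sentence are hypothetical, and whitespace splitting is only a crude stand-in for a real subword tokenizer.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention over one sequence."""
    # Every token is compared with every token (itself included): n * n.
    return seq_len * seq_len


# Hypothetical sample text, used only for illustration.
text = "every token attends to every other token in the sequence"

# Byte-level tokenization: one token per UTF-8 byte.
byte_tokens = list(text.encode("utf-8"))

# Assumption: whitespace splitting approximates a coarser subword tokenizer.
word_tokens = text.split()

byte_pairs = attention_pairs(len(byte_tokens))
word_pairs = attention_pairs(len(word_tokens))

print(f"byte tokens: {len(byte_tokens)}, pairs: {byte_pairs}")
print(f"word tokens: {len(word_tokens)}, pairs: {word_pairs}")
print(f"pair ratio (bytes / words): {byte_pairs / word_pairs:.1f}")
```

Because the pair count grows with the square of the sequence length, doubling the number of tokens quadruples the work, which is why the longer sequences produced by byte-level tokenization are costly for full attention.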