Feature request
Is it possible to build a LoRA that doesn't inject into the transformer? This would allow reusing the same base transformer with multiple adapters in the same process while saving GPU memory (probably at the expense of some speed).
Motivation
We've started using PEFT with LoRA for tasks such as sentiment analysis and constituency parsing in Stanza, and one thing we found is that there are currently no memory savings compared to using fully finetuned transformers.
For example, if the transformer loaded for sentiment analysis takes 3GB, then with no finetuning we can reuse the same transformer weights for constituency parsing, making for a total of 3GB plus the prediction heads of the two models. If we use fully finetuned transformers, that obviously increases to 6GB, assuming those are our only two tasks.
AFAIK, PEFT with LoRA uses inject_adapter_in_model to update the model's modules with the A and B matrices, meaning that loading those two models still takes 6GB. If we could have a version of the transformer that does inference with the A and B matrices not injected, but instead wrapping the base transformer's tensors, it would almost certainly be noticeably slower, but it would allow for a much smaller memory footprint.
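To make the idea concrete, here is a minimal sketch of what we have in mind (purely hypothetical, not an existing PEFT API): a LoRA linear layer that keeps a reference to the shared, frozen base layer and adds the low-rank update at forward time instead of injecting it into the base module.

```python
import torch
import torch.nn as nn

class SharedBaseLoRALinear(nn.Module):
    """Hypothetical non-injecting LoRA layer: computes base(x) + scaling * B(A(x))
    while holding a reference to the shared base layer rather than copying or
    modifying its weights."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_linear  # shared and frozen, not duplicated per adapter
        self.lora_A = nn.Linear(base_linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # usual LoRA init: B starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        # Base weights stay untouched, so many adapters can share one base model.
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Two task-specific adapters reusing the same base layer in one process
base = nn.Linear(768, 768)  # stand-in for one layer of the shared transformer
sentiment_layer = SharedBaseLoRALinear(base, r=8)
parser_layer = SharedBaseLoRALinear(base, r=8)
x = torch.randn(1, 768)
y_sentiment, y_parser = sentiment_layer(x), parser_layer(x)
```

Since both adapters hold a reference to the same base module, the base weights exist in memory only once; only the small A/B matrices and the task-specific heads are duplicated. (A real implementation would also need to keep the base weights out of each adapter's state_dict and handle the other module types PEFT targets.)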
Thanks for the extremely useful library, BTW
Your contribution
I probably don't have much time to investigate this in the next couple of months, but in the long term it is something I could attempt with some guidance on where to look.