Conversation
See #3753 for why Zero3 won't be supported in this implementation.
EricMichaelSmith left a comment
Seems reasonable - minor comments
parlai/core/params.py
Outdated
)
grp.add_argument(
    '--ddp-backend',
    choices=['ddp', 'zero2', 'zero3'],
Hmm should we even give 'zero3' as an option for the time being? (Don't really care either way)
parlai/utils/fsdp.py
Outdated
def should_sync_gradnorm(opt):
    """
    Indicates whether fp16 optimizer wrappers should cumulate over workers.
Nit: "accumulate"?
parlai/core/torch_agent.py
Outdated
    For models or optimizers that shard parameters, this ensures we sync.
    """
    if self.opt.get('ddp_backend', 'ddp') in ('zero2', 'zero3'):
Nit: should we pull in DEFAULT_DDP_BACKEND here?
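For illustration, a minimal sketch of that suggestion, assuming a DEFAULT_DDP_BACKEND constant as named in this thread (the helper name here is hypothetical, not the PR's code):

DEFAULT_DDP_BACKEND = 'ddp'


def uses_sharded_backend(opt):
    # Hypothetical helper: True when the configured backend shards gradients/optimizer state.
    return opt.get('ddp_backend', DEFAULT_DDP_BACKEND) in ('zero2', 'zero3')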
parlai/core/torch_generator_agent.py
Outdated
if (
    shared is None
    and is_distributed()
    and opt.get('ddp_backend', 'ddp') == 'ddp'
(same here about maybe using DEFAULT_DDP_BACKEND instead)
klshuster left a comment
really really cool. lots of nits though (and a few real questions 😄 )
if hasattr(self, 'model'):  # save model params
    if hasattr(self.model, 'module'):
    # did we wrap in a DistributedDataParallel
    if hasattr(self.model, 'module') and not is_fsdp(self.model):
nit: could make this a helper function too? like should_sync_gradnorm (not necessary of course)
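A sketch of such a helper, for illustration only (the names are hypothetical, and is_fsdp below is a stand-in for the PR's real utility):

def is_fsdp(module):
    # Stand-in for the PR's utility: detect a FullyShardedDataParallel wrapper.
    return module.__class__.__name__ == 'FullyShardedDataParallel'


def should_unwrap_model(model):
    # Only reach into .module for a vanilla DistributedDataParallel wrapper;
    # an FSDP-wrapped model handles its own state_dict gathering.
    return hasattr(model, 'module') and not is_fsdp(model)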
parlai/core/torch_generator_agent.py
Outdated
self.model = self.build_model()
with fsdp_utils.maybe_fsdp_wrap(opt):
    self.model = fsdp_utils.fsdp_wrap(self.build_model())
if self.fp16 and not fsdp_utils.should_use_fsdp(opt):
remember that bug with the instability stuff? is this not re-introducing it?
(because we moved the model.half() call?)
Okay I think this needs to use my utility should_delay_halving. Forgot this.
We haven't really moved the moment of halving. The operations between these two points don't do much, and the original code path should be about the same.
- We now halve on CPU instead of GPU, and then transfer. That's probably a small speedup in initialization, with maybe some small numerical differences.
- We apply model parallelism after halving. Probably a small speedup at initialization.
- We synchronize parameters after halving. Again, a small initialization speedup.
The catch is that FSDP expects the model pre-halved if we're doing safe optimization, and post-halved if we're doing memory-efficient. (Similar to the optimizer wrappers, it looks at the parameter types to decide what type the gradients will be.)
This is the desired pattern (a sketch of the corresponding check follows this list):
- If we're in Safe and using DDP, we SHOULD still halve, just as before
- If we're in MemEff and using DDP, we SHOULD still halve, just as before
- If we're in Safe and Zero2, we should NOT halve here
- If we're in MemEff and Zero2, we SHOULD halve here.
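A minimal sketch of how should_delay_halving could encode that pattern, matching the return statement quoted from parlai/utils/fsdp.py further down (assuming fp16_impl takes the values 'safe' and 'mem_efficient'):

def should_delay_halving(opt):
    # Delay the explicit model.half() only for Safe fp16 on a sharded backend,
    # where FSDP manages mixed precision itself; in every other case halve as before.
    return (
        opt.get('fp16', False)
        and opt.get('ddp_backend', 'ddp') in ('zero2', 'zero3')
        and opt.get('fp16_impl') == 'safe'
    )


# The four cases listed above:
assert not should_delay_halving({'fp16': True, 'fp16_impl': 'safe', 'ddp_backend': 'ddp'})
assert not should_delay_halving({'fp16': True, 'fp16_impl': 'mem_efficient', 'ddp_backend': 'ddp'})
assert should_delay_halving({'fp16': True, 'fp16_impl': 'safe', 'ddp_backend': 'zero2'})
assert not should_delay_halving({'fp16': True, 'fp16_impl': 'mem_efficient', 'ddp_backend': 'zero2'})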
def launch_and_train(opt, port):
def launch_and_train(opt, port=None):
will we ever specify a port here?
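For context on the OS-assigned free ports mentioned in the patch description, here is one illustrative way (not necessarily the PR's implementation) to obtain a port when none is passed in:

import socket


def find_free_port():
    # Ask the OS for a free port by binding to port 0, then read back the assignment.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(('', 0))
        return s.getsockname()[1]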
self.best_valid = new_valid
self.impatience = 0
if opt.get('model_file') and is_primary_worker():
if opt.get('model_file'):
just making sure I understand - we can get rid of this check because it's handled in save_model right?
We need to be able to do save_on_nonprimary_worker, actually.
if max_norm > 0:
    clip_coef = max_norm / (grad_norm + 1e-6)
    for p in params:
        p.grad.detach().mul_(clip_coef)
Don't want grads of grads! (This is in the original pytorch code too)
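To make that rationale concrete, an illustrative version of the clipping step (the function name is hypothetical and this mirrors torch.nn.utils.clip_grad_norm_ rather than the PR's exact code):

def clip_grads_(params, grad_norm, max_norm):
    # Scale gradients in place. Operating on p.grad.detach() keeps the scaling
    # out of the autograd graph, so we never create gradients of gradients.
    if max_norm > 0:
        clip_coef = min(max_norm / (grad_norm + 1e-6), 1.0)  # never scale grads up
        for p in params:
            if p.grad is not None:
                p.grad.detach().mul_(clip_coef)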
return
# zero3 not supported at this time. Throw an exception
if opt['ddp_backend'] == 'zero3':
i know this is just for overkill testing but it's not even a choice in the param options so we'll already error there if calling from command line
I'm leaving it for the future
parlai/utils/fsdp.py
Outdated
return (
    self.fp16
    and self.opt.get('ddp_backend', 'ddp') in ('zero2', 'zero3')
    and self.opt['fp16_impl'] == 'safe'
but if we're using mem_efficient we don't delay?
Correct, see main comment
Patch description
Add support for Fairscale's FullyShardedDataParallel (FSDP). This is an implementation of DeepSpeed's Zero2 optimization, wherein optimizer state and gradients are sharded across different workers in order to reduce memory usage. Switching to --ddp-backend zero2 results in about a 25% speedup in UPS (without bg workers; it can probably be a bit higher) and about a 50% reduction in memory usage. It's recommended that everyone switch to this for distributed training and use the savings to increase batchsize or lower the number of GPUs.
We also carve out support for Zero3, but cannot support it at this time due to high-level design in ParlAI. See #3753 for a detailed description of why, and how we might overcome this in the future.
As a side change, this also makes our unit tests use OS-assigned free ports, instead of randomized ones, to slightly improve the reliability of running our test suites. I tried pulling this into another PR, but got tired of dealing with stacking.
Testing steps
Manual tests. New CI.
Here are some screenshots from a sweep that contained both
--ddp-backend ddp and --ddp-backend zero2.