ENH: speed up wide DataFrame.line plots by using a single LineCollection #61764
pandas/plotting/_matplotlib/core.py
threshold = 200  # switch when DataFrame has more than this many columns
can_use_lc = (
    not self._is_ts_plot()  # not a TS plot
    and not self.stacked  # stacking not requested
    and not com.any_not_none(*self.errors.values())  # no error bars
    and len(self.data.columns) > threshold
)
if can_use_lc:
I would prefer not to have a special casing like this because it's difficult to maintain parity between a "fast path" and the existing path.
Is there a way to refactor our plotting here to generalize the plotting to this form rather than the iterative approach below?
Thanks for the suggestion, @mroeschke
Removed the early-return fast path; use_collection now only decides how we draw after the shared loop, so there’s one unified code path.
Let me know if you’d like anything tweaked.
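The "shared loop, then decide how to draw" structure described above can be sketched roughly as follows. This is a minimal illustration against plain matplotlib, not the PR's actual code: `draw_columns` and its arguments are hypothetical names, and only `use_collection` is taken from the comment.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection


def draw_columns(ax, x, columns, use_collection):
    """Collect per-column data in one loop; the flag only picks the draw call.

    `draw_columns` is a hypothetical helper for illustration, not pandas API.
    """
    segments, labels = [], []
    for label, y in columns.items():  # single loop shared by both strategies
        labels.append(str(label))
        segments.append(np.column_stack([x, y]))  # (n_points, 2) array per column

    if use_collection:
        # One artist for all columns.
        ax.add_collection(LineCollection(segments))
        ax.autoscale_view()  # refresh view limits after adding the collection
    else:
        # One Line2D per column, as in the existing iterative approach.
        for seg, label in zip(segments, labels):
            ax.plot(seg[:, 0], seg[:, 1], label=label)
    return labels
```

With this shape, label handling and data preparation happen once, and the two strategies differ only in the final draw call.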
@jbrockmendel Done in the latest commit, thanks!
label_str = self._mark_right_label(pprint_thing(label), index=i)
kwds["label"] = label_str

if use_collection:
I'm still not generally fond of having a different code path if some condition is met, especially since the condition requires a magic-number threshold.
This is a reasonable concern. Is there a downside to always using LineCollection?
@mroeschke @jbrockmendel, if we want to get rid of the path split and the magic threshold number completely, I have to patch a few things:
- pandas/plotting/_matplotlib/core.py (LinePlot._make_plot)
  • Remove the current threshold (use_collection) condition.
  • Always render DataFrame line plots using a single LineCollection.
  • Add tiny proxy Line2D objects (invisible) to keep legends working as usual.
  • Stacked plots and error-bar plots remain unchanged (they use separate code paths already).
- pandas/plotting/_matplotlib/tools.py
  • Adjust get_all_lines to return segments from any existing LineCollection. (Needed for existing tests and autoscaling.)
  • Adjust get_xlim similarly to compute limits directly from the LineCollection vertices.
- pandas/tests/plotting/common.py and plotting tests
  • Update tests to handle the new structure. Instead of direct access like ax.lines[...], tests will use a helper function aware of the new single-collection setup.
- Documentation and Release Notes
  • Clearly note in docs/whatsnew that ax.lines will be empty for DataFrame line plots.
  • Users accessing line data directly should switch to pandas.plotting.get_all_lines(ax) or check ax.collections[0].
(No changes for Series plots or other plot types like scatter, area, bar, etc.)
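As a rough sketch of what the plan above implies on the matplotlib side: one LineCollection for all columns, proxy Line2D handles for the legend, and x limits read back from the collection's vertices. `plot_wide` and `xlim_from_collection` are hypothetical names for illustration, not the pandas helpers, and the proxies here are simply never added to the axes, which is one possible reading of "invisible".

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection
from matplotlib.lines import Line2D


def plot_wide(ax, x, ys, labels):
    """Draw every row of `ys` as one LineCollection with legend proxies.

    Hypothetical helper: a sketch of the approach, not pandas internals.
    """
    colors = [f"C{i % 10}" for i in range(len(ys))]  # default color cycle
    segments = [np.column_stack([x, y]) for y in ys]
    lc = LineCollection(segments, colors=colors)
    ax.add_collection(lc)
    ax.autoscale_view()  # refresh view limits after adding the collection

    # Proxy artists are passed to the legend only; ax.lines stays empty.
    proxies = [Line2D([], [], color=c, label=l) for c, l in zip(colors, labels)]
    ax.legend(handles=proxies)
    return lc


def xlim_from_collection(lc):
    """Compute (xmin, xmax) directly from the collection's vertices."""
    verts = np.concatenate(lc.get_segments())
    return verts[:, 0].min(), verts[:, 0].max()
```

Code that previously indexed `ax.lines[i]` would instead read `lc.get_segments()[i]`, which is the user-facing change the release note would need to call out.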
⸻
Advantages:
• One simple and predictable rendering path for all DataFrame line plots.
• Significant speed-up for large DataFrames, negligible overhead for small DataFrames.
• Lower memory use (single artist instead of many) and easier future maintenance.
⸻
Potential Downsides (but manageable):
• Users relying on ax.lines[i] directly must adapt (addressed clearly in docs and deprecation shim).
• Interactive plots using “picker” callbacks may need minor code updates.
• A small batch of tests will need straightforward adjustments.
If you're comfortable with this, I can start an implementation. Are there any additional concerns I should keep in mind before coding?
Happy to iterate!
@EvMossan the unfortunate situation is that there aren't any maintainers with expertise in matplotlib, so the idea of reviewing everything you described is daunting. Is there a minimal version?
@jbrockmendel, I’ve tried every variant I can think of, but I still can’t get a single-path implementation that both preserves the ~5× speed-up and passes the full test suite. At this point I’m stuck and would welcome any ideas or guidance.
> and passes the full test suite
How many/bad are the failures we're looking at? e.g. no one really cares about ax.lines[0] or whatever as long as the graphs look right.
The ideas that come to mind are 1) convince @mroeschke to be OK with multiple code paths, 2) ask a matplotlib maintainer for help, 3) spend a lot of time on this myself, 4) decide the affected tests are OK to change.
I'm hoping that 4 is viable. Keep in mind that if we go that route, you're tacitly volunteering to have me ping you next time an issue comes up in this part of the code/tests.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
What does this PR change?
- [...] DataFrame.plot(kind="line") when the frame is “wide”.
- [...] (RangeIndex or integer/float values), is not a time-series plot, has no stacking and no error bars, we now draw everything with a single matplotlib.collections.LineCollection instead of one Line2D per column.
- [...] cases above.
Performance numbers
df.plot(legend=False)
Benchmarked on pandas 3.0.0.dev0+2183.g94ff63adb2, matplotlib 3.10.3, NumPy 2.2.6
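For readers who want to get a feel for the comparison, a minimal timing harness might look like the sketch below. It is an assumption-laden stand-in: it times raw matplotlib drawing rather than df.plot, `time_draw` is a hypothetical name, and it does not reproduce the PR's benchmark setup or numbers.

```python
import time

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.collections import LineCollection


def time_draw(n_cols, n_rows=50, use_collection=False):
    """Return seconds spent building and rendering `n_cols` lines.

    Hypothetical harness: compares one Line2D per column against a single
    LineCollection on random data.
    """
    rng = np.random.default_rng(0)
    x = np.arange(n_rows)
    ys = rng.standard_normal((n_cols, n_rows))
    fig, ax = plt.subplots()
    start = time.perf_counter()
    if use_collection:
        segments = [np.column_stack([x, y]) for y in ys]
        ax.add_collection(LineCollection(segments))
        ax.autoscale_view()
    else:
        for y in ys:
            ax.plot(x, y)
    fig.canvas.draw()  # force rendering so draw cost is included
    elapsed = time.perf_counter() - start
    plt.close(fig)
    return elapsed
```

Calling `time_draw(n, use_collection=False)` and `time_draw(n, use_collection=True)` for increasing `n` gives a rough picture of how the two strategies scale; results will vary by machine and library version.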
Notes
- [...] DatetimeIndex plots; those remain on the original per-column path. A follow-up could combine LineCollection with the x_compat=True workaround (see #61398) to similarly speed up time-series plots.
- The threshold (> 200 columns) is a heuristic and can be tuned in review.
- [...] indices still use the original per-column draw, so behaviour there is unchanged.
- [...] DataFrame.plot using LineCollection #61532
- [...] (pytest pandas/tests/plotting -q)
- [...] (pre-commit run --all-files)
- [...] doc/source/whatsnew/v3.0.0.rst
cc @shadnikn @arthurlw, happy to take any feedback 🙂