[8.17](backport #41817) [aws] [s3] Introduce ignore_older & start_timestamp for S3 input allowing better registry cleanups #42717
Proposed commit message
Introduce `ignore_older` and `start_timestamp` properties to AWS S3 input. This is a follow-up for #41694.
The configurations introduced here act as input object filters. If the object fails to match derived filters, the entries will be cleaned up from the registry, reducing filebeat memory consumption.
Introduced configurations are `ignore_older` and `start_timestamp`.
For both configurations, the object's last modified timestamp is used for the comparison. See the Use cases section for further explanation.
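The filtering described above can be sketched roughly as follows. This is an illustrative Python sketch, not the Go code from this PR: `FilterConfig`, `is_accepted` and `prune_registry` are hypothetical names, the registry is modelled as a plain dict keyed by object name, and the exact boundary handling (an object modified exactly at the cutoff is accepted) is an assumption. When both options are set, the sketch simply applies both checks and does not model the first-run nuance described under Disruptive User Impact.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


@dataclass
class FilterConfig:
    # Hypothetical container; both options default to disabled (None).
    start_timestamp: Optional[datetime] = None
    ignore_older: Optional[timedelta] = None


def is_accepted(last_modified: datetime, cfg: FilterConfig, now: datetime) -> bool:
    """Return True if an object's last modified time passes the configured filters."""
    if cfg.start_timestamp is not None and last_modified < cfg.start_timestamp:
        return False  # modified before start_timestamp
    if cfg.ignore_older is not None and last_modified < now - cfg.ignore_older:
        return False  # outside the look-back window measured from `now`
    return True


def prune_registry(
    registry: Dict[str, datetime], cfg: FilterConfig, now: datetime
) -> Dict[str, datetime]:
    """Keep only registry entries whose objects still pass the filters."""
    return {key: lm for key, lm in registry.items() if is_accepted(lm, cfg, now)}


# Three objects with increasing last modified timestamps (t1 < t2 < t3).
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
registry = {
    "object-a": now - timedelta(days=10),  # t1
    "object-b": now - timedelta(days=3),   # t2
    "object-c": now - timedelta(hours=1),  # t3
}

# ignore_older of 48h keeps only objects modified within the last two days.
recent = prune_registry(registry, FilterConfig(ignore_older=timedelta(hours=48)), now)
print(sorted(recent))  # ['object-c']

# start_timestamp five days ago keeps objects modified at or after that point.
since = prune_registry(
    registry, FilterConfig(start_timestamp=now - timedelta(days=5)), now
)
print(sorted(since))  # ['object-b', 'object-c']
```

Because the `ignore_older` cutoff moves forward with each polling run, entries that were accepted earlier eventually fall outside the window and are dropped, which is what keeps the registry from growing without bound in this model.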
Checklist
`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Disruptive User Impact
None, as the defaults are disabled. However, when the configurations introduced here are used, the following can have an impact on the user:
- If `start_timestamp` is defined, then objects with last modified timestamps prior to the timestamp are ignored from processing (documented, see footnote 1).
- If `ignore_older` is defined, then objects that do not fall within the look-back period when processing starts (polling run) are ignored (documented, see footnote 1).
- If both `start_timestamp` & `ignore_older` are defined, the initial run will process all entries up to `start_timestamp`. The subsequent runs will not include entries that do not fall within `ignore_older`, even if processing failed for an object (documented, see footnote 1).
How to test this PR locally
Configure `ignore_older` & `start_timestamp` to see how data ingestion changes with their values. See the Use cases section for further explanation.
Related issues
`aws-s3` input's bucket polling accumulates state in the registry #39116
Use cases
Consider the diagrams below, where there are three objects (Object A, Object B and Object C) with last modified timestamps t1, t2 and t3.
Then consider how filebeat processes objects and tracks registry entries in the following scenarios.
Default behavior
If none of the configurations are used, filebeat will process all objects, and the internal registry will keep tracking them unless they are removed from the bucket.
Use start_timestamp
If `start_timestamp` is used, objects newer than the timestamp are accepted for processing. The registry will grow unless objects are removed from the bucket by other means (e.g. a lifecycle policy).
Use ignore_older
If `ignore_older` is defined, the input will process objects within the provided duration, calculated from the current time. The registry will track objects within the current timeframe, and the others will eventually get cleaned up by subsequent runs.
Use both ignore_older & start_timestamp
If both properties are defined,
`ignore_older` duration).
`ignore_older` duration.
This is an automatic backport of pull request #41817 done by [Mergify](https://mergify.com).
Footnotes
1. https://github.com/elastic/beats/pull/41817/files#diff-422765b7341c5bbf6de7af38927e34e00a5073b188585a7af3c4fee1175b64a6
2. https://github.com/Kavindu-Dodan/data-gen