ESQL: Limit Replace function memory usage #127924


Status: Open · wants to merge 8 commits into base: main

Conversation

@ivancea (Contributor) commented May 8, 2025

The Replace function's result string is now limited to 1MB, the same limit used by Repeat.
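As a plain-Java illustration (not the ES|QL implementation) of why a cap is needed: a small input plus a regex replacement can produce output far larger than the input. The constant here mirrors the PR's `MAX_RESULT_LENGTH`.

```java
// Illustration only: a 1KB input expanded to ~2MB by a single replaceAll,
// which is the kind of blowup the new 1MB cap prevents.
public class ReplaceBlowup {
    static final long MAX_RESULT_LENGTH = 1024 * 1024; // 1MB, mirroring MB.toBytes(1)

    public static void main(String[] args) {
        String input = "x".repeat(1_000);                   // 1KB input
        String replacement = "y".repeat(2_048);             // each char becomes 2KB
        String result = input.replaceAll(".", replacement); // ~2MB output
        System.out.println(result.length());                // 2048000
        assert result.length() > MAX_RESULT_LENGTH;         // would exceed the new cap
    }
}
```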

@ivancea ivancea requested review from nik9000 and alex-spies May 8, 2025 16:17
@ivancea ivancea added >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) :Analytics/ES|QL AKA ESQL v9.1.0 labels May 8, 2025
@elasticsearchmachine (Collaborator):

Hi @ivancea, I've created a changelog YAML for you.

@@ -121,15 +125,15 @@ public boolean foldable() {
return str.foldable() && regex.foldable() && newStr.foldable();
}

-@Evaluator(extraName = "Constant", warnExceptions = PatternSyntaxException.class)
+@Evaluator(extraName = "Constant", warnExceptions = IllegalArgumentException.class)
@ivancea (Contributor Author):

PatternSyntaxException is an IllegalArgumentException. If we choose another exception for this PR, we would have to restore PatternSyntaxException
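This hierarchy is easy to verify in plain Java; a minimal standalone check:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Confirms that PatternSyntaxException is an IllegalArgumentException,
// so warning on IllegalArgumentException still covers invalid patterns.
public class ExceptionHierarchy {
    public static void main(String[] args) {
        try {
            Pattern.compile("[unclosed"); // invalid regex: unclosed character class
        } catch (IllegalArgumentException e) {
            System.out.println(e instanceof PatternSyntaxException); // true
        }
    }
}
```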

@@ -39,6 +41,8 @@
public class Replace extends EsqlScalarFunction {
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(Expression.class, "Replace", Replace::new);

static final long MAX_RESULT_LENGTH = MB.toBytes(1);
@ivancea (Contributor Author):

This is based on the same Repeat logic:

Member:

We might want to shove this in some common spot. I have no idea what a good one is. And probably javadoc it too.

Contributor:

nit: We could make both use the same constant, then future readers can easily see that it's being used in 2 cases which tackle a common problem.

@ivancea (Contributor Author):

Moved it to ScalarFunction. I don't expect aggs to need this, but we could move it later

do {
m.appendReplacement(result, newStr);

if (result.length() > MAX_RESULT_LENGTH) {
@ivancea (Contributor Author) commented May 8, 2025:

We're checking after the replacement. This has a downside: if the replacement uses a group ($1), it could, theoretically, still generate a big string (e.g. Replace("<100 chars>", ".+", "$0<...x 100>") == 10,000 chars).
We could take this into account, and use something like:

int matchSize = m.end() - m.start();
int potentialReplacementSize = matchSize * newStr.length() / 2; // "$0" == 1 repetition of the match
int remainingStr = str.length() - m.end();
if (result.length() + potentialReplacementSize + remainingStr > MAX_RESULT_LENGTH) {
    throw ...;
}

Opinions? How predictive should we be?
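The group-expansion scenario described in that comment can be reproduced in plain Java: a 100-char input and a replacement made of 100 `$0` back-references yields a 10,000-char result, even though both arguments look small.

```java
// Demonstrates quadratic-style growth via group back-references:
// the whole match ($0) is pasted 100 times into the replacement.
public class GroupExpansion {
    public static void main(String[] args) {
        String input = "a".repeat(100);
        String replacement = "$0".repeat(100); // 100 copies of the whole match
        String result = input.replaceAll(".+", replacement);
        System.out.println(result.length()); // 10000
    }
}
```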

Contributor:

Hmm, the estimate matchSize * newStr.length() could be too conservative. If I have a 2KB input string, the regex matches the whole thing, and I want to replace it with another 1KB string (no group refs!), this would trigger as a false positive. Not sure how restrictive this is in practice.

It looks like appendReplacement only allocates memory through the string builder, even in case of group references, although I could've missed something; I only skimmed the implementation. But if we want to be precise, we could maybe inherit from StringBuilder and make it perform our stricter checks every time we append? I don't know if the effort is worth it.

Maybe a simpler approach is to count how many group references newStr actually has, but that could tank performance :/

@nik9000, what do you think?

Member:

I'm not sure either. I was chatting with @ivancea over slack. I kind of think we should take our gloves off and build this directly appending to a BreakingBytesRefBuilder. That way we're reusing the bytes and don't have to copy to encode to utf-8. We have the JVM as a testing oracle so we can clean room implement it. That's probably a few days of work though. All because we don't have proper monomorphization.

@ivancea (Contributor Author) commented May 9, 2025:

Just pushed this algorithm, taking into account the number of groups, so it's much more specific. Still quite conservative, but not that much, I think? And if you don't use groups, it's nearly pixel-perfect.
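A hedged sketch of that idea (not the exact code pushed in this PR): count the `$N` group references in the replacement, then bound each step by the literal replacement length plus one match-length per reference. Names like `replaceLimited` and `countGroupRefs` are illustrative; the sketch also ignores escaped `\$` sequences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: predictive size check per match, assuming each "$N" can expand
// to at most the full match. Simplified: escaped \$ is not handled.
public class PredictiveLimit {
    static final long MAX_RESULT_LENGTH = 1024 * 1024;

    static int countGroupRefs(String newStr) {
        int refs = 0;
        for (int i = 0; i + 1 < newStr.length(); i++) {
            if (newStr.charAt(i) == '$' && Character.isDigit(newStr.charAt(i + 1))) {
                refs++;
            }
        }
        return refs;
    }

    static String replaceLimited(String str, String regex, String newStr) {
        int groupRefs = countGroupRefs(newStr);
        Matcher m = Pattern.compile(regex).matcher(str);
        StringBuilder result = new StringBuilder();
        int prev = 0; // end of the previous match
        while (m.find()) {
            int matchSize = m.end() - m.start();
            // Worst case: appended so far + gap before match + literal replacement
            // + every group ref expanding to the full match + remaining tail.
            long worstCase = result.length() + (m.start() - prev) + newStr.length()
                + (long) groupRefs * matchSize + (str.length() - m.end());
            if (worstCase > MAX_RESULT_LENGTH) {
                throw new IllegalArgumentException("Result too large");
            }
            m.appendReplacement(result, newStr);
            prev = m.end();
        }
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceLimited("hello world", "o", "0")); // hell0 w0rld
    }
}
```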

@alex-spies (Contributor) left a comment:

Fix looks correct to me, thanks @ivancea!

if (result.length() > MAX_RESULT_LENGTH) {
throw new IllegalArgumentException("Creating strings with more than [" + MAX_RESULT_LENGTH + "] bytes is not supported");
}
} while (m.find());
Contributor:

Comparing this with the implementation of replaceAll, this seems to be equivalent, so I think we don't accidentally change the semantics. Nice.
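A quick standalone check of this equivalence (using a plain while loop; the PR's do/while behaves the same once an initial find() has succeeded):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Verifies that a manual find()/appendReplacement()/appendTail() loop
// produces the same result as String.replaceAll.
public class ReplaceAllEquivalence {
    public static void main(String[] args) {
        String str = "one fish, two fish";
        String regex = "fish";
        String newStr = "cat";

        Matcher m = Pattern.compile(regex).matcher(str);
        StringBuilder result = new StringBuilder();
        while (m.find()) {
            m.appendReplacement(result, newStr);
        }
        m.appendTail(result);

        System.out.println(result); // one cat, two cat
        System.out.println(result.toString().equals(str.replaceAll(regex, newStr))); // true
    }
}
```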

Comment on lines 309 to 313
List<Object> simpleData = testCase.getDataValues();
// Ensure we don't run this test with too much data that could take too long to process.
// The calculation "ramUsed * count" is just a hint of how much data the function will process,
// and the limit is arbitrary
assumeTrue("Input data too big", row(simpleData).ramBytesUsedByBlocks() * count < GB.toBytes(1));
Contributor:

Why's this needed? Did we make REPLACE super slow?

@ivancea (Contributor Author):

After I added the test with a lot of data to the ReplaceTests file, this specific case became very slow (3s -> 40s).

Anyway, I just moved the big cases to another file, so they're executed just once, same as RepeatStaticTests does. So this isn't needed anymore.

@ivancea ivancea marked this pull request as ready for review May 9, 2025 14:51
@elasticsearchmachine (Collaborator):

Pinging @elastic/es-analytical-engine (Team:Analytics)

Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.1.0
4 participants