ESQL: Limit Replace function memory usage #127924
base: main
Conversation
Hi @ivancea, I've created a changelog YAML for you.
@@ -121,15 +125,15 @@ public boolean foldable() {
         return str.foldable() && regex.foldable() && newStr.foldable();
     }

-    @Evaluator(extraName = "Constant", warnExceptions = PatternSyntaxException.class)
+    @Evaluator(extraName = "Constant", warnExceptions = IllegalArgumentException.class)
`PatternSyntaxException` is an `IllegalArgumentException`. If we choose another exception for this PR, we would have to restore `PatternSyntaxException`.
@@ -39,6 +41,8 @@
 public class Replace extends EsqlScalarFunction {
     public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(Expression.class, "Replace", Replace::new);

+    static final long MAX_RESULT_LENGTH = MB.toBytes(1);
This is based on the same Repeat logic (line 43 in c990377):

    static final long MAX_REPEATED_LENGTH = MB.toBytes(1);
We might want to shove this in some common spot. I have no idea what a good one is. And probably javadoc it too.
nit: We could make both use the same constant, then future readers can easily see that it's being used in 2 cases which tackle a common problem.
Moved it to `ScalarFunction`. I don't expect aggs to need this, but we could move it later.
do {
    m.appendReplacement(result, newStr);

    if (result.length() > MAX_RESULT_LENGTH) {
We're checking after the replacement. This has a downside: if the replacement uses a group (`$1`), it could, theoretically, still generate a big string. E.g. `Replace("<100 chars>", ".+", "$0<...x 100>")` could yield 10,000 chars.

We could take this into account, and use something like:

    int matchSize = m.end() - m.start();
    int potentialReplacementSize = matchSize * matchSize / 2; // "$0" == 1 repetition
    int remainingStr = str.length() - m.end();
    if (result.length() + potentialReplacementSize + remainingStr > MAX_RESULT_LENGTH) {
        throw ...;
    }

Opinions? How predictive should we be?
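For context, the check-after-append flow under discussion can be sketched in plain Java. The class and method names and the raw `1024 * 1024` constant are illustrative stand-ins, not the PR's exact code (the PR uses `MB.toBytes(1)` and lives inside a generated evaluator):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceLimitSketch {
    // Illustrative 1 MB limit, standing in for MB.toBytes(1).
    static final long MAX_RESULT_LENGTH = 1024 * 1024;

    // Sketch of the append loop: the length check runs only AFTER each
    // replacement is appended, which is the downside discussed above.
    static String replaceWithLimit(String str, String regex, String newStr) {
        Matcher m = Pattern.compile(regex).matcher(str);
        if (m.find() == false) {
            return str;
        }
        StringBuilder result = new StringBuilder();
        do {
            m.appendReplacement(result, newStr);
            if (result.length() > MAX_RESULT_LENGTH) {
                throw new IllegalArgumentException(
                    "Creating strings with more than [" + MAX_RESULT_LENGTH + "] bytes is not supported"
                );
            }
        } while (m.find());
        m.appendTail(result);
        return result.toString();
    }
}
```

A single `appendReplacement` call can overshoot the limit before the check runs, which is exactly the group-expansion concern raised here.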
Hmm, the estimate `matchSize * newStr.length()` could be too conservative. If I have a 2kb input string, the regex matches the whole thing, and I want to replace it with another 1kb string (no group refs!), this would trigger as a false positive. Not sure how restrictive this is in practice.

It looks like `appendReplacement` only allocates memory through the string builder even in the case of group references, although I could've missed something; I only skimmed the implementation. But if we want to be precise, we could maybe inherit from StringBuilder and make it perform our stricter checks every time we append. I don't know if the effort is worth it.

Maybe a simpler approach is to count how many group references `newStr` actually has, but that could tank performance :/

@nik9000, what do you think?
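A rough sketch of that counting idea (a hypothetical helper, not code from the PR): scan `newStr` once for unescaped `$<digit>` references. Real `Matcher` replacement syntax also allows multi-digit indices and named groups like `${name}`, which this simplified version ignores:

```java
public class GroupRefCounter {
    // Count unescaped "$<digit>" group references in a replacement string.
    // Simplified: ignores multi-digit indices and named groups like ${name}.
    static int countGroupReferences(String newStr) {
        int count = 0;
        for (int i = 0; i < newStr.length() - 1; i++) {
            char c = newStr.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character, e.g. "\$" is a literal dollar
            } else if (c == '$' && Character.isDigit(newStr.charAt(i + 1))) {
                count++;
            }
        }
        return count;
    }
}
```

The scan is linear in the replacement string, so the performance worry is about doing it per row rather than once per query.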
I'm not sure either. I was chatting with @ivancea over Slack. I kind of think we should take our gloves off and build this by appending directly to a `BreakingBytesRefBuilder`. That way we're reusing the bytes and don't have to copy to encode to UTF-8. We have the JVM as a testing oracle, so we can clean-room implement it. That's probably a few days of work, though. All because we don't have proper monomorphization.
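The "JVM as a testing oracle" idea could look roughly like this: run the clean-room implementation and the JDK's `replaceAll` on random inputs and compare. `customReplace` below is a placeholder that delegates to the JDK just so the harness is runnable; in the real work it would be the `BreakingBytesRefBuilder`-based implementation:

```java
import java.util.Random;
import java.util.regex.Pattern;

public class OracleCheck {
    // Placeholder for the clean-room implementation under test.
    static String customReplace(String str, String regex, String newStr) {
        return Pattern.compile(regex).matcher(str).replaceAll(newStr);
    }

    // Compare against the JDK oracle on short random strings over {a, b, c},
    // using a pattern and a group-referencing replacement.
    static boolean agreesWithJdk(long seed, int rounds) {
        Random random = new Random(seed);
        for (int i = 0; i < rounds; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < 20; j++) {
                sb.append((char) ('a' + random.nextInt(3)));
            }
            String str = sb.toString();
            String expected = str.replaceAll("ab", "X$0Y");
            if (expected.equals(customReplace(str, "ab", "X$0Y")) == false) {
                return false;
            }
        }
        return true;
    }
}
```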
Just pushed this algorithm, taking into account the number of groups, so it's much more specific. Still quite conservative, but not by much, I think? And if you don't use groups, it's nearly pixel-perfect.
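A hedged sketch of what such a group-aware estimate might look like (the merged formula may differ; all names here are illustrative): bound the worst-case appended size by the replacement's literal length plus `matchSize` per group reference, plus the unprocessed tail of the input:

```java
public class PredictiveCheck {
    // Worst-case size estimate for one appendReplacement call: every group
    // reference can expand to at most the whole match, and everything after
    // the match may still be appended verbatim by appendTail.
    static boolean wouldExceedLimit(
        int resultLength,
        int matchSize,
        int literalLength,
        int groupRefs,
        int remainingStr,
        long maxResultLength
    ) {
        long potentialReplacementSize = (long) literalLength + (long) groupRefs * matchSize;
        return resultLength + potentialReplacementSize + remainingStr > maxResultLength;
    }
}
```

With zero group references the estimate collapses to the replacement's literal length, which is why the no-groups case can be near exact.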
Fix looks correct to me, thanks @ivancea!
    if (result.length() > MAX_RESULT_LENGTH) {
        throw new IllegalArgumentException("Creating strings with more than [" + MAX_RESULT_LENGTH + "] bytes is not supported");
    }
} while (m.find());
Comparing this with the implementation of `replaceAll`, this seems to be equivalent, so I think we don't accidentally change the semantics. Nice.
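That equivalence can be checked directly: a `find()`/`appendReplacement`/`appendTail` loop with this shape (shown here without the limit, for brevity; names are illustrative) produces the same output as `replaceAll`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllEquivalence {
    // Same loop shape as the code above, minus the length check.
    static String loopReplace(String str, String regex, String newStr) {
        Matcher m = Pattern.compile(regex).matcher(str);
        if (m.find() == false) {
            return str;
        }
        StringBuilder result = new StringBuilder();
        do {
            m.appendReplacement(result, newStr);
        } while (m.find());
        m.appendTail(result);
        return result.toString();
    }
}
```

The leading `find()` also lets the no-match case return the input without allocating a builder at all.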
List<Object> simpleData = testCase.getDataValues();
// Ensure we don't run this test with so much data that it takes too long to process.
// The calculation "ramUsed * count" is just a hint of how much data the function will process,
// and the limit is arbitrary.
assumeTrue("Input data too big", row(simpleData).ramBytesUsedByBlocks() * count < GB.toBytes(1));
Why's this needed? Did we make REPLACE super slow?
As I added the test with a lot of data to the ReplaceTests file, this specific case was very slow (3s -> 40s). Anyway, I just moved the big cases to another file, so they're executed just once, same as `RepeatStaticTests` does. So this isn't needed anymore.
…oing the actual replacement
Pinging @elastic/es-analytical-engine (Team:Analytics)
The Replace string result limit was fixed at 1MB, the same as Repeat.