[Bug] Message may loss caused by broker expansion. #9307

humkum opened this issue Apr 1, 2025 · 2 comments

humkum (Contributor) commented Apr 1, 2025

Before Creating the Bug Report

  • I found a bug, not just asking a question, which should be created in GitHub Discussions.

  • I have searched the GitHub Issues and GitHub Discussions of this repository and believe that this is not a duplicate.

  • I have confirmed that this bug belongs to the current repository, not other repositories of RocketMQ.

Runtime platform environment

CentOS7.10

RocketMQ version

develop

JDK Version

1.8

Describe the Bug

Messages may be lost if a topic is expanded to a new broker while the consumer is stopped for a while. When the consumer comes back, it starts consuming from the max offset of the queues on the new broker, because that broker has no consumption offset recorded for the consumer.

Steps to Reproduce

  1. Stop the consumer for a while, or use an SDK with a bug that leaves it unable to perceive topic route changes.
  2. Expand the topic to a new broker, where no consume offset has been generated yet.
  3. Produce messages for a long time.
  4. (Re)start the consumer.

What Did You Expect to See?

The consumer consumes all messages on the newly added broker.

What Did You See Instead?

The consumer consumes from the max offset, so some messages are skipped.

Additional Context

After investigation, I found that when no consumption offset exists on the broker, the broker checks whether the difference between the maximum position of the commitLog and the physical offset of the message corresponding to the consumption offset exceeds 40% of the memory size. If it does, an exception is thrown and no consumption offset is returned. When the client cannot obtain a consumption offset, it falls back to the client-side ConsumeFromWhere setting, which by default starts consumption from maxOffset.
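
The decision described above can be sketched as a small standalone simulation. This is only an illustration of the reported behavior, not RocketMQ's actual code: the class, method, and parameter names are hypothetical, the 40% ratio and the fallback to maxOffset are taken from the description above, and the memory size and offsets in `main` are made-up numbers.

```java
public class OffsetFallbackSketch {

    /** Hypothetical stand-in for the client-side ConsumeFromWhere setting. */
    public enum StartPolicy { LAST_OFFSET, FIRST_OFFSET }

    /** Sentinel meaning "the broker returned no consumption offset". */
    public static final long NOT_FOUND = -1L;

    /**
     * Broker side, as described in the issue: if no offset is stored for the
     * consumer, compare the commitLog backlog behind the candidate message
     * with a fraction of memory. Too large a gap means no offset is returned.
     */
    public static long brokerQueryOffset(long storedOffset,
                                         long candidateQueueOffset,
                                         long commitLogMaxPhysicalOffset,
                                         long candidatePhysicalOffset,
                                         long totalMemoryBytes,
                                         int memoryRatioPercent) {
        if (storedOffset >= 0) {
            return storedOffset; // an offset was already recorded for this consumer
        }
        long threshold = totalMemoryBytes * memoryRatioPercent / 100;
        long gap = commitLogMaxPhysicalOffset - candidatePhysicalOffset;
        if (gap > threshold) {
            return NOT_FOUND; // the issue reports an exception at this point
        }
        // Assumption: when the gap is small, the candidate offset is handed back.
        return candidateQueueOffset;
    }

    /** Client side: use the broker's answer, else fall back to the start policy. */
    public static long clientStartOffset(long brokerAnswer,
                                         StartPolicy policy,
                                         long queueMinOffset,
                                         long queueMaxOffset) {
        if (brokerAnswer != NOT_FOUND) {
            return brokerAnswer;
        }
        return policy == StartPolicy.LAST_OFFSET ? queueMaxOffset : queueMinOffset;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long memory = 16 * gib;          // hypothetical broker memory
        long queueMax = 5_000_000L;      // hypothetical messages produced while the consumer was down

        // Long outage: 10 GiB written to the commitLog since the queue's first message.
        long answer = brokerQueryOffset(-1L, 0L, 10 * gib, 0L, memory, 40);
        long start = clientStartOffset(answer, StartPolicy.LAST_OFFSET, 0L, queueMax);
        System.out.println("long outage, start offset = " + start);

        // Short outage: only 1 GiB written, so the candidate offset is returned.
        answer = brokerQueryOffset(-1L, 0L, 1 * gib, 0L, memory, 40);
        start = clientStartOffset(answer, StartPolicy.LAST_OFFSET, 0L, queueMax);
        System.out.println("short outage, start offset = " + start);
    }
}
```

In the long-outage case the sketch starts at the queue's max offset, which corresponds to the skipped messages reported here. In the short-outage case it starts from the candidate offset and nothing is skipped.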

gaoyf (Contributor) commented Apr 2, 2025

> After investigation, I found that when no consumption offset exists on the broker, the broker checks whether the difference between the maximum position of the commitLog and the physical offset of the message corresponding to the consumption offset exceeds 40% of the memory size. If it does, an exception is thrown and no consumption offset is returned.

  1. Doesn't this just show that RocketMQ has considered everything? FYI.
  2. How long has your consumer been stalled so that messages exceed 40% of memory?

If your consumer cannot perceive the routing change, you should investigate the consumer, not the broker. Expansion is not the only cause: many situations can change the routing, and a consumer that cannot perceive those changes will run into various problems.

humkum (Contributor, Author) commented Apr 2, 2025

This does not happen only when the consumer cannot perceive the routing change. As I said, the problem also occurs when the consumer is stopped for a while during the expansion.
