Skip to content

[Ratelimit Processor] Instrument the ratelimiter service with telemetry metrics #562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 37 commits into
base: main
Choose a base branch
from

Conversation

gizas
Copy link
Contributor

@gizas gizas commented May 2, 2025

This is the initial implementation for https://github.com/elastic/hosted-otel-collector/issues/513

The code adds a newMetrics reader in the ratelimitprocessor.

At the moment we have implemented only ratelimitRequests metric.Int64Counter
Also the newMetricsReader makes use of the global otel processor that we assume that is initialized upon processor usage.

To be reviewed:

  • if err != nil: WithReason = "request_error" and WithDecision = "accepted"
  • if n := len(responses); n != 1: WithReason ="request_error" and WithDecision = "accepted"
  • if n := len(responses)==1 and resp.GetError() != "": WithReason = "limit_error" and WithDecision ="accepted"
  • if gubernator.Status_OVER_LIMIT: WithReason = "over_limit" and WithDecision ="throttled"
  • Default: "WithReason" = "under_limit" and "WithDecision" ="accepted"

Sample document

After running make smoketest from the hosted-otel-collector repo, I can see this document in Kibana:

{
  "_index": ".ds-metrics-apm.app.hosted_otel_collector-default-2025.04.30-000001",
  "_id": "4sKIsJYBgg67ctLl7qDw",
  "_version": 1,
  "_source": {
    "@timestamp": "2025-05-08T15:34:45.244Z",
    "agent": {
      "name": "otlp",
      "version": "unknown"
    },
    "data_stream": {
      "dataset": "apm.app.hosted_otel_collector",
      "namespace": "default",
      "type": "metrics"
    },
    "labels": {
      "orchestrator_cluster_name": "default",
      "orchestrator_deploymentslice": "",
      "orchestrator_environment": "default",
      "processor_id": "ratelimit",
      "ratelimit_decision": "accepted",
      "reason": "under_limit",
      "x-elastic-project-id": "local"
    },
    "metricset": {
      "name": "app"
    },
    "numeric_labels": {
      "limit_threshold": 0
    },
    "observer": {
      "hostname": "apm-server-apm-server-56d7f679df-fcqdg",
      "type": "apm-server",
      "version": "8.17.0"
    },
    "otelcol_ratelimit": {
      "requests": 6
    },
    "service": {
      "framework": {
        "name": "github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor"
      },
      "language": {
        "name": "unknown"
      },
      "name": "hosted-otel-collector",
      "node": {
        "name": "ed945a71-f853-42d9-a324-14092609edd9"
      },
      "version": "git"
    }
  }
}

gizas added 2 commits May 2, 2025 11:12
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
@gizas gizas requested a review from a team as a code owner May 2, 2025 11:41
@gizas gizas requested review from jackshirazi and edmocosta May 2, 2025 11:41
gizas added 2 commits May 2, 2025 14:43
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
gizas added 3 commits May 2, 2025 15:39
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
gizas added 3 commits May 2, 2025 16:26
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
@gizas gizas requested review from a team as code owners May 2, 2025 13:38
gizas added 10 commits May 2, 2025 16:39
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me. Some thoughts on additional metrics that we would need to introduce

  • Request duration
  • Can we measure the no of requests that are over the limits at a given time? I believe we can by using the decision dimension, just confirm if thats is the case.
  • Concurrent requests

@lahsivjar Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

}

func newGubernatorRateLimiter(cfg *Config, set processor.Settings) (*gubernatorRateLimiter, error) {
var behavior int32

telemetryBuilder, err := metadata.NewTelemetryBuilder(set.TelemetrySettings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add this also to the local rate limiter, to make sure we track all the incoming requests to this component. I would move this out.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vigneshshanmugam Could you explain a bit more here please? Sorry I'm lost 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was definitely without much context, apologeis. What I meant was to add it to

func (r *localRateLimiter) RateLimit(ctx context.Context, hits int) error {
as well as there is way to disable gubernator and use local in-memory ratelimier instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

Currently in MIS we collect only 2 metrics from the rate limiter (ref):

  1. A metric to track the status of the ingest requests. This metric has dimensions with bounded cardinality to record the decision and the reason for the decision by the rate limiter service.
  2. A metric to track the update requests sent between gubernator instances to update them. This metric has just one dimension of status.

While I like the idea of collecting request duration, however, I am not sure how useful concurrent requests would be. We should also be wary of adding too many metrics in hot path, as it does add some overhead.

Co-authored-by: Vignesh Shanmugam <vignesh.shanmugam22@gmail.com>
Signed-off-by: Andreas Gkizas <andreas.gkizas@elastic.co>
gizas and others added 8 commits May 5, 2025 22:47
@kaiyan-sheng kaiyan-sheng self-assigned this May 7, 2025
monotonic: true
attributes: ["project_id", "processor_id", "ratelimit_decision", "limit_threshold", "reason"]

attributes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see this comment #562 (comment), We dont want to expose the project_id etc details, instead keep this to attributes being derived from metadata_keys. Not sure if thats something possible with the metadata schema, but we can come back to later as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh sorry I missed it. Will remove project_id. Yeah I dont think the current mdatagen supports dynamic attribute keys. Can look into it more later.
How about processor_id then? ProcessorID is from telemetry.WithProcessorID(r.set.ID.String()).

@kaiyan-sheng
Copy link

kaiyan-sheng commented May 7, 2025

Todo list in followup PR:

  • Look into map type attribute for metadata_keys
  • Add more telemetry metrics like request duration and etc

@kaiyan-sheng kaiyan-sheng changed the title [Ratelimit Processor] Instrument the ratelimiter service with custom metrics/ tracer [Ratelimit Processor] Instrument the ratelimiter service with telemetry metrics May 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants