Documentation

Monitor

About monitoring the control plane

A set of tools can optionally be deployed:

  • monitoring leveraging Prometheus metrics collector, and Grafana Dashboard, to display metrics about the resource usage of Aporeto control plane
  • tracing leveraging jaeger tracing, to be able to trace and debug issues within the REST API.

In this section you will learn how to monitor your Aporeto control plane.

Monitoring

Enable monitoring

The first time you create your voila environment if you did not set any URL for the monitoring facilities, they are not deployed.

You can enable the different facilities as explained below:

To enable tracing facility:

set_value global.public.tracing https://tracing.mycompany.tld override

This will deploy the tracing facility that will allow you look at every API request done to the control plane. In general that’s a pretty advanced facility to help during development or analyze slow requests.

To enable metrics facility:

set_value global.public.monitoring https://tracing.mycompany.tld override

This will deploy Prometheus to scrape service data and serve it through Grafana dashboards. It also enables alerts (see below).

To enable logging facility:

set_value global.public.logging https://tracing.mycompany.tld override

This will deploy Loki to scrape all service logs. We recommend enabling this only during development or to analyze crashes.

To deploy a metrics proxy that can be used to remotely query alerts and metrics and perform Prometheus federation:

set_value global.public.metrics https://tracing.mycompany.tld override

NOTE

You can pick which facility you want to to deploy depending on what your needs are.

Then run:

doit

Monitoring dashboards

Depending on what monitoring tools you have enabled, you can access one or both of the following dashboards:

  • the metrics / logging dashboard through Grafana (the one set via global.public.monitoring)
  • the tracing dashboard through Jaeger Query console (the one set via global.public.tracing)

To access them you need to have an auditer certificate.

Using the gen-auditer tool from your voila environment.

This will generate a p12 certificate file in certs/auditers that you will need to import on your workstation.

Then you can visit the URL set in the configuration:

Get the monitoring dashboard URL:

get_value global.public.monitoring

Get the tracing dashboard URL:

get_value global.public.tracing

Dashboard metrics

Several metrics are collected using prometheus and shown as dashboards through grafana

Grafana

You can find as dashboards:

  • Aporeto general overview that will give you the general state of the Aporeto control plane (above)
  • Backend API will provide details about API calls made to the control plane.
  • Backend services to focus on specific microservices (mostly used by Aporeto engineering)
  • Mongodb status to monitor MongoDB database (mostly used by Aporeto engineering)
  • Resources allocation and Kubernetes node usage to get information about resources used and node usage.

NOTE

You can get logged in as an admin in Grafana if needed. The username is admin and you can get the password with get_value global.accounts.grafana.pass.

By default there is no data persistency on the dashboards, if you want to perform some persistent changes, you can enable the persistency by adding storage to Grafana with:

  • set_value storage.class <sc> grafana override

Where <sc> is the storage class of your Kubernetes cluster.

Then update the Grafana deployment with

snap -u grafana --force

Alerting

Using the metrics gathered some alerts are defined to check the health of your Aporeto control plane and report issues.

Alerts about the control plane:

  • Certificate about to expire: The certificate used for public access is about to expire
  • API down: The public API looks down or there is no clients connected.
  • Internal API down: The internal API is down
  • Aporeto service crashed: A service crashed (should be reported to engineering)
  • Aporeto service restarted: A service restarted
  • Aporeto service Down: A service is down
  • Aporeto slow response: A service is slow
  • Aporeto error (>=500): A service mishandled a request (should be reported to engineering)
  • Infra service down: An infra service is down
  • MongoDB node down: A MongoDB node is down
  • Low space left on volume: A service almost exhausted its storage (this must be addressed)

Alerts about Kubernetes HPA (auto scaling) and nodes auto scaling

  • Node auto scale info: When a new node is added or removed
  • Service auto scale info: When a service is scaling up or down.
  • Service auto scaled max: When a service hit the scale limit.

NOTE

Alerts can be sent to a Slack channel by configuring the following:

set_value global.integrations.slack.webhook "https://hooks.slack.com/services/XXX/YYY/ZZZ"
set_value global.integrations.slack.channel "#mychannel"

Then update the Prometheus deployment with

snap -u prometheus-aporeto --force

Tracing

The tracing tool allow you to trace a request, this is used by engineering to narrow down issues with services.

tracing

Logging

Services logs are also available through loki. Accessible via grafana using the explore function.

Metrics proxy

Aporeto uses Prometheus to gather statistics on all microservices and Kubernetes endpoints.

The metrics proxy allows you to expose those metrics to perform Prometheus federation for instance.

You will need to generate a client certificate that will be used to access the Prometheus federation endpoint:

Generate a client certificate with the following command:

gen-colonoscope

Now you will need to configure the client that will scrape the data from the Prometheus federation endpoint with the following parameters:

  • the metrics endpoint you set above (get_value global.public.metrics)
  • the Prometheus endpoint certificate authority (located in certs/ca-chain-public.pem) the client certificate generated above (located in certs/colonoscopes/<name>-cert.pem) the client key associated to the client certificate (located in certs/colonoscopes/<name>-key.pem)

Once done the following endpoints will be available:

  • https://<fqdn>/alertmanager/api/v1/alerts:

(where <fqdn> is what you configured as global.public.metrics).

This can be used to pull alerts as documented on Prometheus website.

Example of output:

  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "Backend service restarted",
        "color": "warning",
        "exported_pod": "canyon-75c9f966dc-g7rgj",
        "icon": ":gear:",
        "prometheus": "default/aporeto",
        "reason": "Completed",
        "recover": "false",
        "severity": "severe"
      },
      "annotations": {
        "summary": "canyon-75c9f966dc-g7rgj restarted. Reason: Completed."
      },
      "startsAt": "2020-01-22T20:59:10.521318559Z",
      "endsAt": "2020-01-22T21:02:10.521318559Z",
      "generatorURL": "http://prometheus-aporeto-0:9090/graph?g0.expr=%28%28sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total%29+-+sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total+offset+1m%29%29+%21%3D+0%29+-+on%28exported_pod%29+group_right%28%29+count+by%28exported_pod%2C+reason%29+%28kube_pod_container_status_last_terminated_reason+%3E+0%29&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "norecover"
      ],
      "fingerprint": "5a483f5586d6de87"
    }
  ]
}
  • https://<fqdn>/prometheus/federate: used to federate Prometheus instances together.

Example result:

curl -k https://<fqdn>/prometheus/federate --cert ./certs/colonoscopes/example-cert.pem --key ./certs/colonoscopes/example-key.pem -G --data-urlencode 'match[]={type=~"aporeto|database"}'

This request will pull all current metrics.

Subset of output:

http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="GET",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/oidcproviders",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 2 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 31 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/servicetoken",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 1689 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="PUT",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials/:id",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 25 1579736744156

Example of Prometheus configuration used to scrape data from the Aporeto Prometheus instance:

scrape_configs:
  - job_name: "federate"
    scheme: https
    scrape_interval: 15s
    tls_config:
      ca_file: path-to-ca-cert.pem
      cert_file: path-to-client-cert.pem
      key_file: path-to-client-cert-key.pem
      insecure_skip_verify: false

    honor_labels: true
    metrics_path: "/prometheus/federate"

    params:
      "match[]":
        - '{type=~"aporeto|database"}'

    static_configs:
      - targets:
          - "<fqdn>"