Identity-Based Microsegmentation Guide
LAST UPDATED: October 8, 2021

Monitor

Checking the health of a Microsegmentation Console

The Microsegmentation Console provides a healthchecks endpoint that does not require authentication. An example query follows.

curl https://microsegmentation.acme.com/healthchecks\?quiet\=true | jq

Example response:

[
  {
    "responseTime": "195µs",
    "status": "Operational",
    "type": "Cache"
  },
  {
    "responseTime": "2ms",
    "status": "Operational",
    "type": "Database"
  },
  {
    "responseTime": "346µs",
    "status": "Operational",
    "type": "MessagingSystem"
  },
  {
    "responseTime": "0s",
    "status": "Operational",
    "type": "Service"
  },
  {
    "responseTime": "2ms",
    "status": "Operational",
    "type": "TSDB"
  }
]

It also returns one of the following HTTP status codes.

  • 200: the services and API gateway are operational
  • 218: a critical service is down
  • 503: the API gateway is down

You can use the endpoint as an external health check probe.
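For example, a minimal external probe might only check the HTTP status code. The following sketch assumes the example Console address used above and treats anything other than 200 as unhealthy.

# Probe the public healthchecks endpoint and report based on the HTTP status code.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' 'https://microsegmentation.acme.com/healthchecks?quiet=true')
if [ "$STATUS" = "200" ]; then
  echo "healthy"
else
  echo "unhealthy (HTTP $STATUS)"
fi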

If you prefer, you can also use apoctl to discover the status of the Microsegmentation Console. For example, the following command returns status information in table format.

apoctl api list healthchecks -o table -c type,status

A healthy Microsegmentation Console would respond with the following.

    status    |      type
--------------+------------------
  Operational | Cache
  Operational | Database
  Operational | MessagingSystem
  Operational | Service
  Operational | TSDB

Enabling additional monitoring tools

About the additional monitoring tools

After migrating to the multi-container Microsegmentation Console you can use the following monitoring tools.

  • Prometheus and Grafana to scrape and display metrics about resource usage
  • Jaeger to help you trace and debug issues with the Microsegmentation Console API

These tools are not available for single container deployments.

Enabling additional monitoring tools

The first time you create your Voila environment, the installer will ask you if you want to deploy the monitoring facilities:

[Optional] Do you want to configure the monitoring stack? (y/n) default is no: y

[Optional] What will be the public url of the metrics monitoring dashboard?
> Public Monitoring URL (example: https://monitoring.aporeto.com, leave it empty to not deploy this compoment (default): https://monitoring.aporeto.company.tld

[Optional] What will be the public url of the metrics API proxy endpoint (used for prometheus federation)?
> Public metric URL (example: https://metrics.aporeto.com, leave it empty to not deploy this compoment (default):

[Optional] Do you want to install the opentracing facility? (y/n) default is no: y

[Optional] Do you want to install the logging facility? (y/n) default is yes: y

You can modify the deployment of the monitoring facilities after installation as explained below.

To enable the monitoring stack:

set_value global.public.monitoring https://monitoring.microsegmentation.acme.com override

This will deploy Prometheus to scrape service data and serve it through Grafana dashboards. It also enables alerts (see below). From there you can add more facilities.

To enable the logging facility (recommended):

set_value enabled true|false loki override

This will deploy or remove Loki to scrape all service logs. This is useful for reporting any crashes and for troubleshooting issues.

To enable the tracing facility (optional):

set_value enabled true|false jaeger override
set_value enabled true|false elasticsearch override

This will deploy or remove the tracing facility, which allows you to look at every Microsegmentation Console API request. In general, we recommend enabling it to help during development or to analyze slow requests. Be aware that it can generate several gigabytes of data every hour, causing storage to fill up very quickly.

To enable the metrics proxy (optional):

set_value global.public.metrics https://metrics.microsegmentation.acme.com override

This will create a service that you can use to remotely query alerts and metrics from outside and to perform Prometheus federation (see below).

You can pick which facilities to deploy depending on your needs. Once you have enabled the tools that you need, issue the following command to apply the changes.

doit

Monitoring dashboards

Use the following command to obtain the monitoring dashboard URL.

get_value global.public.monitoring

To access the monitoring dashboards, you need an auditer certificate.

Use the gen-auditer tool from your voila environment to generate a p12 certificate file in certs/auditers. You must import this certificate into your workstation.
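How you import the certificate depends on your workstation. As an illustration, on macOS you could add it to the login keychain; the file name below is a placeholder for whatever gen-auditer produced in certs/auditers.

# Import the auditer p12 into the macOS login keychain (prompts for the p12 passphrase).
security import certs/auditers/<name>.p12 -k ~/Library/Keychains/login.keychain-db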

Dashboard metrics

Several metrics are collected using Prometheus and displayed in Grafana dashboards. The available dashboards are described below.

This dashboard gives you an operational overview of the platform, the load on the nodes, the storage, and the current alerts and logs. This is your go-to dashboard when you want to check the health of your platform.

This dashboard gives you more details about the resource usage of nodes and services. You can use these to pinpoint any compute resource contention.

This dashboard provides an overview of the state of the Microsegmentation Console, the general state of its components, and compute resource usage. This is your second go-to dashboard when you want to check the health of your platform.

This dashboard provides a detailed view of all the microservices. You can use it mostly to debug issues and track leaks.

This dashboard provides a Kubernetes resources allocation view. You can use it to locate resource starvation or overuse.

This dashboard provides advanced information about MongoDB and sharding.

This dashboard is reachable via the Explore feature on Grafana, available from the compass icon on the left. Use the top bar to select the facility you wish to explore. If you enabled the tracing facility, you can select jaeger-aporeto to see the traces.

You can log in to Grafana as an admin if needed. The username is admin, and you can retrieve the password with get_value global.accounts.grafana.pass.

By default, there is no data persistence for the dashboards. If you want to make persistent changes, enable persistence by adding storage to Grafana with:

set_value storage.class <sc> grafana override

Where <sc> is the storage class of your Kubernetes cluster.
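If you are unsure which storage classes your cluster provides, you can list them with kubectl (assuming you have access to the cluster running the Console):

# List the storage classes available in the cluster.
kubectl get storageclass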

Then update the Grafana deployment with the following command.

snap -u grafana --force

Configuring alerts

Using the gathered metrics, a number of alerts are defined to check the health of the Microsegmentation Console and report issues.

Alert | Description | What to do
Node autoscale daily report | Daily report of node scaling | N/A
Service autoscale daily report | Daily report of service scaling | N/A
Service autoscaled limit | When a service reaches its scaling limits | Check resource usage
API response time increased | When the response time of an API has increased | Nothing if it's transient; otherwise check resource contention
API response time degraded | When the global response time is getting too high | Nothing if it's transient; otherwise check resource contention
Service restarted | When a service restarts | Check the service logs via Grafana > Explore > Loki
Error 5xx detected | When a service reports a 5xx error | Check the service logs via Grafana > Explore > Loki
Crash detected | When a service is crashing | Check the service logs via Grafana > Explore > Loki
MongoDB node not responding | When a database node is not responding | Check the status of your Kubernetes cluster and pods
MongoDB cluster degraded | When the database cluster is degraded | Check the status of your Kubernetes cluster and pods; use mgos status from voila
MongoDB replication lag is too slow | When the database is having a hard time replicating data | Check resource contention on the MongoDB nodes
Storage capacity is running low | When the storage is running low | Expand the storage
Certificate about to expire | When the public-facing certificate is about to expire | Renew your certificate
Service is not running | When a service is not starting | Check the state of the pod in Kubernetes (k describe pod <podname>)
Infra service stopped responding | When an infra service is not responding | Check the state of the pod in Kubernetes (k describe pod <podname>)
Backend service stopped responding | When a backend service is not responding | Check the state of the pod in Kubernetes (k describe pod <podname>)
High Node CPU Usage | When the CPU usage on a node is too high | Might require scaling up your environment if it persists
High Node Memory Usage | When the memory usage on nodes is too high | Might require scaling up your environment if it persists

Through the /healthchecks API, you can get a summary of the currently firing alerts:

[
  {
    "responseTime": "1.123ms",
    "status": "Operational",
    "type": "Cache"
  },
  {
    "responseTime": "6ms",
    "status": "Operational",
    "type": "Database"
  },
  {
    "status": "Operational",
    "type": "MessagingSystem"
  },
  {
    "alerts": [
      "1 critical active alert reported for database type."
    ],
    "name": "Monitoring",
    "status": "Degraded",
    "type": "General"
  },
  {
    "status": "Operational",
    "type": "Service"
  },
  {
    "responseTime": "7ms",
    "status": "Operational",
    "type": "TSDB"
  }
]

The lack of detail is intentional, as this endpoint is public.
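If you only want to surface components that are not fully operational, you can filter the response with jq. The sketch below uses the same public endpoint shown earlier and relies on the status values in the example output above.

# Print only the entries whose status is not "Operational".
curl -s https://microsegmentation.acme.com/healthchecks | jq '.[] | select(.status != "Operational")'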

Alerts can be sent to a Slack channel by configuring the following:

set_value global.integrations.slack.webhook "https://hooks.slack.com/services/XXX/YYY/ZZZ"
set_value global.integrations.slack.channel "#mychannel"

Then update the Prometheus deployment with the following command.

snap -u prometheus-aporeto --force
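To confirm that the webhook itself is valid, you can post a test message directly to Slack. This only exercises the Slack webhook, not the AlertManager integration; replace the URL with your own webhook.

# Send a test message to the configured Slack channel.
curl -X POST -H 'Content-type: application/json' --data '{"text":"Test message from Microsegmentation Console monitoring"}' "https://hooks.slack.com/services/XXX/YYY/ZZZ"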

If you want to define your own alerting provider, you can pass a custom AlertManager configuration as follows.

In conf.d/prometheus-aporeto/config.yaml, you can define the following:

custom:
  alerts:
    # Your alertmanager configuration goes here
    global:
      ....
  rules:
    # Your Prometheus rules go here
    groups:
    ....
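As an illustration, a custom rule group might look like the following. This is only a sketch: it assumes the standard Prometheus alerting-rule syntax is accepted under custom.rules, and the alert name and threshold are made up.

custom:
  rules:
    groups:
      - name: custom-capacity
        rules:
          # Hypothetical alert based on a capacity metric described later in this guide.
          - alert: EnforcerCountHigh
            expr: sum(aporeto_enforcers_total) > 10000
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "More than 10000 enforcers are registered."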

Configuring a metrics proxy

About the metrics proxy

The Microsegmentation Console uses Prometheus to gather statistics on all microservices and Kubernetes endpoints. The metrics proxy allows you to expose those metrics, for example to perform Prometheus federation.

Generating a client certificate

You need a client certificate to access the Prometheus federation endpoint. Generate one with the following command.

gen-colonoscope

Now you will need to configure the client that will scrape the data from the Prometheus federation endpoint with the following parameters:

  • the metrics endpoint you set above (get_value global.public.metrics)
  • the Prometheus endpoint certificate authority (located in certs/ca-chain-public.pem)
  • the client certificate generated above (located in certs/colonoscopes/<name>-cert.pem)
  • the client key associated with the client certificate (located in certs/colonoscopes/<name>-key.pem)

Once done, the alerts and federate endpoints will be available.

Pulling alerts

You can retrieve alerts from the following endpoint.

https://<fqdn>/alertmanager/api/v1/alerts

(where <fqdn> is what you configured as global.public.metrics).

You can use this to pull alerts, as documented on the Prometheus website.
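Example request (the client certificate paths are illustrative and mirror the federation example later in this guide):

curl -k https://<fqdn>/alertmanager/api/v1/alerts --cert ./certs/colonoscopes/example-cert.pem --key ./certs/colonoscopes/example-key.pem | jq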

Example of output:

  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "Backend service restarted",
        "color": "warning",
        "exported_pod": "canyon-75c9f966dc-g7rgj",
        "icon": ":gear:",
        "prometheus": "default/aporeto",
        "reason": "Completed",
        "recover": "false",
        "severity": "severe"
      },
      "annotations": {
        "summary": "canyon-75c9f966dc-g7rgj restarted. Reason: Completed."
      },
      "startsAt": "2020-01-22T20:59:10.521318559Z",
      "endsAt": "2020-01-22T21:02:10.521318559Z",
      "generatorURL": "http://prometheus-aporeto-0:9090/graph?g0.expr=%28%28sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total%29+-+sum+by%28exported_pod%29+%28kube_pod_container_status_restarts_total+offset+1m%29%29+%21%3D+0%29+-+on%28exported_pod%29+group_right%28%29+count+by%28exported_pod%2C+reason%29+%28kube_pod_container_status_last_terminated_reason+%3E+0%29&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "norecover"
      ],
      "fingerprint": "5a483f5586d6de87"
    }
  ]
}

Federating Prometheus

You can use the following endpoint to federate Prometheus instances together.

https://<fqdn>/prometheus/federate

Example request:

curl -k https://<fqdn>/prometheus/federate --cert ./certs/colonoscopes/example-cert.pem --key ./certs/colonoscopes/example-key.pem -G --data-urlencode 'match[]={type=~"aporeto|database"}'

This request will pull all current metrics.

Subset of output:

http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="GET",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/oidcproviders",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 2 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 31 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="POST",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/servicetoken",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 1689 1579736744156
http_requests_total{code="200",endpoint="health",instance="10.64.241.42:1080",job="health-cactuar",method="PUT",namespace="default",pod="cactuar-5cdddc64c7-sfwp8",service="cactuar",type="aporeto",url="/appcredentials/:id",prometheus="default/aporeto",prometheus_replica="prometheus-aporeto-0"} 25 1579736744156

Example of Prometheus configuration used to scrape data from the Microsegmentation Prometheus instance:

scrape_configs:
  - job_name: "federate"
    scheme: https
    scrape_interval: 15s
    tls_config:
      ca_file: path-to-ca-cert.pem
      cert_file: path-to-client-cert.pem
      key_file: path-to-client-cert-key.pem
      insecure_skip_verify: false

    honor_labels: true
    metrics_path: "/prometheus/federate"

    params:
      "match[]":
        - '{type=~"aporeto|database"}'

    static_configs:
      - targets:
          - "<fqdn>"

Checking capacity

Among all the metrics reported, some capacity metrics are also available:

# HELP aporeto_enforcers_collection_duration_seconds The enforcer count collection duration in seconds.
# TYPE aporeto_enforcers_collection_duration_seconds gauge
aporeto_enforcers_collection_duration_seconds 0.003

# HELP aporeto_enforcers_total The enforcer count metric
# TYPE aporeto_enforcers_total gauge
aporeto_enforcers_total{unreachable="false"} 0
aporeto_enforcers_total{unreachable="true"} 0

# HELP aporeto_flowreports The flowreports metric for interval
# TYPE aporeto_flowreports gauge
aporeto_flowreports{action="accept",interval="15m0s"} 0
aporeto_flowreports{action="reject",interval="15m0s"} 0

# HELP aporeto_flowreports_collection_duration_seconds The flowreports collection duration in seconds.
# TYPE aporeto_flowreports_collection_duration_seconds gauge
aporeto_flowreports_collection_duration_seconds 0.004

# HELP aporeto_namespaces_collection_duration_seconds The namespace count collection duration in seconds.
# TYPE aporeto_namespaces_collection_duration_seconds gauge
aporeto_namespaces_collection_duration_seconds 0.005

# HELP aporeto_namespaces_total The namespaces count metric
# TYPE aporeto_namespaces_total gauge
aporeto_namespaces_total 3

# HELP aporeto_policies_collection_duration_seconds The policies count collection duration in seconds.
# TYPE aporeto_policies_collection_duration_seconds gauge
aporeto_policies_collection_duration_seconds 0.005

# HELP aporeto_policies_total The policies count metric
# TYPE aporeto_policies_total gauge
aporeto_policies_total 7

# HELP aporeto_processingunits_collection_duration_seconds The processing units count collection duration in seconds.
# TYPE aporeto_processingunits_collection_duration_seconds gauge
aporeto_processingunits_collection_duration_seconds 0.004

# HELP aporeto_processingunits_total The processing units count metric
# TYPE aporeto_processingunits_total gauge
aporeto_processingunits_total 0

Those metrics are used on the operational dashboard.
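If you only need one of these counters, you can narrow the federation query with a more specific match[] selector. The certificate paths below reuse the illustrative names from the federation example above.

# Pull only the processing units counter through the federation endpoint.
curl -k https://<fqdn>/prometheus/federate --cert ./certs/colonoscopes/example-cert.pem --key ./certs/colonoscopes/example-key.pem -G --data-urlencode 'match[]=aporeto_processingunits_total'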