From d07d5ebf1a13a9cf1093b7e7345948fbfbcadff4 Mon Sep 17 00:00:00 2001 From: Kamesh Akella Date: Wed, 2 Oct 2024 04:43:25 -0400 Subject: [PATCH] Add Health Checks for Multi-Site deployment to HA Docs Closes #33143 Signed-off-by: Kamesh Akella Signed-off-by: Alexander Schwartz Signed-off-by: Kamesh Akella Co-authored-by: Alexander Schwartz Co-authored-by: andymunro <48995441+andymunro@users.noreply.github.com> --- .../health-checks-multi-site.adoc | 113 ++++++++++++++++++ .../high-availability/introduction.adoc | 1 + docs/guides/high-availability/pinned-guides | 1 + 3 files changed, 115 insertions(+) create mode 100644 docs/guides/high-availability/health-checks-multi-site.adoc diff --git a/docs/guides/high-availability/health-checks-multi-site.adoc b/docs/guides/high-availability/health-checks-multi-site.adoc new file mode 100644 index 00000000000..01deaaf4d56 --- /dev/null +++ b/docs/guides/high-availability/health-checks-multi-site.adoc @@ -0,0 +1,113 @@ +<#import "/templates/guide.adoc" as tmpl> +<#import "/templates/links.adoc" as links> + +<@tmpl.guide +title="Health checks for multi-site deployments" +summary="Validating the health of a multi-site deployment" > + +When running the <@links.ha id="introduction" /> in a Kubernetes environment, +you should automate checks to see if everything is up and running as expected. + +This page provides an overview of URLs, +Kubernetes resources, and Healthcheck endpoints available to verify a multi-site setup of {project_name}. + +== Overview + +A proactive monitoring strategy aims to detect and alert about issues before they impact users. This strategy is the key for a highly resilient and highly available {project_name} application. + +Health checks across various architectural components (such as application health, load balancing, caching, and overall system status) are critical for: + +Ensuring high availability:: Verifying that all sites and the load balancer are operational is a key to ensure that a system can handle requests even if one site goes down. + +Maintaining performance:: Checking the health and distribution of the {jdgserver_name} cache ensures that {project_name} can maintain optimal performance by efficiently handling sessions and other temporary data. + +Operational resilience:: By continuously monitoring the health of both {project_name} and its dependencies within the Kubernetes environment, the system can quickly identify and possibly auto-remediate issues, reducing downtime. + +== Prerequisites + +. https://kubernetes.io/docs/tasks/tools/#kubectl[Kubectl CLI is installed and configured]. + +. Install https://jqlang.github.io/jq/download/[jq] if it is not already installed on your operating system. + +== Specific health checks + +=== {project_name} load balancer and sites + +Verifies the health of the {project_name} application through its load balancer and both primary and backup sites. This ensures that {project_name} is accessible and that the load balancing mechanism is functioning correctly across different geographical or network locations. + +This command returns the health status of the {project_name} application's connection to its configured database, thus confirming the reliability of database connections. +This command is available only on the management port and not from the external URL. +In a Kubernetes setup, the sub-status `health/ready` is checked periodically to make the Pod as ready. + +[source,bash] +---- +curl -s https://keycloak:managementport/health +---- + +This command verifies the `lb-check` endpoint of the load balancer and ensures the {project_name} application cluster is up and running. +[source,bash] +---- +curl -s https://keycloak-load-balancer-url/lb-check +---- + +These commands will return the running status of the Site A and Site B of the {project_name} in a multi-site setup. + +[source,bash] +---- +curl -s https://keycloak_site_a_url/lb-check +curl -s https://keycloak_site_b_url/lb-check +---- + +=== {jdgserver_name} Cache health +Check the health of the default cache manager and individual caches in an external {jdgserver_name} cluster. +This check is vital for {project_name} performance and reliability, +as {jdgserver_name} is often used for distributed caching and session clustering in {project_name} deployments. + +This command returns the overall health of the {jdgserver_name} cache manager, which is useful as the Admin user does not need to provide user credentials to get the health status. +[source,bash] +---- +curl -s https://infinispan_rest_url/rest/v2/cache-managers/default/health/status +---- + +In contrast to the preceding health checks, the following health checks require the Admin user to provide the {jdgserver_name} user credentials as part of the request to peek into the overall health of the external {jdgserver_name} cluster caches. +[source,bash] +---- +curl -u : -s https://infinispan_rest_url/rest/v2/cache-managers/default/health \ + | jq 'if .cluster_health.health_status == "HEALTHY" and (all(.cache_health[].status; . == "HEALTHY")) then "HEALTHY" else "UNHEALTHY" end' +---- + +The `jq` filter is a convenience to compute the overall health based on the individual cache health. +You can also choose to run the above command without the `jq` filter to see the full details. + +=== {jdgserver_name} Cluster distribution +Assesses the distribution health of the {jdgserver_name} cluster, ensuring that the cluster's nodes are correctly distributing data. This step is essential for the scalability and fault tolerance of the caching layer. + +You can modify the `expectedCount 3` argument to match the total nodes in the cluster and validate if they are healthy or not. +[source,bash] +---- +curl : -s https://infinispan_rest_url/rest/v2/cluster\?action\=distribution \ + | jq --argjson expectedCount 3 'if map(select(.node_addresses | length > 0)) | length == $expectedCount then "HEALTHY" else "UNHEALTHY" end' +---- + +=== Overall, {jdgserver_name} system health +Uses the `kubectl` CLI tool to query the health status of {jdgserver_name} clusters and the {project_name} service in the specified namespace. This comprehensive check ensures that all components of the {project_name} deployment are operational and correctly configured within the Kubernetes environment. + +[source,bash] +---- +kubectl get infinispan -n -o json \ +| jq '.items[].status.conditions' \ +| jq 'map({(.type): .status})' \ +| jq 'reduce .[] as $item ([]; . + [keys[] | select($item[.] != "True")]) | if length == 0 then "HEALTHY" else "UNHEALTHY: " + (join(", ")) end' +---- + +=== {project_name} readiness in Kubernetes +Specifically, checks for the readiness and rolling update conditions of {project_name} deployments in Kubernetes, +ensuring that the {project_name} instances are fully operational and not undergoing updates that could impact availability. + +[source,bash] +---- +kubectl wait --for=condition=Ready --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n +kubectl wait --for=condition=RollingUpdate=False --timeout=10s keycloaks.k8s.keycloak.org/keycloak -n +---- + + diff --git a/docs/guides/high-availability/introduction.adoc b/docs/guides/high-availability/introduction.adoc index 029f18436b6..40f2117eaa6 100644 --- a/docs/guides/high-availability/introduction.adoc +++ b/docs/guides/high-availability/introduction.adoc @@ -39,6 +39,7 @@ Additional performance tuning and security hardening are still recommended when * <@links.ha id="operate-synchronize" /> * <@links.ha id="operate-site-offline" /> * <@links.ha id="operate-site-online" /> +* <@links.ha id="health-checks-multi-site" /> diff --git a/docs/guides/high-availability/pinned-guides b/docs/guides/high-availability/pinned-guides index f7378971816..01a52a59d98 100644 --- a/docs/guides/high-availability/pinned-guides +++ b/docs/guides/high-availability/pinned-guides @@ -6,6 +6,7 @@ deploy-infinispan-kubernetes-crossdc deploy-keycloak-kubernetes deploy-aws-route53-loadbalancer deploy-aws-route53-failover-lambda +health-checks-multi-site operate-failover operate-switch-over operate-network-partition-recovery