Design considerations for a remote site

Deployment of an edge cluster managed from a single central place starts with proper planning. This section provides recommendations on how to approach the design of such a deployment.

Aggregation of compute nodes into availability zones

Mirantis recommends organizing the nodes of each remote site into separate availability zones in the MOSK Compute (OpenStack Nova), Networking (OpenStack Neutron), and Block Storage (OpenStack Cinder) services. This enables cloud users to be aware of the failure domain that a remote site represents and to distribute the parts of their applications accordingly.
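
As an illustration, the following sketch uses the openstacksdk Python library to group the compute nodes of a remote site into a host aggregate mapped to a dedicated availability zone in the Compute service. The cloud name, aggregate name, zone name, and hostnames are examples only; the Networking and Block Storage availability zones are configured separately through their respective services.

  # A minimal sketch, assuming openstacksdk is installed and a cloud named
  # "mosk-central" is defined in clouds.yaml. All names below are examples.
  import openstack

  conn = openstack.connect(cloud="mosk-central")

  # Create a host aggregate mapped to the availability zone of the remote site.
  aggregate = conn.compute.create_aggregate(
      name="remote-site-1",
      availability_zone="az-remote-site-1",
  )

  # Add the compute nodes of the remote site to the aggregate.
  for host in ("cmp-rs1-01", "cmp-rs1-02", "cmp-rs1-03"):
      conn.compute.add_aggregate_host(aggregate, host)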

Storage

Typically, the high latency between the central control plane and remote sites makes it infeasible to rely on Ceph as storage for the instance root/ephemeral and block data.

Mirantis recommends that you configure the remote sites to use the following backends:

  • Local storage (LVM or QCOW2) as a storage backend for the MOSK Compute service. See images-storage-back-end for the configuration details.

  • LVM on iSCSI backend for the MOSK Block Storage service. See Enable LVM block storage for the enablement procedure.

To keep the footprint of a remote site small, its compute nodes need to be hyper-converged and combine the compute and block storage functions.
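
For illustration purposes only, the sketch below uses the official kubernetes Python client to patch an OpenStackDeployment custom resource and switch the Compute service to local image storage. The resource coordinates (API group, version, namespace, object name) and the field paths in the patch are assumptions; follow images-storage-back-end and Enable LVM block storage for the authoritative configuration.

  # Illustrative sketch only: the CR coordinates and the patch field paths
  # are assumptions; consult the referenced enablement procedures for the
  # exact OpenStackDeployment schema.
  from kubernetes import client, config

  config.load_kube_config()
  api = client.CustomObjectsApi()

  patch = {
      "spec": {
          "features": {
              "nova": {
                  # Assumed path: switch the Compute service to local
                  # (LVM/QCOW2) image storage instead of Ceph.
                  "images": {"backend": "local"},
              },
          },
      },
  }

  api.patch_namespaced_custom_object(
      group="lcm.mirantis.com",       # assumed API group of OpenStackDeployment
      version="v1alpha1",             # assumed API version
      namespace="openstack",          # assumed namespace
      plural="openstackdeployments",
      name="osh-dev",                 # replace with the actual object name
      body=patch,
  )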

Site sizing

There is no limitation on the number of remote sites or their size. However, when planning the cluster, ensure consistency between the total number of nodes managed by a single control plane and the value of the size parameter set in the OpenStackDeployment custom resource. For the list of supported sizes, refer to Main elements.
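
A read-only consistency check of this kind can be scripted. The sketch below assumes the kubernetes Python client and illustrative coordinates for the OpenStackDeployment object, including the assumption that the size parameter lives directly under its spec; it simply compares the number of Kubernetes nodes with the configured value.

  # A sanity-check sketch; the CR coordinates and the location of the size
  # parameter within the spec are assumptions.
  from kubernetes import client, config

  config.load_kube_config()

  node_count = len(client.CoreV1Api().list_node().items)

  osdpl = client.CustomObjectsApi().get_namespaced_custom_object(
      group="lcm.mirantis.com",       # assumed API group
      version="v1alpha1",             # assumed API version
      namespace="openstack",          # assumed namespace
      plural="openstackdeployments",
      name="osh-dev",                 # replace with the actual object name
  )

  print(f"Nodes managed by this control plane: {node_count}")
  print(f"Configured size: {osdpl['spec'].get('size')}")  # see Main elements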

Additionally, the sizing of a remote site needs to take into account the characteristics of the network channel to the main site.

Typically, an edge site consists of 3-7 compute nodes installed in a single, usually rented, rack.

Network latency and bandwidth

Mirantis recommends keeping the network latency between the main and remote sites as low as possible. For stable interoperability of the cluster components, the latency needs to be around 30-70 milliseconds. However, depending on the cluster configuration and on how dynamic the workloads running in the remote site are, cluster stability can be preserved with a latency of up to 190 milliseconds.
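
As a quick way to verify that a candidate channel fits within these thresholds, the following sketch measures the TCP connection round-trip time from a remote site toward an endpoint in the main site. The endpoint address and port are examples only.

  # A minimal latency probe: it measures TCP connect round-trip time and
  # compares it against the thresholds mentioned above (30-70 ms recommended,
  # 190 ms maximum). The endpoint is an example address.
  import socket
  import time

  MAIN_SITE_ENDPOINT = ("203.0.113.10", 443)  # example address and port

  samples = []
  for _ in range(10):
      start = time.monotonic()
      with socket.create_connection(MAIN_SITE_ENDPOINT, timeout=5):
          pass
      samples.append((time.monotonic() - start) * 1000)

  avg_ms = sum(samples) / len(samples)
  print(f"Average RTT: {avg_ms:.1f} ms")
  if avg_ms > 190:
      print("Latency exceeds the 190 ms limit: the site cannot be supported")
  elif avg_ms > 70:
      print("Latency is above the recommended 30-70 ms range")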

The bandwidth of the communication channel between the main and remote sites needs to be sufficient to run the following traffic:

  • The control plane and management traffic, such as OpenStack messaging, database access, MOSK underlay Kubernetes cluster control plane, and so on. A single remote compute node in the idle state requires at least 1.5 Mbit/s of bandwidth for its non-data plane communications.

  • The data plane traffic, such as OpenStack image operations, instance VNC console traffic, and so on, which heavily depends on the profile of the workloads and other aspects of the cloud usage.

In general, Mirantis recommends having a minimum of 100 Mbit/s of bandwidth between the main and remote sites.
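
The control plane portion of this bandwidth can be roughly estimated from the per-node figure above, as in the following sketch; the data plane portion has to be assessed separately based on the expected workload profile, and the value used below is an arbitrary example.

  # A rough sizing helper based on the figures above: at least 1.5 Mbit/s of
  # control plane traffic per idle compute node and a recommended minimum of
  # 100 Mbit/s for the whole channel. Data plane traffic is workload-specific
  # and must be added on top.
  CONTROL_PLANE_MBITS_PER_NODE = 1.5
  RECOMMENDED_MIN_CHANNEL_MBITS = 100.0


  def estimate_channel(compute_nodes: int, data_plane_mbits: float) -> float:
      """Return an estimated bandwidth requirement for a remote site, in Mbit/s."""
      control_plane = compute_nodes * CONTROL_PLANE_MBITS_PER_NODE
      return max(control_plane + data_plane_mbits, RECOMMENDED_MIN_CHANNEL_MBITS)


  # Example: a 7-node site with an assumed 60 Mbit/s of data plane traffic.
  print(estimate_channel(7, 60.0))  # -> 100.0, the recommended minimum still dominates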

Loss of connectivity to the central site

The MOSK remote compute nodes architecture is designed to tolerate a temporary loss of connectivity between the main cluster and the remote sites. In case of a disconnection, the instances running on the remote compute nodes keep running normally, preserving their ability to read and write ephemeral and block storage data, provided that the data is located in the same site, as well as the connectivity to their neighbors and to the edge application users. However, the instances do not have access to any cloud services or applications located outside of their remote site.

Since the MOSK control plane communicates with the remote compute nodes through the same network channel, cloud users cannot perform any manipulations over their edge applications, for example, instance creation, deletion, snapshotting, and so on, until the connectivity is restored. MOSK services that provide high availability to cloud applications, such as the Instance HA service and the Networking service, also need connectivity to the remote compute nodes to fail over the application components running in the remote site.

Once the connectivity between the main and the remote site is restored, all functions become available again. The period during which an edge application can sustain normal operation after a connectivity loss is determined by multiple factors, including the selected networking backend for the MOSK cluster. Mirantis recommends that a cloud operator perform a set of test manipulations over the cloud resources hosted in the remote site to verify that the site has fully recovered.
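
Such a post-recovery check can be as simple as booting and deleting a short-lived test instance in the availability zone of the remote site, for example with openstacksdk as sketched below. The cloud, image, flavor, network, and availability zone names are examples only.

  # A post-recovery smoke test sketch, assuming openstacksdk and example
  # resource names: it boots a test instance in the availability zone of the
  # remote site, waits for it to become ACTIVE, and cleans it up.
  import openstack

  conn = openstack.connect(cloud="mosk-central")   # example cloud name

  server = conn.compute.create_server(
      name="connectivity-smoke-test",
      image_id=conn.image.find_image("cirros").id,           # example image
      flavor_id=conn.compute.find_flavor("m1.tiny").id,       # example flavor
      networks=[{"uuid": conn.network.find_network("test-net").id}],
      availability_zone="az-remote-site-1",                   # example AZ
  )

  # Raises an exception if the instance does not reach ACTIVE in time.
  conn.compute.wait_for_server(server, status="ACTIVE", wait=600)
  print("The remote site accepts workloads again")

  conn.compute.delete_server(server, ignore_missing=True)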

Long-lived graceful restart in Tungsten Fabric

When enabled in Tungsten Fabric-powered clouds, the graceful restart and long-lived graceful restart feature significantly improves the ability of MOSK to sustain the connectivity of workloads running at remote sites when a site loses its connection to the central location hosting the control plane.

Extensive testing has demonstrated that remote sites can effectively withstand a 72-hour control plane disconnection with zero impact on the running applications.

Security of cross-site communication

Given that a remote site communicates with its main MOSK cluster across a wide area network (WAN), it is important to protect sensitive data from being intercepted and viewed by a third party. Specifically, you should ensure the protection of the data belonging to the following cloud components:

  • Mirantis Container Cloud life-cycle management plane

    Bare metal servers provisioning and control, Kubernetes cluster deployment and management, Mirantis StackLight telemetry

  • MOSK control plane

    Communication between the components of OpenStack, Tungsten Fabric, and Mirantis Ceph

  • MOSK data plane

    Cloud application traffic

The most reliable way to protect the data is to configure the network equipment in the data center and in the remote site to encapsulate all remote-to-main communications in an encrypted VPN tunnel. Alternatively, Mirantis Container Cloud and MOSK can be configured to force encryption of specific types of network traffic, such as:

  • Kubernetes networking of the MOSK underlying Kubernetes cluster, which handles the vast majority of in-MOSK communications

  • OpenStack tenant networking that carries all the cloud application traffic

The ability to enforce traffic encryption depends on the specific versions of Mirantis Container Cloud and MOSK in use, as well as on the selected SDN backend for OpenStack.