15.9. Using High-Speed Interconnects with MySQL Cluster

Even before design of NDB Cluster began in 1996, it was evident that one of the major problems to be encountered in building parallel databases would be communication between the nodes in the network. For this reason, NDB Cluster was designed from the very beginning to allow for the use of a number of different data transport mechanisms. In this manual, we use the term transporter for these.

The MySQL Cluster codebase includes support for four different transporters:

TCP/IP using 100 Mbps or gigabit Ethernet, as discussed in Section 15.3.4.7, “Cluster TCP/IP Connections”.
Direct (machine-to-machine) TCP/IP; although this transporter uses the same TCP/IP protocol as mentioned in the previous item, it requires setting up the hardware differently and is configured differently as well. For this reason, it is considered a separate transport mechanism for MySQL Cluster. See Section 15.3.4.8, “TCP/IP Connections Using Direct Connections”, for details.
Shared memory (SHM). For more information about SHM, see Section 15.3.4.9, “Shared-Memory Connections”.
Scalable Coherent Interface (SCI), as described in Section 15.3.4.10, “SCI Transport Connections”.

Most users today employ TCP/IP over Ethernet because it is ubiquitous. TCP/IP is also by far the best-tested transporter for use with MySQL Cluster.

We are working to make sure that communication with the ndbd process is made in “chunks” that are as large as possible because this benefits all types of data transmission.

For users who desire it, it is also possible to use cluster interconnects to enhance performance even further. There are two ways to achieve this: either a custom transporter can be designed to handle this case, or you can use socket implementations that bypass the TCP/IP stack to one extent or another. We have experimented with both of these techniques using the SCI (Scalable Coherent Interface) technology developed by Dolphin.

15.9.1. Configuring MySQL Cluster to use SCI Sockets

In this section, we show how to adapt a cluster configured for normal TCP/IP communication to use SCI Sockets instead. This documentation is based on SCI Sockets version 2.3.0 as of 01 October 2004.

Prerequisites

Any machines with which you wish to use SCI Sockets must be equipped with SCI cards.

No special builds (other than the -max builds) are needed for SCI Sockets because it uses normal TCP/IP socket calls which are already available in MySQL Cluster. However, SCI Sockets are currently supported only on the Linux 2.4 and 2.6 kernels. For other operating systems, you can use SCI Transporters, but this requires that the server be built using --with-ndb-sci=/opt/DIS.

Prior to MySQL 5.0.44, there were issues with building MySQL Cluster with SCI support (see Bug#25470), but these have been resolved due to work contributed by Dolphin International. SCI Sockets are now correctly supported for MySQL Cluster using the -max builds, and versions of MySQL Cluster with SCI Transporter support can be built using either of compile-amd64-max-sci or compile-pentium64-max-sci. Both of these build scripts can be found in the BUILD directory of the MySQL 5.0 source; it should not be difficult to adapt them for other platforms.

There are essentially four requirements for SCI Sockets:

Building the SCI Socket libraries.
Installing the SCI Socket kernel libraries.
Installing one or two configuration files.
Enabling the SCI Socket kernel library, either for the entire machine or for the shell where the MySQL Cluster processes are started.

This process needs to be repeated for each machine in the cluster where you plan to use SCI Sockets for inter-node communication.

Two packages need to be retrieved to get SCI Sockets working:

The source code package containing the DIS support libraries for the SCI Sockets libraries.
The source code package for the SCI Socket libraries themselves.

Currently, these are available only in source code format. The latest versions of these packages at the time of this writing were available as (respectively) DIS_GPL_2_5_0_SEP_10_2004.tar.gz and SCI_SOCKET_2_3_0_OKT_01_2004.tar.gz. You should be able to find these (or possibly newer versions) at http://www.dolphinics.no/support/downloads.html.

Package Installation

Once you have obtained the library packages, the next step is to unpack them into appropriate directories, with the SCI Sockets library unpacked into a directory below the DIS code. Next, you need to build the libraries. This example shows the commands used on Linux/x86 to perform this task:

shell> tar xzf DIS_GPL_2_5_0_SEP_10_2004.tar.gz
shell> cd DIS_GPL_2_5_0_SEP_10_2004/src/
shell> tar xzf ../../SCI_SOCKET_2_3_0_OKT_01_2004.tar.gz
shell> cd ../adm/bin/Linux_pkgs
shell> ./make_PSB_66_release

It is possible to build these libraries for some 64-bit processors. To build the libraries for Opteron CPUs using the 64-bit extensions, run make_PSB_66_X86_64_release rather than make_PSB_66_release. If the build is made on an Itanium machine, you should use make_PSB_66_IA64_release. The X86-64 variant should work for Intel EM64T architectures but this has not yet (to our knowledge) been tested.

Once the build process is complete, the compiled libraries will be found in a zipped tar file with a name along the lines of DIS-<operating-system>-time-date. It is now time to install the package in the proper place. In this example we will place the installation in /opt/DIS. (Note: You will most likely need to run the following as the system root user.)

shell> cp DIS_Linux_2.4.20-8_181004.tar.gz /opt/
shell> cd /opt
shell> tar xzf DIS_Linux_2.4.20-8_181004.tar.gz
shell> mv DIS_Linux_2.4.20-8_181004 DIS

Network Configuration

Now that all the libraries and binaries are in their proper place, we need to ensure that the SCI cards have proper node IDs within the SCI address space.

It is also necessary to decide on the network structure before proceeding. There are three types of network structures which can be used in this context:

A simple one-dimensional ring
One or more SCI switches with one ring per switch port
A two- or three-dimensional torus.

Each of these topologies has its own method for providing node IDs. We discuss each of them in brief.

A simple ring uses node IDs which are non-zero multiples of 4: 4, 8, 12,...

The next possibility uses SCI switches. An SCI switch has 8 ports, each of which can support a ring. It is necessary to make sure that different rings use different node ID spaces. In a typical configuration, the first port uses node IDs below 64 (4 – 60), the next 64 node IDs (68 – 124) are assigned to the next port, and so on, with node IDs 452 – 508 being assigned to the eighth port.

Two- and three-dimensional torus network structures take into account where each node is located in each dimension, incrementing by 4 for each node in the first dimension, by 64 in the second dimension, and (where applicable) by 1024 in the third dimension. See Dolphin's Web site for more thorough documentation.
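
The three numbering schemes just described can be sketched as small shell functions. This is an illustration only: the function names are ours and are not part of the Dolphin tools, and the torus function assumes that the node at the origin has ID 4, which the description above does not state (it gives only the increments). Consult Dolphin's documentation before assigning IDs on real hardware.

```shell
# Sketch of the node ID schemes described above.
# Function names are illustrative; they are not Dolphin utilities.

# Simple ring: the Nth node (counting from 1) gets the Nth non-zero
# multiple of 4.
ring_node_id() {
    echo $(( 4 * $1 ))
}

# Switch: port P (1-8), Nth node (1-15) on the ring attached to that port.
# Port 1 covers 4-60, port 2 covers 68-124, ..., port 8 covers 452-508.
switch_node_id() {
    local port=$1 n=$2
    echo $(( 64 * (port - 1) + 4 * n ))
}

# Torus: zero-based coordinates x, y, z with steps of 4, 64, and 1024.
# ASSUMPTION: node (0,0,0) has ID 4; only the increments are documented.
torus_node_id() {
    local x=$1 y=${2:-0} z=${3:-0}
    echo $(( 4 + 4 * x + 64 * y + 1024 * z ))
}

ring_node_id 3        # third node on a simple ring
switch_node_id 2 1    # first node on the second switch port
switch_node_id 8 15   # last node on the eighth switch port
torus_node_id 1 2     # 2D torus, x=1, y=2 (under the assumption above)
```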

In our testing we have used switches, although most large cluster installations use 2- or 3-dimensional torus structures. The advantage provided by switches is that, with dual SCI cards and dual switches, it is possible to build with relative ease a redundant network where the average failover time on the SCI network is on the order of 100 microseconds. This is supported by the SCI transporter in MySQL Cluster and is also under development for the SCI Socket implementation.

Failover for the 2D/3D torus is also possible but requires sending out new routing indexes to all nodes. However, this requires only 100 milliseconds or so to complete and should be acceptable for most high-availability cases.

By placing cluster data nodes properly within the switched architecture, it is possible to use 2 switches to build a structure whereby 16 computers can be interconnected and no single failure can hinder more than one of them. With 32 computers and 2 switches it is possible to configure the cluster in such a manner that no single failure can cause the loss of more than two nodes; in this case, it is also possible to know which pair of nodes is affected. Thus, by placing the two nodes in separate node groups, it is possible to build a “safe” MySQL Cluster installation.

To set the node ID for an SCI card, use the following command in the /opt/DIS/sbin directory. In this example, -c 1 refers to the number of the SCI card (this is always 1 if there is only 1 card in the machine); -a 0 refers to adapter 0; and -n 68 sets the node ID to 68:

shell> ./sciconfig -c 1 -a 0 -n 68

If you have multiple SCI cards in the same machine, you can determine which card is in which slot by issuing the following command (again we assume that the current working directory is /opt/DIS/sbin):

shell> ./sciconfig -c 1 -gsn

This will give you the SCI card's serial number. Then repeat this procedure with -c 2, and so on, for each card in the machine. Once you have matched each card with a slot, you can set node IDs for all cards.
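
On a machine with several cards, the repetition can be wrapped in a short loop. The following is a minimal sketch that uses only the sciconfig options shown above; the SCICONFIG variable and the list_card_serials function are names we have chosen for illustration, and the installation path is the /opt/DIS location assumed throughout this section.

```shell
# SCICONFIG and list_card_serials are names chosen for this sketch.
# The default path assumes the /opt/DIS installation used in this section.
SCICONFIG=${SCICONFIG:-/opt/DIS/sbin/sciconfig}

# list_card_serials N: ask cards 1..N for their serial numbers, using the
# same "-c <card> -gsn" invocation shown above.
list_card_serials() {
    local card=1
    while [ "$card" -le "$1" ]; do
        echo "card $card:"
        "$SCICONFIG" -c "$card" -gsn
        card=$(( card + 1 ))
    done
}
```

For a machine with two cards, this would be invoked as list_card_serials 2.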

After the necessary libraries and binaries are installed, and the SCI node IDs are set, the next step is to set up the mapping from hostnames (or IP addresses) to SCI node IDs. This is done in the SCI sockets configuration file, which should be saved as /etc/sci/scisock.conf. In this file, each SCI node ID is mapped through the proper SCI card to the hostname or IP address that it is to communicate with. Here is a very simple example of such a configuration file:

#host           #nodeId
alpha           8
beta            12
192.168.10.20   16
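
Because a mistyped node ID in this file can be hard to diagnose later, it may be worth checking the mapping before installing the drivers. The following sketch is not a Dolphin tool; check_scisock_conf is a hypothetical helper. It assumes that the file consists of two whitespace-separated columns with # comment lines, as in the example above, and that every node ID is a non-zero multiple of 4 used by only one card, which holds for the numbering schemes described earlier. The real parser may accept more than this.

```shell
# check_scisock_conf: read an scisock.conf-style mapping on standard input
# and report suspicious entries. Exit status is 0 if nothing was reported.
# ASSUMPTION: two columns (host, node ID) and "#" comment lines only.
check_scisock_conf() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }    # skip blank lines and comments
        NF != 2 {
            printf "line %d: expected a host and a node ID\n", NR
            bad = 1
            next
        }
        ($2 !~ /^[0-9]+$/) || ($2 % 4 != 0) || ($2 == 0) {
            printf "line %d: node ID %s is not a non-zero multiple of 4\n", NR, $2
            bad = 1
        }
        seen[$2]++ {
            printf "line %d: node ID %s is already in use\n", NR, $2
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# The example file shown above passes without any messages.
check_scisock_conf <<'EOF'
#host           #nodeId
alpha           8
beta            12
192.168.10.20   16
EOF
```

To check an installed file, redirect it to the function's standard input, for example check_scisock_conf < /etc/sci/scisock.conf.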

It is also possible to limit the configuration so that it applies only to a subset of the available ports for these hosts. An additional configuration file /etc/sci/scisock_opt.conf can be used to accomplish this, as shown here:

#-key                        -type        -values
EnablePortsByDefault                yes
EnablePort                  tcp           2200
DisablePort                 tcp           2201
EnablePortRange             tcp           2202 2219
DisablePortRange            tcp           2220 2231

Driver Installation

With the configuration files in place, the drivers can be installed.

First, the low-level drivers and then the SCI socket driver need to be installed:

shell> cd /opt/DIS/sbin/
shell> ./drv-install add PSB66
shell> ./scisocket-install add

If desired, the installation can be checked by invoking a script which verifies that all nodes in the SCI socket configuration files are accessible:

shell> cd /opt/DIS/sbin/
shell> ./status.sh

If you discover an error and need to change the SCI socket configuration, it is necessary to use ksocketconfig to accomplish this task:

shell> cd /opt/DIS/util
shell> ./ksocketconfig -f

Testing the Setup

To ensure that SCI sockets are actually being used, you can employ the latency_bench test program. Using this utility's server component, clients can connect to the server to test the latency of the connection. Determining whether SCI is enabled should be fairly simple from observing the latency. (Note: Before using latency_bench, it is necessary to set the LD_PRELOAD environment variable as shown later in this section.)

To set up a server, use the following:

shell> cd /opt/DIS/bin/socket
shell> ./latency_bench -server

To run a client, use latency_bench again, except this time with the -client option:

shell> cd /opt/DIS/bin/socket
shell> ./latency_bench -client server_hostname

SCI socket configuration should now be complete and MySQL Cluster ready to use both SCI Sockets and the SCI transporter (see Section 15.3.4.10, “SCI Transport Connections”).

Starting the Cluster

The next step in the process is to start MySQL Cluster. To enable usage of SCI Sockets it is necessary to set the environment variable LD_PRELOAD before starting ndbd, mysqld, and ndb_mgmd. This variable should point to the kernel library for SCI Sockets.

To start ndbd in a bash shell, do the following:

bash-shell> export LD_PRELOAD=/opt/DIS/lib/libkscisock.so
bash-shell> ndbd

In a tcsh environment the same thing can be accomplished with:

tcsh-shell> setenv LD_PRELOAD /opt/DIS/lib/libkscisock.so
tcsh-shell> ndbd

Note: MySQL Cluster can use only the kernel variant of SCI Sockets.
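
Exporting LD_PRELOAD affects every program subsequently started from that shell. If you would rather enable SCI Sockets for the cluster processes only, the variable can be set for a single command instead. The following is a minimal sketch; run_with_sci and SCI_LIB are names chosen for illustration, and the library path is the one used in the examples above.

```shell
# run_with_sci and SCI_LIB are names chosen for this sketch.
# The default path is the library location used in this section.
SCI_LIB=${SCI_LIB:-/opt/DIS/lib/libkscisock.so}

# run_with_sci CMD [ARGS...]: start one command with the SCI Sockets
# kernel library preloaded, leaving the calling shell's environment alone.
run_with_sci() {
    if [ ! -r "$SCI_LIB" ]; then
        echo "run_with_sci: cannot read $SCI_LIB" >&2
        return 1
    fi
    # env sets LD_PRELOAD for this one process only.
    env LD_PRELOAD="$SCI_LIB" "$@"
}
```

With this function defined, run_with_sci ndbd starts a data node with the library preloaded, and the same form can be used for mysqld and ndb_mgmd.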

15.9.2. Understanding the Impact of Cluster Interconnects

The ndbd process has a number of simple constructs which are used to access the data in a MySQL Cluster. We have created a very simple benchmark to check the performance of each of these and the effects which various interconnects have on their performance.

There are four access methods:

Primary key access
This is access of a record through its primary key. In the simplest case, only one record is accessed at a time, which means that the full cost of setting up a number of TCP/IP messages and a number of costs for context switching are borne by this single request. In the case where multiple primary key accesses are sent in one batch, those accesses share the cost of setting up the necessary TCP/IP messages and context switches. If the TCP/IP messages are for different destinations, additional TCP/IP messages need to be set up.
Unique key access
Unique key accesses are similar to primary key accesses, except that a unique key access is executed as a read on an index table followed by a primary key access on the table. However, only one request is sent from the MySQL Server, and the read of the index table is handled by ndbd. Such requests also benefit from batching.
Full table scan
When no indexes exist for a lookup on a table, a full table scan is performed. This is sent as a single request to the ndbd process, which then divides the table scan into a set of parallel scans on all cluster ndbd processes. In future versions of MySQL Cluster, an SQL node will be able to filter some of these scans.
Range scan using ordered index
When an ordered index is used, it performs a scan in the same manner as the full table scan, except that it scans only those records which are in the range used by the query transmitted by the MySQL server (SQL node). All partitions are scanned in parallel when all bound index attributes include all attributes in the partitioning key.

To check the base performance of these access methods, we have developed a set of benchmarks. One such benchmark, testReadPerf, tests simple and batched primary and unique key accesses. This benchmark also measures the setup cost of range scans by issuing scans returning a single record. There is also a variant of this benchmark which uses a range scan to fetch a batch of records.

In this way, we can determine the cost of both a single key access and a single record scan access, as well as measure the impact of the communication medium on the base access methods.

In our tests, we ran the base benchmarks for both a normal transporter using TCP/IP sockets and a similar setup using SCI sockets. The figures reported in the following table are for small accesses of 20 records per access. The difference between serial and batched access decreases by a factor of 3 to 4 when using 2KB records instead. SCI Sockets were not tested with 2KB records. Tests were performed on a cluster with 2 data nodes running on 2 dual-CPU machines equipped with AMD MP1900+ processors.

Access Type	TCP/IP Sockets	SCI Sockets
Serial pk access	400 microseconds	160 microseconds
Batched pk access	28 microseconds	22 microseconds
Serial uk access	500 microseconds	250 microseconds
Batched uk access	70 microseconds	36 microseconds
Indexed eq-bound	1250 microseconds	750 microseconds
Index range	24 microseconds	12 microseconds
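
To make the comparison easier to read, the ratio of the TCP/IP figure to the SCI figure can be computed for each row. The sketch below does this with awk; the sci_speedup function and its label|tcp|sci input format are made up for this illustration, and the numbers are simply those from the table above.

```shell
# sci_speedup: for each "label|tcp_usec|sci_usec" line on standard input,
# print the label, a tab, and the ratio tcp/sci to two decimal places.
# The function name and input format are invented for this sketch.
sci_speedup() {
    awk -F'|' '{ printf "%s\t%.2f\n", $1, $2 / $3 }'
}

# Figures copied from the table above (microseconds).
sci_speedup <<'EOF'
Serial pk access|400|160
Batched pk access|28|22
Serial uk access|500|250
Batched uk access|70|36
Indexed eq-bound|1250|750
Index range|24|12
EOF
```

A ratio of 2.00 corresponds to SCI Sockets completing the access in half the time taken over TCP/IP sockets.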

We also performed another set of tests to check the performance of SCI Sockets vis-à-vis that of the SCI transporter, and both of these as compared with the TCP/IP transporter. All these tests used primary key accesses either serially and multi-threaded, or multi-threaded and batched.

The tests showed that SCI sockets were about 100% faster than TCP/IP. In most cases, the SCI transporter was faster than SCI sockets. One notable exception occurred with many threads in the test program, which showed that the SCI transporter did not perform very well when used for the mysqld process.

Our overall conclusion was that, for most benchmarks, using SCI sockets improves performance by approximately 100% over TCP/IP, except in rare instances when communication performance is not an issue. This can occur when scan filters make up most of the processing time or when very large batches of primary key accesses are used. In that case, the CPU processing in the ndbd processes becomes a fairly large part of the overhead.

Using the SCI transporter instead of SCI Sockets is only of interest in communicating between ndbd processes. Using the SCI transporter is also only of interest if a CPU can be dedicated to the ndbd process because the SCI transporter ensures that this process will never go to sleep. It is also important to ensure that the ndbd process priority is set in such a way that the process does not lose priority due to running for an extended period of time, as can be done by locking processes to CPUs in Linux 2.6. If such a configuration is possible, the ndbd process will benefit by 10–70% as compared with using SCI sockets. (The larger figures will be seen when performing updates and probably on parallel scan operations as well.)

There are several other optimized socket implementations for computer clusters, including Myrinet, Gigabit Ethernet, InfiniBand, and the VIA interface. We have tested MySQL Cluster so far only with SCI sockets. See Section 15.9.1, “Configuring MySQL Cluster to use SCI Sockets”, for information on how to set up SCI sockets using ordinary TCP/IP for MySQL Cluster.