This is the second part of an investigation to solve a data corruption issue encountered on a Gigabyte ARM64 R120-MP31 board. In the first part I had a look at the application, transport and data link layers. In this second part I perform some tests to validate whether data corruption could be happening in system RAM.

The hardware-software interface

When an incoming frame is received on the 10GbE interface of the XGene-1, the controller is capable of mastering the bus and copying the data directly into system memory. The controller maintains a hardware ring buffer of available DMAable memory regions into which incoming frames can be copied. When the NIC runs out of regions, the ring buffer is refilled by the driver. The DMAable addresses are essentially sk_buffs allocated with netdev_alloc_skb_ip_align. This function allocates a virtual address that is immediately backed by a physical region. When user space processes allocate memory via malloc, the underlying brk or mmap syscalls add a mapping to the virtual address space of the process, but a physical frame is normally not reserved until the first page fault. In this case, however, the newly allocated address must be handed over to the hardware, which accesses system memory without going through the CPU MMU, making it necessary to have a mapping available immediately.

Hardware devices can be restricted from DMA-ing directly to physical addresses. An IOMMU may interpose between the device and memory, performing address translation between the two. The IOMMU will not allow the device to access memory regions that have not been allocated for its I/O, preventing potentially compromised hardware from tampering with the state of the whole system. The kernel allows drivers to obtain a valid DMAable address for the device via the DMA API; in this case dma_map_single is used.
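
To make the mechanism more concrete, here is a minimal sketch of a receive-ring refill step for a generic driver, not the actual xgene-enet code: the buffer length and the descriptor-programming helper (hw_ring_post) are made up for illustration.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/dma-mapping.h>

/* Hypothetical helper: writes the DMA address into a free hardware
 * descriptor. A real driver programs its own descriptor format here. */
static void hw_ring_post(struct net_device *ndev, dma_addr_t addr) { }

/* Allocate one receive buffer and hand its DMA address to the NIC. */
static int rx_refill_one(struct net_device *ndev, struct device *dev,
                         unsigned int buflen)
{
    struct sk_buff *skb;
    dma_addr_t dma_addr;

    /* Backed by a physical page right away, unlike a lazy user-space
     * allocation: the NIC bypasses the CPU MMU. */
    skb = netdev_alloc_skb_ip_align(ndev, buflen);
    if (!skb)
        return -ENOMEM;

    /* Obtain a bus address the device is allowed to DMA to
     * (goes through the IOMMU/SWIOTLB if one is present). */
    dma_addr = dma_map_single(dev, skb->data, buflen, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dma_addr)) {
        dev_kfree_skb_any(skb);
        return -EIO;
    }

    hw_ring_post(ndev, dma_addr);
    return 0;
}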

Retrieving frames from system RAM

In the xgene-enet driver, the function responsible for retrieving frames that have been DMAed to memory is xgene_enet_rx_frame. This function is called by the NAPI polling callback, xgene_enet_napi, registered by the driver upon initialization, and it is essentially responsible for the following operations (sketched in code after the list):

  • it validates the incoming sk_buff, checking for hardware I/O errors
  • it strips off the CRC
  • it disables TCP checksum validation if already performed by the hardware
  • it updates RX counters
  • it passes the sk_buff to the upper layers of the stack via napi_gro_receive
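
The following is a heavily simplified sketch of those five steps. It is illustrative only: the real xgene_enet_rx_frame works on the driver's descriptor ring, whereas here the hardware status is assumed to be passed in directly.

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>
#include <linux/if_ether.h>

/* Illustrative sketch, not the driver's actual code. */
static int rx_frame_sketch(struct napi_struct *napi, struct net_device *ndev,
                           struct sk_buff *skb, u64 hw_status)
{
    /* 1. check the I/O status reported by the hardware */
    if (unlikely(hw_status)) {
        dev_kfree_skb_any(skb);
        ndev->stats.rx_errors++;
        return -EIO;
    }

    /* 2. strip the 4-byte Ethernet FCS left at the end of the buffer */
    skb_trim(skb, skb->len - ETH_FCS_LEN);

    /* 3. skip software checksum validation when the NIC already did it */
    skb->ip_summed = CHECKSUM_UNNECESSARY;

    /* 4. update RX counters */
    ndev->stats.rx_packets++;
    ndev->stats.rx_bytes += skb->len;

    /* 5. hand the buffer to the upper layers through GRO */
    skb->protocol = eth_type_trans(skb, ndev);
    napi_gro_receive(napi, skb);
    return 0;
}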

By invoking the GRO receive function, the driver makes use of the Generic Receive Offload capabilities provided by the kernel, which allow merging TCP segments into a single sk_buff. GRO is the receive counterpart of tcp-segmentation-offload, a feature of ethtool-enabled hardware that performs hardware segmentation of outgoing TCP segments. Both on the receive and the transmit side, segmentation allows fewer sk_buffs to be sent through the network stack, with a significant increase in performance, while still transmitting on the wire chunks of data sized so that they can be easily handled by routers, switches, etc. The following is a brief example of the initial control path for incoming frames, obtained with ftrace.

 2)               |  xgene_enet_napi() {
 2)               |    xgene_enet_process_ring() {
 2)               |      xgene_enet_rx_frame() {
 2)   0.220 us    |        __swiotlb_unmap_page();
 2)   0.120 us    |        skb_put();
 2)   0.220 us    |        eth_type_trans();
 2)               |        napi_gro_receive() {
 2)   0.140 us    |          skb_gro_reset_offset();
 2)               |          dev_gro_receive() {
 2)               |            inet_gro_receive() {
 2)               |              tcp4_gro_receive() {
 2)   0.240 us    |                tcp_gro_receive();
 2)   1.440 us    |              }
 2)   2.740 us    |            }
 2)   4.180 us    |          }
 2)               |          netif_receive_skb_internal() {
[...]

As already mentioned, the DMAable addresses that are passed to the hardware ring buffers point directly to the data field of the sk_buffs. A further hypothesis I wanted to validate was whether corruption was happening after data had been DMAed to memory and was subsequently read back, due to faulty RAM (e.g. flipped bits that ECC checks could not correct).

Validating frames after DMA transfers

In order to check whether xgene_enet_rx_frame was receiving data from system memory that was already corrupted, I wrote some code that would perform the following steps:

  • Trap xgene_enet_rx_frame
  • Calculate the CRC and compare it with the one in the FCS field of the frame
  • Print the physical address of the frame upon detection of a mismatch in order to spot possible patterns or recurrent memory areas.

The implementation is based on a kernel jprobe, a Linux feature that allows assigning a callback to a kernel function, with the capability of inspecting the function’s arguments. At the time of testing, the latest kernel version available (4.7.0) did not yet officially support jprobes for ARM64. Several implementations had already been circulated on the kernel mailing list, the latest one being from Sandeepa Prabhu on the 8th of July 2016 (arm64: Kprobes with single stepping support). This series of 10 patches applied cleanly against kernel 4.6.0 (aka 2dcd0af5), which is the one I used for this experiment. As a side note, I had to disable CONFIG_ARM64_ERRATUM_843419 in the kernel configuration to work around a relocation error (“unsupported RELA”) that was being raised when loading the module.
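
Below is a minimal sketch of what such a probe module can look like. It assumes the probed function receives the receive ring and the raw descriptor as its first two arguments (as in the 4.6 driver) and leaves the driver-specific recovery of the sk_buff and frame length as placeholders. The FCS convention used for the comparison (reflected CRC32 with final inversion, read as a little-endian word) would also have to be double-checked against what the NIC actually leaves in the buffer.

#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/skbuff.h>
#include <linux/crc32.h>
#include <linux/if_ether.h>
#include <linux/io.h>
#include <asm/unaligned.h>

/* Placeholder helpers: recovering the sk_buff and the frame length from
 * the ring/descriptor depends on the driver's internal layout. */
static struct sk_buff *desc_to_skb(void *rx_ring, void *raw_desc)
{
    return NULL;
}

static unsigned int desc_to_len(void *raw_desc)
{
    return 0;
}

/* Entry handler: same argument layout as the probed function. */
static int jrx_frame(void *rx_ring, void *raw_desc)
{
    struct sk_buff *skb = desc_to_skb(rx_ring, raw_desc);
    unsigned int len = desc_to_len(raw_desc);    /* length incl. FCS */

    if (skb && len > ETH_FCS_LEN) {
        u32 fcs = get_unaligned_le32(skb->data + len - ETH_FCS_LEN);
        u32 crc = ~crc32_le(~0, skb->data, len - ETH_FCS_LEN);

        if (crc != fcs) {
            phys_addr_t phys = virt_to_phys(skb->data);

            pr_warn("Calculated CRC is %x, CRC in frame is %x, phys: %pa\n",
                    crc, fcs, &phys);
        }
    }

    jprobe_return();    /* mandatory for jprobe entry handlers */
    return 0;
}

static struct jprobe rx_jprobe = {
    .entry = (void *)jrx_frame,
    .kp = { .symbol_name = "xgene_enet_rx_frame" },
};

static int __init crcprobe_init(void)
{
    return register_jprobe(&rx_jprobe);
}

static void __exit crcprobe_exit(void)
{
    unregister_jprobe(&rx_jprobe);
}

module_init(crcprobe_init);
module_exit(crcprobe_exit);
MODULE_LICENSE("GPL");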

Results from the probe

What immediately stands out when running the jprobe is that the code is definitely not optimized for speed. When loading the kernel module and transferring data over the SFP+ interface, the softirq running the NAPI handler reaches 100% CPU utilization and the throughput drops to a mere ~8 MB/s. Nonetheless, the jprobe does its job: after having transferred around 30 GB of data coming from /dev/zero, there were 69 sk_buffs for which the CRC could not be validated:

[1513584.424677] Calculated CRC is 50b477c,  CRC in frame is 5ccfcebe, phys: 0000008051da2c02, 0000008051da2c02
[1513628.119962] Calculated CRC is fd4e97fc,  CRC in frame is e34ba66c, phys: 000000805e182382, 000000805e182382
[1513656.995813] Calculated CRC is 8c725bf1,  CRC in frame is 12953b1d, phys: 000000804c145682, 000000804c145682
[1513677.473247] Calculated CRC is 22665372,  CRC in frame is 12bfc315, phys: 000000804c7a7002, 000000804c7a7002
[...]
[1513685.219367] Calculated CRC is 4ad9905d,  CRC in frame is 424e3943, phys: 000000804c145f02, 000000804c145f02

Considering that the system is running with 64K pages, the following statistics can be drawn:

  • 29 pages are involved, most of them in the range 0x804c14-0x805f18. This is a span of 0x1304 (~4900) pages, roughly 300 MiB.
  • Among the 29 pages, three are located in a lower memory area (0xe006, 0xe808, 0xec27).
      1 00000000e006          2 00000080563c          2 000000805da1
      1 00000000e808          3 000000805655          2 000000805e09
      1 00000000ec27          3 000000805657          3 000000805e0a
      4 000000804c14          3 00000080565b          1 000000805e0d
      4 000000804c7a          1 00000080565d          7 000000805e18
      4 0000008051da          1 00000080565f          1 000000805e1b
      2 0000008051dd          3 000000805682          2 000000805e1d
      2 000000805204          4 000000805691          1 000000805e1f
      1 000000805626          2 000000805695          2 000000805f18
      2 000000805636          5 000000805887

Now, the only way to rule out any possible memory corruption issue would be an in-depth memory test targeted directly at those memory regions. memtester is actually capable of working directly on a range of physical addresses by making use of /dev/mem, but this is really a bad idea, as the underlying memory pages are very likely to be in use by the kernel. In the best case, the tool would mistakenly report data corruption due to the concurrent activity of other threads; in the worst case, the system would completely freeze. The proper course of action would be to use a bare-metal memory testing utility such as U-Boot's mtest. At this point, however, I decided to halt all debugging activities, as the memory corruption hypothesis seemed rather weak to me and the deadline I had set for coming up with a solution had been reached. It was time to ask the system integrator that supplied the systems for support.

Conclusions

After lengthy discussions with the system integrator, we were advised to try optical transceivers (GBIC) with fiber cables rather than passive Direct Attached Copper. Somewhat to my disappointment, this approach worked: the system ran fine at 10 Gbit/s with no data corruption whatsoever. Being used to passive copper, I unfortunately had never considered this option. The optical transceiver essentially handles the generation of light signals over the fiber: for some reason the PHY embedded on the Gigabyte board does not cope well with copper, and by shifting the signal handling responsibilities to an external module the problem seems to be “patched”. However, all the issues related to the TCP checksum and the Ethernet FCS remain relevant, and for these, to date, I unfortunately have no explanation. The optical transceiver does not add any layer with additional checksumming, so data corruption on the wire would still pose a serious problem.