timing of access to PCI PLX9054 bus interface, DMA versus CPU
Peter Apian-Bennewitz, Feb 2015
Hardware
- PC104+ card, Mesanet Xilinx-FPGA card "4I69" with PLX9054 bus interface to 33MHz 32bit PCI
- CPU: Geode LX, 500MHz
Update March 2022: The PLX9054 is no longer in production at PLX/Broadcom.
An alternative chip with a PCI-to-local-bus interface is difficult to find, as Texas Instruments and Maxim seem to focus on PCIe these days.
Software
- "vanilla" kernel 3.8.8, own kernel driver and FPGA config
Description
PCI is dated on desktops, but still fairly active in embedded systems. A similar bus situation can be expected on PCIe,
where the difference between DMA and non-DMA is probably even larger.
The PLX9054 (product page no longer available, March 2022) is a configurable
interface chip used on PCI cards, bridging the PCI bus to a local bus on the card. The local bus can be either 8, 16 or 32 bit wide,
with separate or multiplexed address and data bus lines.
The PLX9054 handles bus arbitration and timing on the PCI as well as on the local bus side, optionally performs endianness conversion, and offers
a read-ahead FIFO, IRQ generation, mailboxes and DMA control.
This text summarizes a timing comparison between DMA and non-DMA mode. For Linux kernel DMA setup, consult
Documentation/DMA-API-HOWTO.txt and Documentation/PCI/pci.txt in your Linux kernel source.
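A minimal sketch of that setup, assuming a kmalloc()ed destination buffer and a streaming mapping (pdev, buf, BUF_SIZE and the function name are placeholders, not names from the actual driver):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    #define BUF_SIZE 4096   /* one transfer, as in the measurements below */

    /* map a kmalloc()ed buffer for a card-to-memory DMA transfer */
    static int map_read_buffer(struct pci_dev *pdev, void *buf,
                               dma_addr_t *dma_handle)
    {
            /* the PLX9054 is a plain 32bit PCI bus master */
            if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
                    return -EIO;

            *dma_handle = dma_map_single(&pdev->dev, buf, BUF_SIZE,
                                         DMA_FROM_DEVICE);
            if (dma_mapping_error(&pdev->dev, *dma_handle))
                    return -EIO;

            /* *dma_handle is the PCI bus address to program into the
             * card's DMA controller; after the transfer, release with
             * dma_unmap_single(&pdev->dev, *dma_handle, BUF_SIZE,
             *                  DMA_FROM_DEVICE)                      */
            return 0;
    }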
The following diagrams show a read of 4096 bytes in 32bit words from a 32bit local bus to kernel memory. Non-DMA reads use a readl()
in a for-loop, IRQs not disabled. Two configurations of the PLX9054 were used for non-DMA access: the first disables look-ahead FIFO reading from
the local bus and disallows BURST timing, the second enables both. On this board, the local bus uses multiplexed address/data signals on the
32bit lines (J-mode in PLX9054 parlance).
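For reference, the non-DMA read loop is essentially the following sketch (bar and buf are placeholders, not the actual driver's names):

    void __iomem *bar;   /* local bus window, from pci_iomap() */
    u32 *buf;            /* kmalloc()ed destination buffer */
    int i;

    /* 1024 CPU-initiated single-word reads: each readl() becomes
     * one ADS/BLAST cycle on the local bus (see scope traces below) */
    for (i = 0; i < 1024; i++)
            buf[i] = readl(bar + 4 * i);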
Scope screen-dumps show signals on the local bus, top-to-bottom: LCLK (clock on local bus, 50MHz), ADS (address strobe, beginning of a read
cycle, address put on bus), BLAST (end of cycle) and READY (signal to the PLX when data is ready), the latter three active-low.
In BURST mode, each rising edge of LCLK can read a data word at a consecutive address, following the initial setting of the start address.
units: 1MB = 1024*1024 bytes
[scope screen-dumps, three configurations side by side:
no read-ahead, no DMA | with read-ahead, no DMA | with read-ahead, DMA]

overall timing of 1k x 32bit words, 100us/div:
  no read-ahead, no DMA:     680us , 5.7 MB/s
  with read-ahead, no DMA:   420us , 9.3 MB/s
  with read-ahead, DMA:       33us , 118 MB/s

detailed timing, 400ns/div:
[scope screen-dumps, same three configurations]
Top row, left-to-right:
No read-ahead, no DMA: Reading 4096 bytes in 1024 32bit words to RAM takes around 680us. The gap is a random interrupt happening in
between. Around 5.7 MB/s.
With read-ahead, no DMA: The same 4096 bytes are read in around 420us. Another interrupt happens in between. Around 9.3 MB/s.
With read-ahead and DMA: The same 4096 bytes are read in around 33us, roughly 20 times faster than in the first configuration. Around 118 MB/s
(nominal bus capacity for a 33MHz 32bit PCI is 133MB/s).
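As a cross-check, the throughput figures follow directly from transfer size and measured time, e.g. for the first configuration:

    4096 bytes / 680us = 6.0*10^6 bytes/s = 5.7 MB/s   (with 1MB = 1024*1024 bytes)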
Second row, left-to-right:
No read-ahead, no DMA: Each ADS strobe is quickly followed by BLAST: the local bus is active for one 32bit word only. In between, the
local bus and the PCI bus are inactive for long intervals.
With read-ahead, no DMA: The initial read from PCI triggers a BURST read on the local bus (the PLX9054 pre-fetches multiple words from the
local bus, hence the longer time between ADS and BLAST); the following reads from PCI result in one-word reads on the local bus.
Intervals between reads are shorter, since each PCI read except the first is fed immediately from the cache (FIFO) inside the PLX9054, while
the next word is read from the local bus in parallel.
With read-ahead and DMA: At the start of the DMA, a BURST read on the local bus fills the FIFO, followed by shorter BURST periods (same timing on
PCI and local bus as in the previous configuration). Dead intervals between reads are much shorter than in the non-DMA case.
Lessons learned
- The timing on the PCI bus for DMA-initiated transfers is not the same as the timing of CPU-initiated transfers:
DMA is not "same as CPU, but without the CPU". That may have been true with ISA and its one central DMA controller per system.
On PCI, each interface card has its own DMA controller (and other modern buses probably share this feature), and the timings can be very
different from CPU-initiated access cycles.
- Kernel-wise, a transfer by DMA can be reasonable even for fairly small amounts of data (below 1k), not just for multi-megabyte data.
Even if the CPU does a busy wait (udelay()) for the DMA to finish, it may be faster than multiple CPU-initiated readl() calls
(see the sketch after this list).
Naturally, for larger heaps of data, the end of the transfer should be signalled by an interrupt, which adds to the speed-up of DMA over the CPU.
- For smaller amounts of data, the timing seen by a user program may not differ much, since task scheduling adds its own delays.
However, using DMA still saves bus bandwidth, which may be needed elsewhere. If you want to know what is happening in the system, monitor the bus
signals with a trustworthy scope. Timing with a user-space program (or even kernel-space code) will not show the full picture.
- Some questions remain: what the heck is the CPU doing in between kernel readl() calls?
The situation gets even worse with some modern chip-sets (e.g. on Atom boards):
the delay in accessing an address on a PCI card gets even longer.
- After some learning curve, DMA is rather simple and fun.
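As a rough illustration of the busy-wait DMA mentioned above, a block-mode DMA read on the PLX9054 could look like the sketch below. The channel-0 register offsets (DMAPADR0 0x84, DMALADR0 0x88, DMASIZ0 0x8c, DMADPR0 0x90, DMACSR0 0xa8) and bit values are quoted from memory of the PLX9054 data book and should be verified against it; plxbar and LOCAL_SRC_ADDR are placeholders, dma_handle is the bus address from the mapping sketch above, and DMAMODE0 (0x80: bus width, burst, ready input, chaining off) is assumed to be configured already:

    /* program DMA channel 0: PCI address, local bus address,
     * byte count and direction (local bus -> PCI memory)     */
    writel(dma_handle,     plxbar + 0x84);  /* DMAPADR0 */
    writel(LOCAL_SRC_ADDR, plxbar + 0x88);  /* DMALADR0 (placeholder) */
    writel(4096,           plxbar + 0x8c);  /* DMASIZ0, bytes */
    writel(0x08,           plxbar + 0x90);  /* DMADPR0: direction bit, local->PCI */

    writeb(0x03, plxbar + 0xa8);            /* DMACSR0: enable + start channel 0 */

    /* busy wait for the done bit - acceptable for a 33us transfer */
    while (!(readb(plxbar + 0xa8) & 0x10))
            udelay(1);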
Peter Apian-Bennewitz, info[AT]pab-opto.de,
text and images are under the GNU Free Documentation License,
reference to this text appreciated.