timing of access to PCI PLX9054 bus interface, DMA versus CPU
Peter Apian-Bennewitz, Feb 2015
Hardware
- PC104+ card, Mesanet Xilinx-FPGA card "4I69" with PLX9054 bus interface to 33MHz 32bit PCI
- CPU: Geode LX, 500MHz
Update March 2022: The PLX9054 is no longer in production at PLX/Broadcom.
An alternative chip with a PCI-to-local-bus interface is difficult to find, as Texas Instruments and Maxim seem to focus on PCIe these days.
Software
- "vanilla" kernel 3.8.8, own kernel driver and FPGA config
Description
PCI is dated on desktops, but still fairly active in embedded systems. A similar bus situation can be expected on PCIe,
where the difference between DMA and non-DMA is probably even larger.
The PLX9054 (product page no longer available, March 2022) is a configurable
interface chip used on PCI cards, bridging the PCI bus to a local bus on the card. The local bus can be either 8, 16 or 32 bit wide,
with separate or multiplexed address and data bus lines.
The PLX9054 handles bus arbitration and timing on the PCI as well as on the local bus side, optionally performs endianness conversion, and offers
a read-ahead FIFO, IRQ generation, mailboxes and DMA control.
This text summarizes a timing comparison between DMA and non-DMA mode. For Linux kernel DMA setup, consult
Documentation/DMA-API-HOWTO.txt and Documentation/PCI/pci.txt in your Linux kernel source.
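A minimal sketch of that setup, assuming a kmalloc()ed destination buffer and a streaming mapping (pdev, buf, BUF_SIZE and the function name are placeholders, not names from the actual driver):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    #define BUF_SIZE 4096   /* one transfer, as in the measurements below */

    /* map a kmalloc()ed buffer for a card-to-memory DMA transfer */
    static int map_read_buffer(struct pci_dev *pdev, void *buf,
                               dma_addr_t *dma_handle)
    {
            /* the PLX9054 is a plain 32bit PCI bus master */
            if (pci_set_dma_mask(pdev, DMA_BIT_MASK(32)))
                    return -EIO;

            *dma_handle = dma_map_single(&pdev->dev, buf, BUF_SIZE,
                                         DMA_FROM_DEVICE);
            if (dma_mapping_error(&pdev->dev, *dma_handle))
                    return -EIO;

            /* *dma_handle is the PCI bus address to program into the
             * card's DMA controller; after the transfer, release with
             * dma_unmap_single(&pdev->dev, *dma_handle, BUF_SIZE,
             *                  DMA_FROM_DEVICE)                      */
            return 0;
    }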
The following diagrams show a read of 4096 bytes in 32bit words from a 32bit local bus to kernel memory. Non-DMA reads use a readl()
in a for-loop, IRQs not disabled. Two configurations of the PLX9054 were used for non-DMA access: the first disables look-ahead FIFO reading from
the local bus and disallows BURST timing, the second enables both. On this board, the local bus uses multiplexed address/data signals on the
32bit lines (J-mode in PLX9054 parlance).
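For reference, the non-DMA read loop is essentially the following sketch (bar and buf are placeholders, not the actual driver's names):

    void __iomem *bar;   /* local bus window, from pci_iomap() */
    u32 *buf;            /* kmalloc()ed destination buffer */
    int i;

    /* 1024 CPU-initiated single-word reads: each readl() becomes
     * one ADS/BLAST cycle on the local bus (see scope traces below) */
    for (i = 0; i < 1024; i++)
            buf[i] = readl(bar + 4 * i);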
Scope screen-dumps show signals on the local bus, top-to-bottom: LCLK (clock on local bus, 50MHz), ADS (address strobe, beginning of a read
cycle, address put on bus), BLAST (end of cycle) and READY (signal to the PLX when data is ready), the latter three active-low.
In BURST mode, each rising edge of LCLK can read a data word at a consecutive address, following the initial setting of the start address.
units: 1MB = 1024*1024 bytes
[scope screen-dumps, three configurations side by side:
no read-ahead, no DMA | with read-ahead, no DMA | with read-ahead, DMA]

overall timing of 1k x 32bit words, 100us/div:
  no read-ahead, no DMA:     680us , 5.7 MB/s
  with read-ahead, no DMA:   420us , 9.3 MB/s
  with read-ahead, DMA:       33us , 118 MB/s

detailed timing, 400ns/div:
[scope screen-dumps, same three configurations]
Top row, left-to-right:
No read-ahead, no DMA: Reading 4096 bytes in 1024 32bit words to RAM takes around 680us. The gap is a random interrupt happening in
between. Around 5.7 MB/s.
With read-ahead, no DMA: The same 4096 bytes are read in around 420us. Another interrupt happens in between. Around 9.3 MB/s.
With read-ahead and DMA: The same 4096 bytes are read in around 33us, roughly 20 times faster than in the first configuration. Around 118 MB/s
(nominal bus capacity for a 33MHz 32bit PCI is 133MB/s).
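As a cross-check, the throughput figures follow directly from transfer size and measured time, e.g. for the first configuration:

    4096 bytes / 680us = 6.0*10^6 bytes/s = 5.7 MB/s   (with 1MB = 1024*1024 bytes)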
Second row, left-to-right:
No read-ahead, no DMA: Each ADS strobe is quickly followed by BLAST: the local bus is active for one 32bit word only. In between, the
local bus and the PCI bus are inactive for long intervals.
With read-ahead, no DMA: The initial read from PCI triggers a BURST read on the local bus (the PLX9054 pre-fetches multiple words from the
local bus, hence the longer time between ADS and BLAST); the following reads from PCI result in one-word reads on the local bus.
Intervals between reads are shorter, since each PCI read except the first is fed immediately from the cache (FIFO) inside the PLX9054, while
the next word is read from the local bus in parallel.
With read-ahead and DMA: At the start of the DMA, a BURST read on the local bus fills the FIFO, followed by shorter BURST periods (same timing on
PCI and local bus as in the previous configuration). Dead intervals between reads are much shorter than in the non-DMA case.
Lessons learned
- The timing on the PCI bus for DMA-initiated transfers is not the same as the timing of CPU-initiated transfers:
DMA is not "same as CPU, but without the CPU". That may have been true with ISA and its one central DMA controller per system.
On PCI, each interface card has its own DMA controller (and other modern buses probably share this feature), and the timings can be very
different from CPU-initiated access cycles.
- Kernel-wise, a transfer by DMA can be reasonable even for fairly small amounts of data (below 1k), not just for multi-megabyte data.
Even if the CPU does a busy wait (udelay()) for the DMA to finish, it may be faster than multiple CPU-initiated readl() calls
(see the sketch after this list).
Naturally, for larger heaps of data, the end of the transfer should be signalled by an interrupt, which adds to the speed-up of DMA over the CPU.
- For smaller amounts of data, the timing seen by a user program may not differ much, since task scheduling adds its own delays.
However, using DMA still saves bus bandwidth, which may be needed elsewhere. If you want to know what is happening in the system, monitor the bus
signals with a trustworthy scope. Timing with a user-space program (or even kernel-space code) will not show the full picture.
- Some questions remain: what the heck is the CPU doing in between kernel readl() calls?
The situation gets even worse with some modern chip-sets (e.g. on Atom boards):
the delay in accessing an address on a PCI card gets even longer.
- After some learning curve, DMA is rather simple and fun.
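As a rough illustration of the busy-wait DMA mentioned above, a block-mode DMA read on the PLX9054 could look like the sketch below. The channel-0 register offsets (DMAPADR0 0x84, DMALADR0 0x88, DMASIZ0 0x8c, DMADPR0 0x90, DMACSR0 0xa8) and bit values are quoted from memory of the PLX9054 data book and should be verified against it; plxbar and LOCAL_SRC_ADDR are placeholders, dma_handle is the bus address from the mapping sketch above, and DMAMODE0 (0x80: bus width, burst, ready input, chaining off) is assumed to be configured already:

    /* program DMA channel 0: PCI address, local bus address,
     * byte count and direction (local bus -> PCI memory)     */
    writel(dma_handle,     plxbar + 0x84);  /* DMAPADR0 */
    writel(LOCAL_SRC_ADDR, plxbar + 0x88);  /* DMALADR0 (placeholder) */
    writel(4096,           plxbar + 0x8c);  /* DMASIZ0, bytes */
    writel(0x08,           plxbar + 0x90);  /* DMADPR0: direction bit, local->PCI */

    writeb(0x03, plxbar + 0xa8);            /* DMACSR0: enable + start channel 0 */

    /* busy wait for the done bit - acceptable for a 33us transfer */
    while (!(readb(plxbar + 0xa8) & 0x10))
            udelay(1);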
Peter Apian-Bennewitz, info[AT]pab-opto.de,
text and images are under the GNU Free Documentation License,
reference to this text appreciated.