Have you ever noticed that if you start copying a second file to a USB disk while another copy is still in progress, the performance of both drops drastically?
A small test
Try this simple test (if you have Linux).
Create a 100 megabyte file:
> dd if=/dev/zero of=100m bs=1048576 count=100
Now create two test scripts:
# sequential_test.sh
cp 100m /media/usb_disk/file1
sync
cp 100m /media/usb_disk/file2
sync
# parallel_test.sh
cp 100m /media/usb_disk/file3 &
cp 100m /media/usb_disk/file4 &
wait   # let both background copies finish, otherwise the timing is cut short
sync
Run the tests:
> time sh sequential_test.sh
> time sh parallel_test.sh
Performance may vary; in my run the sequential test was 3 times as fast.
Depending on fragmentation and your OS, the difference can be much larger than that.
How USB transfers work (Stack level)?
USB is a multi-layer protocol with multiple stages at each layer.
We will examine only the main culprits and ignore the less relevant ones.
Transfer a 4K block from PC to HD:
OS: Determines where on the target disk to put the block
-- Mass storage: A SCSI header is created
(containing information such as size of block (4K), target block location (LBA), direction (write) and others)
-- Mass storage: The Mass-Storage protocol wraps the SCSI header with its own header
(holding pretty much the same info again, don’t ask why)
-- Mass storage: The header from above (31 bytes) is scheduled for transfer
---- USB Core: Figures out the USB address of the device and the endpoint to use
------ USB Host Controller Driver: Schedules an OUT transfer of 31 bytes
------ USB Host Controller HW: Sends OUT token to device
(this tells the device to expect data)
------ USB Host Controller HW: Sends 31-byte packet to device
------ USB Device Controller HW: ACKs the packet
---- USB Host Controller Driver: Discovers the finished transfer
(usually via interrupt and done-list traversal)
-- Mass storage: Schedules the data transfer of 4K
-- *** Same transfer sequence as above ***
-- Mass storage: Schedules a read of 13 bytes (status header from device)
-- *** Same transfer sequence as above ***
OS: Transfer complete
The result is 3 transfers on the USB bus for a single data transfer at the OS level.
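The 31-byte and 13-byte sizes in the sequence above match the command and status wrappers of the USB Mass Storage Bulk-Only Transport. As a rough sketch, the field widths can be added up in shell. The field names in the comments are informal shorthand of mine, not identifiers taken from the text above.

```sh
#!/bin/sh
# Field widths in bytes. The names in these comments are informal shorthand.

# Command wrapper sent to the device:
# signature, tag, data length, flags, LUN, command length, SCSI command block
cbw_size=$(( 4 + 4 + 4 + 1 + 1 + 1 + 16 ))

# Status wrapper read back from the device:
# signature, tag, data residue, status
csw_size=$(( 4 + 4 + 4 + 1 ))

echo "command wrapper: $cbw_size bytes"
echo "status wrapper:  $csw_size bytes"
echo "bus transfers per OS-level write: 3 (command, data, status)"
```

The block location (LBA) and block count live inside the 16-byte SCSI command block, which is why the wrapper ends up repeating some of the same information.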
How USB transfers work (Controller level)?
USB is a master-slave shared bus, so the host synchronizes communication on the bus by issuing a periodic SOF (Start Of Frame) token every 1ms (every 1/8ms in High-speed mode).
This effectively divides each second into 8000 transfer windows (from now on we will speak of High-speed mode only).
As a result, a controller usually cannot add a transfer to the window currently in progress; the transfer will be executed in the next window.
Also, the ‘done-list’ interrupts and processing happen at the end of a window.
Sample 1 byte transfer sequence:
SOF0
* Controller Driver schedules a 1byte transfer
* Controller HW adds transfer to pending list
SOF1
* Controller HW executes Transfer
SOF2
* Controller Driver goes over the done list and notifies upper layer and schedules the next Transfer
From the above example, we can see that trying to sequentially send 1-byte packets (serial-line style) costs 2 windows per byte, which puts us at a practical limit of 8000/2 = 4000 bytes/sec.
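The arithmetic behind that limit can be sketched in a few lines of shell. The function name bytes_per_sec is made up for this illustration, and the "2 frames per transfer" figure is read off the SOF0/SOF1/SOF2 sequence above (the next transfer can only execute two windows after the previous one).

```sh
#!/bin/sh
FRAMES_PER_SEC=8000   # High-speed transfer windows per second

# bytes_per_sec PAYLOAD_BYTES FRAMES_PER_TRANSFER (hypothetical helper)
bytes_per_sec() {
    echo $(( FRAMES_PER_SEC * $1 / $2 ))
}

# One byte per transfer, one transfer every 2 windows
bytes_per_sec 1 2
```

Passing a bigger payload with the same 2-window cost shows how quickly the rate grows once each transfer carries more than one byte.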
It does go faster
Obviously USB can work much faster. How?
The maximum size of a single bulk-transfer packet is 512 bytes. A modern host controller can usually push as many as 13 of those per frame if not bothered by other transfers. It can also let the driver schedule large buffer chains, allowing it to fill frames.
This puts us at a possible:
8000 frames * 13 packets * 512 bytes / sec = ~50M/sec
(we managed to reach around 38M/sec in our labs)
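For reference, the same peak figure worked out in shell, using only the numbers quoted above (8000 frames, 13 packets per frame, 512 bytes per packet). The variable names are mine.

```sh
#!/bin/sh
frames_per_sec=8000
packets_per_frame=13
packet_bytes=512

peak=$(( frames_per_sec * packets_per_frame * packet_bytes ))
echo "peak: $peak bytes/sec (about $(( peak / 1048576 )) MiB/sec)"
```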
Back to bulk transfers
As we have seen, the Mass-Storage protocol has 3 stages: Header, Data, Status. With a minimum of 2 frames for each stage, that gets us to 6 frames for a 512-byte transfer.
Theoretical speed: 8000/6 * 512 = ~666K/second
It should be obvious now that making the Data stage as large as possible will greatly speed up our transfer. The larger the Data stage, the smaller the effect of all the other stages.
But even a fairly small size does reasonably well, for example 32K:
Header: 2 frames
Data: (32K / 512) / 13, rounded up, + 1 = 6 frames
Status: 2 frames
= 10 frames per 32K
Effective speed: (8000 / 10) * 32K = 25M/sec
(Only 2 times as slow; that's because we spend 5 of the 10 frames on actual data transfer instead of 10 of 10)
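Here is a small sketch of that frame count as a shell model. It assumes exactly the minimums used above: 2 frames for the header, 2 for the status, 13 packets of 512 bytes per frame, and one extra frame to complete the data stage. The function names are hypothetical, and a real device will usually do worse than this.

```sh
#!/bin/sh
# frames_for_command BYTES
# header (2) + data (full frames, rounded up, plus 1) + status (2)
frames_for_command() {
    packets=$(( ($1 + 511) / 512 ))              # 512-byte packets needed
    data_frames=$(( (packets + 12) / 13 + 1 ))   # 13 packets per frame, rounded up
    echo $(( 2 + data_frames + 2 ))
}

# rate_bytes_per_sec BYTES: payload moved per second at 8000 frames/sec
rate_bytes_per_sec() {
    frames=$(frames_for_command "$1")
    echo $(( 8000 * $1 / frames ))
}

for size in 512 4096 32768 131072; do
    echo "$size bytes: $(frames_for_command "$size") frames," \
         "$(rate_bytes_per_sec "$size") bytes/sec"
done
```

With these assumptions the model gives 6 frames for a 512-byte command and 10 frames for 32K, the same counts as in the text.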
In practice, most device controllers / drivers / hardware take much more time to parse and set up a transfer following a header, and the overhead for a data transfer can rise from the theoretical minimum of 5 frames to 10-20 or higher, making small transfer sizes much more expensive.
And here is where parallelism hits:
Trying to send two data streams at the same time can still work very fast, if the OS is willing to submit them in large chunks. Alas, in most cases, in the name of parallel execution, the transfers might get chopped down into smaller blocks.
Granted, for normal magnetic disks the seek times can be unavoidable in a parallel copy. But the real performance hit is usually with flash devices: while they have no seek times and should be able to handle parallel transfers very fast, they are slow to set up the SCSI/Mass-Storage handling, resulting in very poor performance on small transfer blocks.
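To get a feel for how much the chunk size matters, here is a rough estimate in shell of how long it takes to move the two 100 megabyte test files. All the inputs are assumptions for illustration only: 64K chunks for a copy left alone, 4K chunks for a chopped-up parallel copy, and a per-command overhead of 10 frames (the low end of the 10-20 range mentioned earlier). The function name copy_seconds is hypothetical.

```sh
#!/bin/sh
# copy_seconds TOTAL_BYTES CHUNK_BYTES OVERHEAD_FRAMES
# Estimated whole seconds to move TOTAL_BYTES in CHUNK_BYTES commands,
# at 8000 frames/sec and 13 packets of 512 bytes per frame.
copy_seconds() {
    packets=$(( ($2 + 511) / 512 ))
    data_frames=$(( (packets + 12) / 13 ))        # frames that carry data
    frames_per_chunk=$(( data_frames + $3 ))      # plus assumed overhead
    chunks=$(( ($1 + $2 - 1) / $2 ))
    echo $(( chunks * frames_per_chunk / 8000 ))
}

total=$(( 2 * 100 * 1048576 ))   # two 100 megabyte files

echo "64K chunks: $(copy_seconds "$total" 65536 10) s"
echo "4K chunks:  $(copy_seconds "$total" 4096 10) s"
```

Under these assumed numbers the 4K case comes out several times slower than the 64K case, even though the same amount of data crosses the bus. The real ratio depends on what chunk sizes your OS actually submits and how slow your device is at setting up each command.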