
[crossbow-discuss] transmit side hashing for scaling?

Hi,
Can somebody shed some light on how crossbow hashes outgoing
packets to different transmit rings (not ring groups)?
My 10GbE driver has multiple rings (and a single group). Each
transmit ring shares an interrupt with a corresponding receive
ring. We call a set of 1 TX ring, 1 RX ring, and interrupt handler
state a "slice". Transmit completions are handled from the interrupt
handler.
On OSes which support multiple transmit routes,
we've found that ensuring that a particular connection is always
hashed to the same slice by the host and the NIC helps quite a bit
with performance (improves CPU locality, reduces cache misses, decreases
power consumption).
Some OSes (like FreeBSD) allow a driver to assist in tagging a
connection so as to ensure that it is easy to hash
traffic for the same connection into the same slice in the host
and the NIC. Others (Linux, S10) allow the driver to hash the
outgoing packets to provide this locality.
So.. Where is the transmit hashing done in crossbow? Is it tunable?
Is there a hook where I can provide a hash routine (like Linux)?
Can I tag packets (like FreeBSD)? Is it at least something standard
like Toeplitz?
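(To make "hash to the same slice" concrete: the driver-side version of this
is roughly the sketch below. The names and the trivial hash are placeholders
for illustration only, not code from any particular driver; all that matters
is that the same 4-tuple always yields the same slice.)

    /*
     * Placeholder sketch: map a connection's 4-tuple to a TX slice so the
     * same connection always lands on the same slice.  nslices is assumed
     * to be a power of two.
     */
    static uint32_t
    conn_hash(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
    {
            uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport);
            h ^= h >> 16;                   /* fold the halves together */
            return (h);
    }
    #define TX_SLICE(hash, nslices)         ((hash) & ((nslices) - 1))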
Drew

Nitin Hande
If your driver has advertised multiple tx rings, then look for
mac_tx_fanout_mode(), which in turn computes the hash on the fanout hint
passed from ip. Providing hooks for additional hash routines has been
suggested.
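(Conceptually the fanout reduces to something like the sketch below; this is
an illustration of the idea only, not the actual mac code, and
HASH_HINT/tx_rings are made-up names.)

    /* Sketch only: the fanout hint passed down from ip is hashed and the
     * result selects one of the Tx rings the driver advertised. */
    idx = HASH_HINT(fanout_hint) % tx_ring_count;
    ring = tx_rings[idx];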
Nitin

Andrew Gallatin
2009-May-13 17:27 UTC
I guess my best bet might be to lie, and say I have only one TX
ring, then fan out things myself, like I used to before Crossbow.
Is there any non-obvious disadvantage to that?

When looking at this, I noticed mac_tx_serializer_mode(). Am I reading
this right, in that it serializes a single queue? That seems lacking,
compared to the nxge_serialize stuff it replaces.
Drew

rajagopal kunhappan
2009-May-13 18:07 UTC

I don't know if this is non-obvious but exposing Tx rings to the mac
layer would help in:
1) virtualization like VNICs getting their own Tx rings.
2) Flow control when multiple Tx rings are present (presently done
only for UDP). If a Tx ring is blocked (out of desc), then only the
conn_t's sending on that Tx ring get blocked.

We want all the NICs to expose their Tx rings to the MAC layer.
> When looking at this, I noticed mac_tx_serializer_mode(). Am I reading
> this right, in that it serializes a single queue? That seems lacking,
> compared to the nxge_serialize stuff it replaces.

It is a generic solution for NICs that do not have good locking on the
Tx side. mac_tx_serializer_mode() is used when you have a single Tx
ring. nxge would not use that mode. It exposes multiple Tx rings. When
multiple Tx rings are present, mac_tx_fanout_mode() is used.
mac_tx_fanout_mode() can operate in serialized mode also, in which case
there would be a serializer (soft ring) for each Tx ring. Nxge uses that
mode.
-krgopi

Nitin Hande
2009-May-13 18:33 UTC
If you advertise a single ring, then the tx path will end up in
mac_tx_single_ring_mode(), the way it does for an e1000g driver. I
think in that case the entry point in the driver is through the older
xxx_m_tx(); you may have to pay attention to that in your driver. There
could be slight variance in both the schemes. In case of
single_ring_mode(), if you get backpressured from the driver on the tx
side due to lack of descriptors, then packets will be enqueued at the tx
srs. At that point, if there are multiple threads trying to send
additional packets, all the packets will end up getting queued, whereas
there will be only one worker thread trying to clear up the queue
build-up. At high packet rates it is difficult for this one thread to
catch up (additionally, look at the MAC_DROP_ON_NO_DESC flag in
mac_tx_srs_no_desc(), which can drop the packets rather than queuing).
Versus in mac_tx_fanout_mode(), each tx ring gets its own softring in
case of backpressure and its own worker thread.
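(A conceptual sketch of that backpressure choice, not the actual mac code;
enqueue_on_tx_srs() and the local names are made up, only MAC_DROP_ON_NO_DESC
and freemsg() are real:)

    /* Sketch: what happens when the driver reports it is out of Tx
     * descriptors. */
    if (out_of_descriptors) {
            if (flags & MAC_DROP_ON_NO_DESC)
                    freemsg(mp);            /* drop instead of queueing */
            else
                    enqueue_on_tx_srs(mp);  /* a single worker drains later */
    }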
> When looking at this, I noticed mac_tx_serializer_mode(). Am I reading
> this right, in that it serializes a single queue? That seems lacking,
> compared to the nxge_serialize stuff it replaces.

Yes. This part was done for nxge and as far as I remember recent
performance of this scheme was very close to that of the previous
scheme. I think Gopi can comment more on this. What part do you think
is missing here?
Nitin

Andrew Gallatin
2009-May-14 16:51 UTC
> Yes. This part was done for nxge and as far as I remember recent
> performance of this scheme was very close to that of the previous
> scheme. I think Gopi can comment more on this. What part do you think
> is missing here?

Perhaps I'm missing something.. Doesn't nxge support multiple TX rings?

If so, does the existing serialization serialize all traffic to a
single ring, or is mac_tx_serializer_mode() applied after
mac_tx_fanout_mode()?

I had thought the original nxge serializer serialized each TX ring
separately in nxge. The fork I made of it for myri10ge certainly
works that way.
Drew

rajagopal kunhappan
2009-May-14 17:20 UTC
> Perhaps I'm missing something.. Doesn't nxge support
> multiple TX rings?

Yes.

> If so, does the existing serialization serialize all traffic to a
> single ring, or is mac_tx_serializer_mode() applied after
> mac_tx_fanout_mode()?

Neither.

Cutting and pasting from my previous reply:
...
mac_tx_serializer_mode() is used when you have a single Tx ring. nxge
would not use that mode. It exposes multiple Tx rings. When multiple Tx
rings are present, mac_tx_fanout_mode() is used. mac_tx_fanout_mode()
can operate in serialized mode also, in which case there would be a
serializer (soft ring) for each Tx ring. Nxge uses that mode.

Nicolas Droux
2009-May-14 17:59 UTC
The serializer is only for use by the nxge driver which has an
inefficient TX path locking implementation. We didn't have the
resources to completely rewrite the nxge transmit path as part of the
Crossbow project, so we moved the serialization implementation in MAC
for that driver. The serializer in MAC does serialization on a per-ring
basis. The serializer should not be used by any other driver.

You don't have to use the serializer to support multiple TX rings.
Keep your TX path lean and mean, apply good design principles, e.g.
avoid holding locks for too long on your data-path, and you should be
fine.

Nicolas.
nicolas.droux at sun.com - http://blogs.sun.com/droux

Andrew Gallatin
2009-May-14 18:34 UTC
krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used when
you have a single Tx ring. nxge would not use that mode". So I'm
confused. From the source, it looks like nxge is using that mode
(MAC_VIRT_SERIALIZE |'ed into mi_v12n_level). So I guess it is
restricted to using only one of its hw tx rings, then?

> You don't have to use the serializer to support multiple TX rings. Keep
> your TX path lean and mean, apply good design principles, e.g. avoid
> holding locks for too long on your data-path, and you should be fine.

FWIW, my tx path is very "lean and mean". The only time locks are held
is when writing the tx descriptors to the NIC, and when allocating a dma
handle from a pre-allocated per-ring pool.

I thought the serializer was silly too, but PAE claimed a speedup from
it. I think that PAE claimed the speedup came from never
back-pressuring the stack when the host overran the NIC. One of the
"features" of the serializer was to always block the calling thread if
the tx queue was exhausted.

Have you done any packets-per-second benchmarks with your fanout code?
I'm concerned that it is very cache unfriendly if you have a long run
of packets all going to the same destination. This is because you walk
the mblk chain, reading the packet headers, and queue up a big chain. If
the chain gets too long, the mblk and/or the packet headers will be
pushed out of cache by the time they make it to the driver's xmit
routine. So in this case you could have twice as many cache misses as
normal when things get really backed up.

Last, you (or somebody) mentioned there was interest in adding a hook
for a driver to do fanout. Is there a bugid or something for this?

Drew

rajagopal kunhappan
Hi Andrew,

I think the confusion is in the name. mac_tx_serializer_mode() is used
when you have a single ring. nxge exposes multiple rings. When multiple
Tx rings are present, mac_tx_fanout_mode() gets called. In this mode,
each Tx ring will have a soft ring associated with it. The soft rings
themselves are stored in the Tx SRS. The packets coming into
mac_tx_fanout_mode() will get fanned out to one of the Tx soft rings and
mac_tx_soft_ring_process() gets called. mac_tx_soft_ring_process() can
either queue up the packets or send directly to the NIC. In the case of
nxge, packets get queued up in the soft ring and the soft ring worker
thread sends them to the NIC.

Hope this clarifies.

With nxge we get line rate with MTU sized packets with 8 Tx rings. The
numbers are similar to what it was with the nxge serializer in place.

> if you have a long run of packets all going to the same
> destination. This is because you walk the mblk chain, reading
> the packet headers and queue up a big chain. If the chain
> gets too long, the mblk and/or the packet headers will be
> pushed out of cache by the time they make it to the driver's
> xmit routine. So in this case you could have twice as many
> cache misses as normal when things get really backed up.

We would like to have the drivers operate in non-serialized mode. But if
for whatever reason you want to use serialized mode, and there are
issues, we can look into that.

Thanks,
-krgopi
--

Andrew Gallatin
The only reason I care about the serializer is the pre-crossbow
feedback from PAE that the original serializer avoided putting
backpressure on the stack when the TX rings fill up. I'm happy using
the normal fanout (with some caveats below) as long as PAE doesn't
complain about it later.

The caveats being that I want a fanout mode that uses a standard
Toeplitz hash so as to maintain CPU locality. Or a hook so I can
implement my own tx side hashing.
Drew

rajagopal kunhappan
2009-May-14 21:38 UTC
> The only reason I care about the serializer is the pre-crossbow
> feedback from PAE that the original serializer avoided
> putting backpressure on the stack when the TX rings fill up.

Yes, pre-crossbow, putting back pressure would mean queueing up packets
in DLD. Thus all packets get queued in DLD until the driver relieved the
flow control. By then thousands of packets would be sitting in DLD (this
is because TCP does not check for the STREAMS QFULL condition on the DLD
write queue and keeps on sending packets). After flow control is
relieved, the queued up packets are drained by dld_wsrv() (in single
threaded mode). A single thread is no good on a 10gig link and thus
caused performance issues.

> I'm happy using the normal fanout (with some caveats below) as
> long as PAE doesn't complain about it later.
>
> The caveats being that I want a fanout mode that uses a
> standard Toeplitz hash so as to maintain CPU locality.

I am curious as to how you maintain CPU locality for Tx traffic. Can you
give some details?

On the Solaris stack, if you have a bunch of, say, TCP connections
sending traffic, they can come from any CPU on the system. By this I
mean what CPU an application runs on is completely random unless you do
CPU binding.

I can see tying Rx traffic to a specific Rx ring and CPU. If it is a
forwarding case, then one can tie an Rx ring to a Tx ring.

Thanks,
-krgopi

Andrew Gallatin
2009-May-15 13:32 UTC
And crossbow addresses this?
>> I'm happy using the normal fanout (with some caveats below) as
>> long as PAE doesn't complain about it later.
>>
>> The caveats being that I want a fanout mode that uses a
>> standard Toeplitz hash so as to maintain CPU locality.
>
> I am curious as to how you maintain CPU locality for Tx traffic. Can you
> give some details?
>
> On the Solaris stack, if you have a bunch of, say, TCP connections
> sending traffic, they can come from any CPU on the system. By this I
> mean what CPU an application runs on is completely random unless you do
> CPU binding.
>
> I can see tying Rx traffic to a specific Rx ring and CPU. If it is a
> forwarding case, then one can tie an Rx ring to a Tx ring.

On OSes other than Windows, this helps mainly on the TCP receive side,
in that acks will flow out the same CPU that handled the receive
(assuming a direct dispatch from ISR through to the TCP/IP stack).

AFAIK, only Windows can really control affinity to a fine level, since
they require a Toeplitz hash, and you must provide hooks to use their
"key" and to update your indirection table. This means they can control
affinities for connections (or at least sets of connections) and update
them on the fly to match the application's affinity. According to our
Windows guy, they really use this stuff.

But all of this depends on the OS and the NIC agreeing on the hash.
Is there any reason (patents? complexity? perception that the Windows
solution is inferior?) that crossbow does not try to take the Windows
approach? Essentially all NICs available today that support multiple RX
queues also support all this other stuff that Windows requires. Why not
take advantage of it?
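(For reference, the Toeplitz hash itself is simple enough to sketch. This is
just an illustration of the algorithm over an input byte string and the RSS
secret key, not code from anyone's driver:)

    /* Toeplitz hash sketch: for every set bit of the input, XOR in the
     * 32 key bits that start at that bit position.  'key' must be at
     * least len + 4 bytes long (e.g. the 40-byte RSS key covers the
     * usual 4-tuple inputs). */
    static uint32_t
    toeplitz_hash(const uint8_t *key, const uint8_t *input, int len)
    {
            uint32_t hash = 0;
            uint32_t window = ((uint32_t)key[0] << 24) |
                ((uint32_t)key[1] << 16) | ((uint32_t)key[2] << 8) | key[3];
            int i, b;

            for (i = 0; i < len; i++) {
                    for (b = 7; b >= 0; b--) {
                            if (input[i] & (1 << b))
                                    hash ^= window;
                            /* slide the 32-bit key window left one bit */
                            window = (window << 1) | ((key[i + 4] >> b) & 1);
                    }
            }
            return (hash);
    }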
Drew

Paul Durrant
2009-May-15 13:43 UTC
Quite. I asked this question well over a year ago and never got an
answer.

Windows receive-side scaling works well and, because the hash is cached
in the stack's connection data structure and passed down the TX side,
no s/w calculation of the hash is required.
Paul

Andrew Gallatin
2009-May-15 13:55 UTC
And if your goal is to just avoid tx hashing, FreeBSD does that now.
It has no fine grained affinity control, but it does cache the hash in
the stack, so no expensive tx hashing is required.

The changes in FreeBSD are trivial, compared to TX hashing. See for
example:

http://svn.freebsd.org/viewvc/base?view=revision&revision=190880

After this, packets come from the stack with m->m_pkthdr.flowid set,
and all you need to do is mask based on the flowid to pick a tx queue.
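(In the driver's transmit routine that boils down to something like this
sketch; the queue count is assumed to be a power of two, and sc/tx_queues
are made-up names:)

    /* Sketch: m is the mbuf handed down by the stack; pick a TX queue
     * from the flowid it already carries instead of hashing headers in
     * the driver.  num_tx_queues is assumed to be a power of two. */
    if (m->m_pkthdr.flowid != 0)
            txq = &sc->tx_queues[m->m_pkthdr.flowid & (sc->num_tx_queues - 1)];
    else
            txq = &sc->tx_queues[0];   /* no flowid, fall back to queue 0 */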
Drew

Nicolas Droux
2009-May-15 18:21 UTC
We currently do a hashing on the connection structure address, which is
not as expensive as parsing and hashing the headers on a per-packet
basis. But still, we are already working on changes that will allow the
TX ring to be selected by the MAC layer through a handle, which will
avoid that remaining hash operation.
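(Conceptually the current selection is along the lines of the sketch below;
this is illustrative only, not the actual MAC code, and connp/tx_ring_count
are made-up names:)

    /* Sketch: hash the connection structure's address instead of parsing
     * packet headers, so the same connection maps to the same Tx ring
     * with no per-packet header work. */
    uintptr_t a = (uintptr_t)connp >> 8;    /* drop low alignment bits */
    uint_t idx = a % tx_ring_count;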
Nicolas.

-Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux

rajagopal kunhappan
2009-May-15 18:59 UTC
> And crossbow addresses this?

Yes. Each Tx ring has its own queue (soft ring). If a Tx ring is flow
controlled, the packets get backed up in the soft ring associated with
that Tx ring. Thus other Tx rings can continue to send out traffic.

> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the Windows
> solution is inferior?) that crossbow does not try to take the Windows
> approach? Essentially all NICs available today that support multiple RX
> queues also support all this other stuff that Windows requires. Why not
> take advantage of it?

We take advantage of this, though not through a Toeplitz hash.

There are some things that are missing, like being able to retarget an
MSI-X interrupt to a CPU of our choice. Work is underway to have APIs to
do this. Once we have this, we can have the poll thread run on the same
CPU as the MSI-X interrupt that is associated with an Rx ring. We can
further align other threads that take part in processing the incoming Rx
traffic to use CPUs that belong to the same socket (same socket meaning
CPUs sharing a common L2 cache).

-krgopi
--
