Commit graph

192 commits

Author SHA1 Message Date
Jason A. Donenfeld c69481f1b3 device: disable waitpool tests
This code is stable, and the test is finicky, especially on high core
count systems, so just disable it.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-22 15:26:47 +01:00
Jason A. Donenfeld 8bf4204d2e global: stop using ioutil
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-17 22:19:27 +01:00
Jason A. Donenfeld 4e439ea10e conn: bump to 1.16 and get rid of NetErrClosed hack
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-16 21:05:25 +01:00
Jason A. Donenfeld c7b7998619 device: remove old version file
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-12 17:59:50 +01:00
Jason A. Donenfeld 75e6d810ed device: use container/list instead of open coding it
This linked list implementation is awful, but maybe Go 2 will help
eventually, and at least we're not open coding the hlist any more.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 18:19:11 +01:00
Jason A. Donenfeld 747f5440bc device: retry Up() in up/down test
We're loosing our ownership of the port when bringing the device down,
which means another test process could reclaim it. Avoid this by
retrying for 4 seconds.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 01:01:37 +01:00
Jason A. Donenfeld 484a9fd324 device: flush peer queues before starting device
In case some old packets snuck in there before, this flushes before
starting afresh.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 00:39:28 +01:00
Jason A. Donenfeld 5bf8d73127 device: create peer queues at peer creation time
Rather than racing with Start(), since we're never destroying these
queues, we just set the variables at creation time.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 00:21:12 +01:00
Jason A. Donenfeld 587a2b2a20 device: return error from Up() and Down()
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 00:12:23 +01:00
Jason A. Donenfeld 6f08a10041 rwcancel: add an explicit close call
This lets us collect FDs even if the GC doesn't do it for us.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 20:19:14 +01:00
Jason A. Donenfeld da32fe328b device: handshake routine writes into encryption queue
Since RoutineHandshake calls peer.SendKeepalive(), it potentially is a
writer into the encryption queue, so we need to bump the wg count.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 19:26:45 +01:00
Josh Bleecher Snyder 4eab21a7b7 device: make RoutineReadFromTUN keep encryption queue alive
RoutineReadFromTUN can trigger a call to SendStagedPackets.
SendStagedPackets attempts to protect against sending
on the encryption queue by checking peer.isRunning and device.isClosed.
However, those are subject to TOCTOU bugs.

If that happens, we get this:

goroutine 1254 [running]:
golang.zx2c4.com/wireguard/device.(*Peer).SendStagedPackets(0xc000798300)
        .../wireguard-go/device/send.go:321 +0x125
golang.zx2c4.com/wireguard/device.(*Device).RoutineReadFromTUN(0xc000014780)
        .../wireguard-go/device/send.go:271 +0x21c
created by golang.zx2c4.com/wireguard/device.NewDevice
        .../wireguard-go/device/device.go:315 +0x298

Fix this with a simple, big hammer: Keep the encryption queue
alive as long as it might be written to.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 09:53:00 -08:00
Josh Bleecher Snyder 78ebce6932 device: only allocate peer queues once
This serves two purposes.

First, it makes repeatedly stopping then starting a peer cheaper.
Second, it prevents a data race observed accessing the queues.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:33:48 +01:00
Josh Bleecher Snyder cae090d116 device: clarify device.state.state docs (again)
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:29:01 +01:00
Josh Bleecher Snyder 465261310b device: run fewer iterations in TestUpDown
The high iteration count was useful when TestUpDown
was the nexus of new bugs to investigate.

Now that it has stabilized, that's less valuable.
And it slows down running the tests and crowds out other tests.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:59 +01:00
Josh Bleecher Snyder d117d42ae7 device: run fewer trials in TestWaitPool when race detector enabled
On a many-core machine with the race detector enabled,
this test can take several minutes to complete.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:58 +01:00
Josh Bleecher Snyder ecceaadd16 device: remove nil elem check in finalizers
This is not necessary, and removing it speeds up detection of UAF bugs.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:55 +01:00
Jason A. Donenfeld 9e728c2eb0 device: rename unsafeRemovePeer to removePeerLocked
This matches the new naming scheme of upLocked and downLocked.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 16:11:33 +01:00
Jason A. Donenfeld eaf664e4e9 device: remove deviceStateNew
It's never used and we won't have a use for it. Also, move to go-running
stringer, for those without GOPATHs.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:39:19 +01:00
Jason A. Donenfeld a816e8511e device: fix comment typo and shorten state.mu.Lock to state.Lock
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 02138f1f81 device: fix typo in comment
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld d7bc7508e5 device: fix alignment on 32-bit machines and test for it
The test previously checked the offset within a substruct, not the
offset within the allocated struct, so this adds the two together.

It then fixes an alignment crash on 32-bit machines.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld d6e76fdbd6 device: do not log on idempotent device state change
Part of being actually idempotent is that we shouldn't penalize code
that takes advantage of this property with a log splat.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 6ac1240821 device: do not attach finalizer to non-returned object
Before, the code attached a finalizer to an object that wasn't returned,
resulting in immediate garbage collection. Instead return the actual
pointer.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 4b5d15ec2b device: lock elem in autodraining queue before freeing
Without this, we wind up freeing packets that the encryption/decryption
queues still have, resulting in a UaF.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 6548a682a9 device: remove listen port race in tests
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld a60e6dab76 device: generate test keys on the fly
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 00:42:39 +01:00
Josh Bleecher Snyder d8dd1f254f device: remove mutex from Peer send/receive
The immediate motivation for this change is an observed deadlock.

1. A goroutine calls peer.Stop. That calls peer.queue.Lock().
2. Another goroutine is in RoutineSequentialReceiver.
   It receives an elem from peer.queue.inbound.
3. The peer.Stop goroutine calls close(peer.queue.inbound),
   close(peer.queue.outbound), and peer.stopping.Wait().
   It blocks waiting for RoutineSequentialReceiver
   and RoutineSequentialSender to exit.
4. The RoutineSequentialReceiver goroutine calls peer.SendStagedPackets().
   SendStagedPackets attempts peer.queue.RLock().
   That blocks forever because the peer.Stop
   goroutine holds a write lock on that mutex.

A background motivation for this change is that it can be expensive
to have a mutex in the hot code path of RoutineSequential*.

The mutex was necessary to avoid attempting to send elems on a closed channel.
This commit removes that danger by never closing the channel.
Instead, we send a sentinel nil value on the channel to indicate
to the receiver that it should exit.

The only problem with this is that if the receiver exits,
we could write an elem into the channel which would never get received.
If it never gets received, it cannot get returned to the device pools.

To work around this, we use a finalizer. When the channel can be GC'd,
the finalizer drains any remaining elements from the channel and
restores them to the device pool.

After that change, peer.queue.RWMutex no longer makes sense where it is.
It is only used to prevent concurrent calls to Start and Stop.
Move it to a more sensible location and make it a plain sync.Mutex.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 13:02:52 -08:00
Josh Bleecher Snyder 57aadfcb14 device: create channels.go
We have a bunch of stupid channel tricks, and I'm about to add more.
Give them their own file. This commit is 100% code movement.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 12:38:19 -08:00
Josh Bleecher Snyder af408eb940 device: print direction when ping transit fails
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 12:01:08 -08:00
Josh Bleecher Snyder 15810daa22 device: separate timersInit from timersStart
timersInit sets up the timers.
It need only be done once per peer.

timersStart does the work to prepare the timers
for a newly running peer. It needs to be done
every time a peer starts.

Separate the two and call them in the appropriate places.
This prevents data races on the peer's timers fields
when starting and stopping peers.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder d840445e9b device: don't track device interface state in RoutineTUNEventReader
We already track this state elsewhere. No need to duplicate.
The cost of calling changeState is negligible.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 675ff32e6c device: improve MTU change handling
The old code silently accepted negative MTUs.
It also set MTUs above the maximum.
It also had hard to follow deeply nested conditionals.

Add more paranoid handling,
and make the code more straight-line.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 3516ccc1e2 device: remove device.state.stopping from RoutineTUNEventReader
The TUN event reader does three things: Change MTU, device up, and device down.
Changing the MTU after the device is closed does no harm.
Device up and device down don't make sense after the device is closed,
but we can check that condition before proceeding with changeState.
There's thus no reason to block device.Close on RoutineTUNEventReader exiting.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 0bcb822e5b device: overhaul device state management
This commit simplifies device state management.
It creates a single unified state variable and documents its semantics.

It also makes state changes more atomic.
As an example of the sort of bug that occurred due to non-atomic state changes,
the following sequence of events used to occur approximately every 2.5 million test runs:

* RoutineTUNEventReader received an EventDown event.
* It called device.Down, which called device.setUpDown.
* That set device.state.changing, but did not yet attempt to lock device.state.Mutex.
* Test completion called device.Close.
* device.Close locked device.state.Mutex.
* device.Close blocked on a call to device.state.stopping.Wait.
* device.setUpDown then attempted to lock device.state.Mutex and blocked.

Deadlock results. setUpDown cannot progress because device.state.Mutex is locked.
Until setUpDown returns, RoutineTUNEventReader cannot call device.state.stopping.Done.
Until device.state.stopping.Done gets called, device.state.stopping.Wait is blocked.
As long as device.state.stopping.Wait is blocked, device.state.Mutex cannot be unlocked.
This commit fixes that deadlock by holding device.state.mu
when checking that the device is not closed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder da95677203 device: remove unnecessary zeroing in peer.SendKeepalive
elem.packet is always already nil.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:14:17 -08:00
Josh Bleecher Snyder 9c75f58f3d device: remove device.state.stopping from RoutineHandshake
It is no longer necessary.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Josh Bleecher Snyder 84a42aed63 device: remove device.state.stopping from RoutineDecryption
It is no longer necessary, as of 454de6f3e64abd2a7bf9201579cd92eea5280996
(device: use channel close to shut down and drain decryption channel).

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Jason A. Donenfeld 01e176af3c device: take peer handshake when reinitializing last sent handshake
This papers over other unrelated races, unfortunately.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:52:31 +01:00
Josh Bleecher Snyder 91617b4c52 device: fix goroutine leak test
The leak test had rare flakes.
If a system goroutine started at just the wrong moment, you'd get a false positive.
Instead of looping until the goroutines look good and then checking,
exit completely as soon as the number of goroutines looks good.
Also, check more frequently, in an attempt to complete faster.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 17:45:22 +01:00
Jason A. Donenfeld 7258a8973d device: add up/down stress test
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:43:41 +01:00
Jason A. Donenfeld d9d547a3f3 device: pass cfg strings around in tests instead of reader
This makes it easier to tag things onto the end manually for quick hacks.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:29:01 +01:00
Jason A. Donenfeld c3bde5f590 device: benchmark the waitpool to compare it to the prior channels
Here is the old implementation:

    type WaitPool struct {
        c chan interface{}
    }

    func NewWaitPool(max uint32, new func() interface{}) *WaitPool {
        p := &WaitPool{c: make(chan interface{}, max)}
        for i := uint32(0); i < max; i++ {
            p.c <- new()
        }
        return p
    }

    func (p *WaitPool) Get() interface{} {
        return <- p.c
    }

    func (p *WaitPool) Put(x interface{}) {
        p.c <- x
    }

It performs worse than the new one:

    name         old time/op  new time/op  delta
    WaitPool-16  16.4µs ± 5%  15.1µs ± 3%  -7.86%  (p=0.008 n=5+5)

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 16:59:29 +01:00
Josh Bleecher Snyder fd63a233c9 device: test that we do not leak goroutines
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 00:57:57 +01:00
Josh Bleecher Snyder 8a374a35a0 device: tie encryption queue lifetime to the peers that write to it
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 00:57:57 +01:00
Jason A. Donenfeld 4846070322 device: use a waiting sync.Pool instead of a channel
Channels are FIFO which means we have guaranteed cache misses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-02 19:32:13 +01:00
Jason A. Donenfeld a9f80d8c58 device: reduce number of append calls when padding
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 20:10:48 +01:00
Jason A. Donenfeld de51129e33 device: use int64 instead of atomic.Value for time stamp
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:57:03 +01:00
Jason A. Donenfeld beb25cc4fd device: use new model queues for handshakes
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:24:45 +01:00
Jason A. Donenfeld 9263014ed3 device: simplify peer queue locking
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 16:21:53 +01:00