Commit graph

235 commits

Author SHA1 Message Date
Jason A. Donenfeld 5bf8d73127 device: create peer queues at peer creation time
Rather than racing with Start(), since we're never destroying these
queues, we just set the variables at creation time.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 00:21:12 +01:00
Jason A. Donenfeld 587a2b2a20 device: return error from Up() and Down()
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-10 00:12:23 +01:00
Jason A. Donenfeld 6f08a10041 rwcancel: add an explicit close call
This lets us collect FDs even if the GC doesn't do it for us.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 20:19:14 +01:00
Jason A. Donenfeld da32fe328b device: handshake routine writes into encryption queue
Since RoutineHandshake calls peer.SendKeepalive(), it potentially is a
writer into the encryption queue, so we need to bump the wg count.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 19:26:45 +01:00
Josh Bleecher Snyder 4eab21a7b7 device: make RoutineReadFromTUN keep encryption queue alive
RoutineReadFromTUN can trigger a call to SendStagedPackets.
SendStagedPackets attempts to protect against sending
on the encryption queue by checking peer.isRunning and device.isClosed.
However, those are subject to TOCTOU bugs.

If that happens, we get this:

goroutine 1254 [running]:
golang.zx2c4.com/wireguard/device.(*Peer).SendStagedPackets(0xc000798300)
        .../wireguard-go/device/send.go:321 +0x125
golang.zx2c4.com/wireguard/device.(*Device).RoutineReadFromTUN(0xc000014780)
        .../wireguard-go/device/send.go:271 +0x21c
created by golang.zx2c4.com/wireguard/device.NewDevice
        .../wireguard-go/device/device.go:315 +0x298

Fix this with a simple, big hammer: Keep the encryption queue
alive as long as it might be written to.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 09:53:00 -08:00
Josh Bleecher Snyder 78ebce6932 device: only allocate peer queues once
This serves two purposes.

First, it makes repeatedly stopping then starting a peer cheaper.
Second, it prevents a data race observed accessing the queues.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:33:48 +01:00
Josh Bleecher Snyder cae090d116 device: clarify device.state.state docs (again)
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:29:01 +01:00
Josh Bleecher Snyder 465261310b device: run fewer iterations in TestUpDown
The high iteration count was useful when TestUpDown
was the nexus of new bugs to investigate.

Now that it has stabilized, that's less valuable.
And it slows down running the tests and crowds out other tests.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:59 +01:00
Josh Bleecher Snyder d117d42ae7 device: run fewer trials in TestWaitPool when race detector enabled
On a many-core machine with the race detector enabled,
this test can take several minutes to complete.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:58 +01:00
Josh Bleecher Snyder ecceaadd16 device: remove nil elem check in finalizers
This is not necessary, and removing it speeds up detection of UAF bugs.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-09 18:28:55 +01:00
Jason A. Donenfeld 9e728c2eb0 device: rename unsafeRemovePeer to removePeerLocked
This matches the new naming scheme of upLocked and downLocked.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 16:11:33 +01:00
Jason A. Donenfeld eaf664e4e9 device: remove deviceStateNew
It's never used and we won't have a use for it. Also, move to go-running
stringer, for those without GOPATHs.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:39:19 +01:00
Jason A. Donenfeld a816e8511e device: fix comment typo and shorten state.mu.Lock to state.Lock
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 02138f1f81 device: fix typo in comment
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld d7bc7508e5 device: fix alignment on 32-bit machines and test for it
The test previously checked the offset within a substruct, not the
offset within the allocated struct, so this adds the two together.

It then fixes an alignment crash on 32-bit machines.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld d6e76fdbd6 device: do not log on idempotent device state change
Part of being actually idempotent is that we shouldn't penalize code
that takes advantage of this property with a log splat.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 6ac1240821 device: do not attach finalizer to non-returned object
Before, the code attached a finalizer to an object that wasn't returned,
resulting in immediate garbage collection. Instead return the actual
pointer.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 4b5d15ec2b device: lock elem in autodraining queue before freeing
Without this, we wind up freeing packets that the encryption/decryption
queues still have, resulting in a UaF.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld 6548a682a9 device: remove listen port race in tests
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 15:37:04 +01:00
Jason A. Donenfeld a60e6dab76 device: generate test keys on the fly
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-09 00:42:39 +01:00
Josh Bleecher Snyder d8dd1f254f device: remove mutex from Peer send/receive
The immediate motivation for this change is an observed deadlock.

1. A goroutine calls peer.Stop. That calls peer.queue.Lock().
2. Another goroutine is in RoutineSequentialReceiver.
   It receives an elem from peer.queue.inbound.
3. The peer.Stop goroutine calls close(peer.queue.inbound),
   close(peer.queue.outbound), and peer.stopping.Wait().
   It blocks waiting for RoutineSequentialReceiver
   and RoutineSequentialSender to exit.
4. The RoutineSequentialReceiver goroutine calls peer.SendStagedPackets().
   SendStagedPackets attempts peer.queue.RLock().
   That blocks forever because the peer.Stop
   goroutine holds a write lock on that mutex.

A background motivation for this change is that it can be expensive
to have a mutex in the hot code path of RoutineSequential*.

The mutex was necessary to avoid attempting to send elems on a closed channel.
This commit removes that danger by never closing the channel.
Instead, we send a sentinel nil value on the channel to indicate
to the receiver that it should exit.

The only problem with this is that if the receiver exits,
we could write an elem into the channel which would never get received.
If it never gets received, it cannot get returned to the device pools.

To work around this, we use a finalizer. When the channel can be GC'd,
the finalizer drains any remaining elements from the channel and
restores them to the device pool.

After that change, peer.queue.RWMutex no longer makes sense where it is.
It is only used to prevent concurrent calls to Start and Stop.
Move it to a more sensible location and make it a plain sync.Mutex.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 13:02:52 -08:00
Josh Bleecher Snyder 57aadfcb14 device: create channels.go
We have a bunch of stupid channel tricks, and I'm about to add more.
Give them their own file. This commit is 100% code movement.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 12:38:19 -08:00
Josh Bleecher Snyder af408eb940 device: print direction when ping transit fails
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 12:01:08 -08:00
Josh Bleecher Snyder 15810daa22 device: separate timersInit from timersStart
timersInit sets up the timers.
It need only be done once per peer.

timersStart does the work to prepare the timers
for a newly running peer. It needs to be done
every time a peer starts.

Separate the two and call them in the appropriate places.
This prevents data races on the peer's timers fields
when starting and stopping peers.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder d840445e9b device: don't track device interface state in RoutineTUNEventReader
We already track this state elsewhere. No need to duplicate.
The cost of calling changeState is negligible.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 675ff32e6c device: improve MTU change handling
The old code silently accepted negative MTUs.
It also set MTUs above the maximum.
It also had hard to follow deeply nested conditionals.

Add more paranoid handling,
and make the code more straight-line.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 3516ccc1e2 device: remove device.state.stopping from RoutineTUNEventReader
The TUN event reader does three things: Change MTU, device up, and device down.
Changing the MTU after the device is closed does no harm.
Device up and device down don't make sense after the device is closed,
but we can check that condition before proceeding with changeState.
There's thus no reason to block device.Close on RoutineTUNEventReader exiting.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder 0bcb822e5b device: overhaul device state management
This commit simplifies device state management.
It creates a single unified state variable and documents its semantics.

It also makes state changes more atomic.
As an example of the sort of bug that occurred due to non-atomic state changes,
the following sequence of events used to occur approximately every 2.5 million test runs:

* RoutineTUNEventReader received an EventDown event.
* It called device.Down, which called device.setUpDown.
* That set device.state.changing, but did not yet attempt to lock device.state.Mutex.
* Test completion called device.Close.
* device.Close locked device.state.Mutex.
* device.Close blocked on a call to device.state.stopping.Wait.
* device.setUpDown then attempted to lock device.state.Mutex and blocked.

Deadlock results. setUpDown cannot progress because device.state.Mutex is locked.
Until setUpDown returns, RoutineTUNEventReader cannot call device.state.stopping.Done.
Until device.state.stopping.Done gets called, device.state.stopping.Wait is blocked.
As long as device.state.stopping.Wait is blocked, device.state.Mutex cannot be unlocked.
This commit fixes that deadlock by holding device.state.mu
when checking that the device is not closed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:32:07 -08:00
Josh Bleecher Snyder da95677203 device: remove unnecessary zeroing in peer.SendKeepalive
elem.packet is always already nil.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 10:14:17 -08:00
Josh Bleecher Snyder 9c75f58f3d device: remove device.state.stopping from RoutineHandshake
It is no longer necessary.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Josh Bleecher Snyder 84a42aed63 device: remove device.state.stopping from RoutineDecryption
It is no longer necessary, as of 454de6f3e64abd2a7bf9201579cd92eea5280996
(device: use channel close to shut down and drain decryption channel).

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-08 08:18:32 -08:00
Jason A. Donenfeld 01e176af3c device: take peer handshake when reinitializing last sent handshake
This papers over other unrelated races, unfortunately.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:52:31 +01:00
Josh Bleecher Snyder 91617b4c52 device: fix goroutine leak test
The leak test had rare flakes.
If a system goroutine started at just the wrong moment, you'd get a false positive.
Instead of looping until the goroutines look good and then checking,
exit completely as soon as the number of goroutines looks good.
Also, check more frequently, in an attempt to complete faster.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 17:45:22 +01:00
Jason A. Donenfeld 7258a8973d device: add up/down stress test
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:43:41 +01:00
Jason A. Donenfeld d9d547a3f3 device: pass cfg strings around in tests instead of reader
This makes it easier to tag things onto the end manually for quick hacks.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 17:29:01 +01:00
Jason A. Donenfeld c3bde5f590 device: benchmark the waitpool to compare it to the prior channels
Here is the old implementation:

    type WaitPool struct {
        c chan interface{}
    }

    func NewWaitPool(max uint32, new func() interface{}) *WaitPool {
        p := &WaitPool{c: make(chan interface{}, max)}
        for i := uint32(0); i < max; i++ {
            p.c <- new()
        }
        return p
    }

    func (p *WaitPool) Get() interface{} {
        return <- p.c
    }

    func (p *WaitPool) Put(x interface{}) {
        p.c <- x
    }

It performs worse than the new one:

    name         old time/op  new time/op  delta
    WaitPool-16  16.4µs ± 5%  15.1µs ± 3%  -7.86%  (p=0.008 n=5+5)

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-03 16:59:29 +01:00
Josh Bleecher Snyder fd63a233c9 device: test that we do not leak goroutines
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 00:57:57 +01:00
Josh Bleecher Snyder 8a374a35a0 device: tie encryption queue lifetime to the peers that write to it
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-02-03 00:57:57 +01:00
Jason A. Donenfeld 4846070322 device: use a waiting sync.Pool instead of a channel
Channels are FIFO which means we have guaranteed cache misses.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-02 19:32:13 +01:00
Jason A. Donenfeld a9f80d8c58 device: reduce number of append calls when padding
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 20:10:48 +01:00
Jason A. Donenfeld de51129e33 device: use int64 instead of atomic.Value for time stamp
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:57:03 +01:00
Jason A. Donenfeld beb25cc4fd device: use new model queues for handshakes
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 18:24:45 +01:00
Jason A. Donenfeld 9263014ed3 device: simplify peer queue locking
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-29 16:21:53 +01:00
Jason A. Donenfeld f0f27d7fd2 device: reduce nesting when staging packet
Suggested-by: Josh Bleecher Snyder <josh@tailscale.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 18:56:58 +01:00
Jason A. Donenfeld d4112d9096 global: bump copyright
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 17:52:15 +01:00
Jason A. Donenfeld 6a128dde71 device: do not allow get to run while set runs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 15:26:22 +01:00
Jason A. Donenfeld 34c047c762 device: avoid hex allocations in IpcGet
benchmark               old ns/op     new ns/op     delta
BenchmarkUAPIGet-16     2872          2157          -24.90%

benchmark               old allocs     new allocs     delta
BenchmarkUAPIGet-16     30             18             -40.00%

benchmark               old bytes     new bytes     delta
BenchmarkUAPIGet-16     737           256           -65.26%

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 15:22:34 +01:00
Jason A. Donenfeld d4725bc456 device: the psk is not a chapoly key
It's a separate type of key that gets hashed into the chain.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-28 14:45:53 +01:00
Jason A. Donenfeld 1b092ce584 device: get rid of nonce routine
This moves to a simple queue with no routine processing it, to reduce
scheduler pressure.

This splits latency in half!

benchmark                  old ns/op     new ns/op     delta
BenchmarkThroughput-16     2394          2364          -1.25%
BenchmarkLatency-16        259652        120810        -53.47%

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 18:38:27 +01:00
Jason A. Donenfeld ace50a0529 device: avoid deadlock when changing private key and removing self peers
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 15:53:21 +01:00
Jason A. Donenfeld 8cc99631d0 device: use linked list for per-peer allowed-ip traversal
This makes the IpcGet method much faster.

We also refactor the traversal API to use a callback so that we don't
need to allocate at all. Avoiding allocations we do self-masking on
insertion, which in turn means that split intermediate nodes require a
copy of the bits.

benchmark               old ns/op     new ns/op     delta
BenchmarkUAPIGet-16     3243          2659          -18.01%

benchmark               old allocs     new allocs     delta
BenchmarkUAPIGet-16     35             30             -14.29%

benchmark               old bytes     new bytes     delta
BenchmarkUAPIGet-16     1218          737           -39.49%

This benchmark is good, though it's only for a pair of peers, each with
only one allowedips. As this grows, the delta expands considerably.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-27 01:48:58 +01:00
Jason A. Donenfeld d669c78c43 device: combine debug and info log levels into 'verbose'
There are very few cases, if any, in which a user only wants one of
these levels, so combine it into a single level.

While we're at it, reduce indirection on the loggers by using an empty
function rather than a nil function pointer. It's not like we have
retpolines anyway, and we were always calling through a function with a
branch prior, so this seems like a net gain.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-26 23:05:48 +01:00
Josh Bleecher Snyder 7139279cd0 device: change logging interface to use functions
This commit overhauls wireguard-go's logging.

The primary, motivating change is to use a function instead
of a *log.Logger as the basic unit of logging.
Using functions provides a lot more flexibility for
people to bring their own logging system.

It also introduces logging helper methods on Device.
These reduce line noise at the call site.
They also allow for log functions to be nil;
when nil, instead of generating a log line and throwing it away,
we don't bother generating it at all.
This spares allocation and pointless work.

This is a breaking change, although the fix required
of clients is fairly straightforward.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 22:40:20 +01:00
Josh Bleecher Snyder 37efdcaccf device: fix shadowing of err in IpcHandle
The declaration of err in

	nextByte, err := buffered.ReadByte

shadows the declaration of err in

	op, err := buffered.ReadString('\n')

above. As a result, the assignments to err in

	err = ipcErrorf(ipc.IpcErrorInvalid, "trailing character in UAPI get: %c", nextByte)

and in

	err = device.IpcGetOperation(buffered.Writer)

do not modify the correct err variable.

Found by staticcheck.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 22:40:10 +01:00
Josh Bleecher Snyder d3a2b74df2 device: remove extra error arg
Caught by go vet.
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 22:36:10 +01:00
Brad Fitzpatrick 8114c9db5f device: reduce allocs in Device.IpcGetOperation
Plenty more to go, but a start:

name       old time/op    new time/op    delta
UAPIGet-4    6.37µs ± 2%    5.56µs ± 1%  -12.70%  (p=0.000 n=8+8)

name       old alloc/op   new alloc/op   delta
UAPIGet-4    1.98kB ± 0%    1.22kB ± 0%  -38.71%  (p=0.000 n=10+10)

name       old allocs/op  new allocs/op  delta
UAPIGet-4      42.0 ± 0%      35.0 ± 0%  -16.67%  (p=0.000 n=10+10)

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-01-26 11:51:52 -08:00
Josh Bleecher Snyder e6ec3852a9 device: add benchmark for UAPI Device.IpcGetOperation
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-26 11:40:24 -08:00
Jason A. Donenfeld 18e47795e5 device: allow pipelining UAPI requests
The original spec ends with \n\n especially for this reason.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-25 20:48:28 +01:00
Josh Bleecher Snyder cecb41515d device: serialize access to IpcSetOperation
Interleaves IpcSetOperations would spell trouble.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:38:09 -08:00
Josh Bleecher Snyder a9ce4b762c device: simplify handling of IPC set endpoint
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:37:28 -08:00
Josh Bleecher Snyder d8f2cc87ee device: remove close processing fwmark
Also, a behavior change: Stop treating a blank value as 0.
It's not in the spec.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:36:53 -08:00
Josh Bleecher Snyder 2b8665f5f9 device: remove unnecessary comment
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:36:41 -08:00
Josh Bleecher Snyder 674a4675a1 device: introduce new IPC error message for unknown error
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:36:17 -08:00
Josh Bleecher Snyder 87bdcb2ae4 device: correct IPC error number for I/O errors
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:35:48 -08:00
Josh Bleecher Snyder 37a239e736 device: simplify IpcHandle error handling
Unify the handling of unexpected UAPI errors.
The comment that says "should never happen" is incorrect;
this could happen due to I/O errors. Correct it.

Change error message capitalization for consistency.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:09:24 -08:00
Josh Bleecher Snyder 6252de0db9 device: split IpcSetOperation into parts
The goal of this change is to make the structure
of IpcSetOperation easier to follow.

IpcSetOperation contains a small state machine:
It starts by configuring the device,
then shifts to configuring one peer at a time.

Having the code all in one giant method obscured that structure.
Split out the parts into helper functions and encapsulate the peer state.

This makes the overall structure more apparent.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 09:09:24 -08:00
Josh Bleecher Snyder a029b942ae device: expand IPCError
Expand IPCError to contain a wrapped error,
and add a helper to make constructing such errors easier.

Add a defer-based "log on returned error" to IpcSetOperation.
This lets us simplify all of the error return paths.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 08:47:48 -08:00
Josh Bleecher Snyder db3fa1409c device: remove dead code
If device.NewPeer returns a nil error,
then the returned peer is always non-nil.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 08:47:48 -08:00
Josh Bleecher Snyder 675aae2423 device: return errors from ipc scanner
The code as written will drop any read errors on the floor.
Fix that.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-25 08:47:48 -08:00
Jason A. Donenfeld 294d3bedf9 device: allow compiling with Go 1.15
Until we depend on Go 1.16 (which isn't released yet), alias our own
variable to the private member of the net package. This will allow an
easy find replace to make this go away when we eventually switch to
1.16.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-20 20:12:32 +01:00
Josh Bleecher Snyder 86a58b51c0 device: remove unused fields from DummyDatagram and DummyBind
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 20:03:40 +01:00
Josh Bleecher Snyder 6a2ecb581b device: remove unused trie test code
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 20:03:40 +01:00
Josh Bleecher Snyder 7c5d1e355e device: remove unnecessary zeroing
Newly allocated objects are already zeroed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:07 +01:00
Josh Bleecher Snyder a86492a567 device: remove QueueInboundElement.dropped
Now that we block when enqueueing to the decryption queue,
there is only one case in which we "drop" a inbound element,
when decryption fails.

We can use a simple, obvious, sync-free sentinel for that, elem.packet == nil.
Also, we can return the message buffer to the pool slightly later,
which further simplifies the code.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:06 +01:00
Josh Bleecher Snyder 7ee95e053c device: remove QueueOutboundElement.dropped
If we block when enqueuing encryption elements to the queue,
then we never drop them.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:05 +01:00
Josh Bleecher Snyder 23642a13be device: check returned errors from NewPeer in TestNoiseHandshake
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:01 +01:00
Josh Bleecher Snyder 2fe19ce54d device: remove selects from encrypt/decrypt/inbound/outbound enqueuing
Block instead. Backpressure here is fine, probably preferable.
This reduces code complexity.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:57:00 +01:00
Josh Bleecher Snyder 0cc15e7c7c device: put handshake buffer in pool in FlushPacketQueues
This appears to have been an oversight.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:56:59 +01:00
Josh Bleecher Snyder 48c3b87eb8 device: use channel close to shut down and drain decryption channel
This is similar to commit e1fa1cc556,
but for the decryption channel.

It is an alternative fix to f9f655567930a4cd78d40fa4ba0d58503335ae6a.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-20 19:56:54 +01:00
Jason A. Donenfeld ea6c1cd7e6 device: receive: do not exit immediately on transient UDP receive errors
Some users report seeing lines like:

> Routine: receive incoming IPv4 - stopped

Popping up unexpectedly. Let's sleep and try again before failing, and
also log the error, and perhaps we'll eventually understand this
situation better in future versions.

Because we have to distinguish between the socket being closed
explicitly and whatever error this is, we bump the module to require Go
1.16.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-08 14:30:04 +01:00
Jason A. Donenfeld 29b0477585 device: receive: drain decryption queue before exiting RoutineDecryption
It's possible for RoutineSequentialReceiver to try to lock an elem after
RoutineDecryption has exited. Before this meant we didn't then unlock
the elem, so the whole program deadlocked.

As well, it looks like the flush code (which is now potentially
unnecessary?) wasn't properly dropping the buffers for the
not-already-dropped case.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 17:08:41 +01:00
Josh Bleecher Snyder 85b4950579 device: add latency and throughput benchmarks
These obviously don't perfectly capture real world performance,
in which syscalls and network links have a significant impact.
Nevertheless, they capture some of the internal performance factors,
and they're easy and convenient to work with.

Hat tip to Avery Pennarun for help designing the throughput benchmark.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 8a30415555 device: use LogLevelError for benchmarking
This keeps the output minimal and focused on the benchmark results.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder cdaf4e9a76 device: make test infrastructure usable with benchmarks
Switch from *testing.T to testing.TB.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 1481e72107 all: use ++ to increment
Make the code slightly more idiomatic. No functional changes.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder d0f8e9477c device: remove unnecessary zeroing
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder b42e32047d device: call wg.Add outside the goroutine
One of the first rules of WaitGroups is that you call wg.Add
outside of a goroutine, not inside it. Fix this embarrassing mistake.

This prevents an extremely rare race condition (2 per 100,000 runs)
which could occur when attempting to start a new peer
concurrently with shutting down a device.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder b5f966ac24 device: remove QueueInboundElement leak with stopped peers
This is particularly problematic on mobile,
where there is a fixed number of elements.
If most of them leak, it'll impact performance;
if all of them leak, the device will permanently deadlock.

I have a test that detects element leaks, which is how I found this one.
There are some remaining leaks that I have not yet tracked down,
but this is the most prominent by far.

I will commit the test when it passes reliably.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder a1c265b0c5 device: simplify UAPI helper methods
bufio is not required.

strings.Builder is cheaper than bytes.Buffer for constructing strings.

io.Writer is more flexible than io.StringWriter,
and just as cheap (when used with io.WriteString).

Run gofmt.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Jason A. Donenfeld 25b01723dd device: fix alignment of peer stats member
This was shifted by 2 bytes when making persistent keepalive into a u32.
Fix it by placing it after the aligned region.

Fixes: e739ff7 ("device: fix persistent_keepalive_interval data races")
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 14:49:44 +01:00
Jason A. Donenfeld 40dfc85def device: add UAPI helper methods
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 14:49:44 +01:00
Jason A. Donenfeld ad73ee78e9 device: add missing colon to error line
People are actually hitting this condition, so make it uniform. Also,
change a printf into a println, to match the other conventions.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-07 14:49:44 +01:00
Brad Fitzpatrick e9edc16349 device: fix error shadowing before log print
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder f7bbdc31a0 device: fix data race in peer.timersActive
Found by the race detector and existing tests.

To avoid introducing a lock into this hot path,
calculate and cache whether any peers exist.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 70861686d3 device: fix races from changing private_key
Access keypair.sendNonce atomically.
Eliminate one unnecessary initialization to zero.

Mutate handshake.lastSentHandshake with the mutex held.

Co-authored-by: David Anderson <danderson@tailscale.com>
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder c8faa34cde device: always name *Queue*Element variables elem
They're called elem in most places.
Rename a few local variables to make it consistent.
This makes it easier to grep the code for things like elem.Drop.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 2832e96339 device: use channel close to shut down and drain outbound channel
This is a similar treatment to the handling of the encryption
channel found a few commits ago: Use the closing of the channel
to manage goroutine lifetime and shutdown.
It is considerably simpler because there is only a single writer.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 63066ce406 device: fix persistent_keepalive_interval data races
Co-authored-by: David Anderson <danderson@tailscale.com>
Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder e1fa1cc556 device: use channel close to shut down and drain encryption channel
The new test introduced in this commit used to deadlock about 1% of the time.

I believe that the deadlock occurs as follows:

* The test completes, calling device.Close.
* device.Close closes device.signals.stop.
* RoutineEncryption stops.
* The deferred function in RoutineEncryption drains device.queue.encryption.
* RoutineEncryption exits.
* A peer's RoutineNonce processes an element queued in peer.queue.nonce.
* RoutineNonce puts that element into the outbound and encryption queues.
* RoutineSequentialSender reads that elements from the outbound queue.
* It waits for that element to get Unlocked by RoutineEncryption.
* RoutineEncryption has already exited, so RoutineSequentialSender blocks forever.
* device.RemoveAllPeers calls peer.Stop on all peers.
* peer.Stop waits for peer.routines.stopping, which blocks forever.

Rather than attempt to add even more ordering to the already complex
centralized shutdown orchestration, this commit moves towards a
data-flow-oriented shutdown.

The device.queue.encryption gets closed when there will be no more writes to it.
All device.queue.encryption readers always read until the channel is closed and then exit.
We thus guarantee that any element that enters the encryption queue also exits it.
This removes the need for central control of the lifetime of RoutineEncryption,
removes the need to drain the encryption queue on shutdown, and simplifies RoutineEncryption.

This commit also fixes a data race. When RoutineSequentialSender
drains its queue on shutdown, it needs to lock the elem before operating on it,
just as the main body does.

The new test in this commit passed 50k iterations with the race detector enabled
and 150k iterations with the race detector disabled, with no failures.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00
Josh Bleecher Snyder 41cd68416c device: simplify copying counter to nonce
Since we already have it packed into a uint64
in a known byte order, write it back out again
the same byte order instead of copying byte by byte.

This should also generate more efficient code,
because the compiler can do a single uint64 write,
instead of eight bounds checks and eight byte writes.

Due to a missed optimization, it actually generates a mishmash
of smaller writes: 1 byte, 4 bytes, 2 bytes, 1 byte.
This is https://golang.org/issue/41663.
The code is still better than before, and will get better yet
once that compiler bug gets fixed.

Signed-off-by: Josh Bleecher Snyder <josh@tailscale.com>
2021-01-07 14:49:44 +01:00