5. Torflow measurements aggregation and scaling

This documentation is a more extended version than Torflow aggregation and scaling or [BandwidthFileSpec].

5.1. Scaling without PID

From Torflow’s README.spec.txt (section 2.2)

   In this way, the resulting network status consensus bandwidth values
   are effectively re-weighted proportional to how much faster the node
   was as compared to the rest of the network.

The variables and steps used in Torflow:

strm_bw

   The strm_bw field is the average (mean) of all the streams for the relay
   identified by the fingerprint field. 

bw_i = \mu(bw_j) = \frac{\sum_{j=1}^{n}bw_j}{n}

filt_bw

   The filt_bw field is computed similarly, but only the streams equal to
   or greater than the strm_bw are counted in order to filter very slow
   streams due to slow node pairings.

bwfilt_j &= max(\mu(bw_j), bw_j) = max\left(\frac{\sum_{j=1}^{n}bw_j}{n}, bw_j\right) bwfilt_i &= max(\mu(bw_j), bw_j) = \frac{\sum_{j=1}^{n}max(\mu, bw_j)}{n_{bwfiltj}} = \frac{\sum_{j=1}^{n}max\left(\frac{\sum_{j=1}^{n}bw_j}{n}, bw_j\right)}{n_{bwfiltj}}

filt_sbw and strm_sbw

    for rs in RouterStats.query.filter(stats_clause).\
          options(eagerload_all('router.streams.circuit.routers')).all():
      tot_sbw = 0
      sbw_cnt = 0
      for s in rs.router.streams:
        if isinstance(s, ClosedStream):
          skip = False
          #for br in badrouters:
          #  if br != rs:
          #    if br.router in s.circuit.routers:
          #      skip = True
          if not skip:
            # Throw out outliers < mean
            # (too much variance for stddev to filter much)
            if rs.strm_closed == 1 or s.bandwidth() >= rs.sbw:
              tot_sbw += s.bandwidth()
              sbw_cnt += 1

      if sbw_cnt: rs.filt_sbw = tot_sbw/sbw_cnt
      else: rs.filt_sbw = None

filt_avg, and strm_avg

   Once we have determined the most recent measurements for each node, we
   compute an average of the filt_bw fields over all nodes we have measured.
    filt_avg = sum(map(lambda n: n.filt_bw, nodes.itervalues()))/float(len(nodes))
    strm_avg = sum(map(lambda n: n.strm_bw, nodes.itervalues()))/float(len(nodes))

bwfilt &= \mu(bwfilt_i) = \frac{\sum_{i=1}^{n}bwfilt_i}{n} = \frac{\sum_{i=1}^{n}\frac{\sum_{j=1}^{n}max(\mu, bw_j)}{n_{bwfiltj}}}{n} = \frac{\sum_{i=1}^{n}\frac{\sum_{j=1}^{n}max\left(\frac{\sum_{j=1}^{n}bw_j}{n}, bw_j\right)}{n_{bwfiltj}}}{n}

bwstrm &= \mu(bw_i) = \frac{\sum_{i=1}^{n}\mu_i}{n}

true_filt_avg and true_strm_avg

      for cl in ["Guard+Exit", "Guard", "Exit", "Middle"]:
        true_filt_avg[cl] = filt_avg
        true_strm_avg[cl] = strm_avg

In the non-pid case, all types of nodes get the same avg

n.fbw_ratio and n.fsw_ratio

  for n in nodes.itervalues():
    n.fbw_ratio = n.filt_bw/true_filt_avg[n.node_class()]
    n.sbw_ratio = n.strm_bw/true_strm_avg[n.node_class()]

rf_i = \frac{bwfilt_i}{bwfilt} rs_i = \frac{bw_i}{bwstrm}

n.ratio

   These averages are used to produce ratios for each node by dividing the
   measured value for that node by the network average. 
      # Choose the larger between sbw and fbw
      if n.sbw_ratio > n.fbw_ratio:
        n.ratio = n.sbw_ratio
      else:
        n.ratio = n.fbw_ratio

      n.pid_error = 0
      n.pid_error_sum = 0
      n.new_bw = n.desc_bw*n.ratio
      n.pid_bw = n.new_bw # for transition between pid/no-pid

    n.change = n.new_bw - n.desc_bw

    if n.idhex in prev_consensus:
      if prev_consensus[n.idhex].bandwidth != None:
        prev_consensus[n.idhex].measured = True
        tot_net_bw += n.new_bw
      if IGNORE_GUARDS \
           and ("Guard" in prev_consensus[n.idhex].flags and not "Exit" in \
                  prev_consensus[n.idhex].flags):
        plog("INFO", "Skipping voting for guard "+n.nick)
        n.ignore = True
      elif "Authority" in prev_consensus[n.idhex].flags:

r_i = max(rf_i, rs_i) = max\left(\frac{bwfilt_i}{bwfilt}, \frac{bw_i}{bwstrm}\right) = max\left(\frac{bwfilt_i}{\mu(bwfilt_i)}, \frac{bw_i}{\mu(bwfilt_i)}\right)

desc_bw

It is the observed bandwidth in the descriptor, NOT the average bandwidth

    return Router(ns.idhex, ns.nickname, bw_observed, dead, exitpolicy,
        ns.flags, ip, version, os, uptime, published, contact, rate_limited,
        ns.orhash, ns.bandwidth, extra_info_digest, ns.unmeasured)
    self.desc_bw = max(bw,1) # Avoid div by 0

new_bw

   These ratios are then multiplied by the most recent observed descriptor
   bandwidth we have available for each node, to produce a new value for
   the network status consensus process.
      n.new_bw = n.desc_bw*n.ratio

The descriptor observed bandwidth is multiplied by the ratio.

bwnew_i = bwobs_i \times r_i = bwobs_i \times max(rf_i, rs_i) = bwobs_i \times max\left(\frac{bwfilt_i}{bwfilt}, \frac{bw_i}{bwstrm}\right) = bwobs_i \times max\left(\frac{bwfilt_i}{\mu(bwfilt_i)}, \frac{bw_i}{\mu(bw_i)}\right)

Limit the bandwidth to a maximum

NODE_CAP = 0.05
    if n.new_bw > tot_net_bw*NODE_CAP:
      plog("INFO", "Clipping extremely fast "+n.node_class()+" node "+n.idhex+"="+n.nick+
           " at "+str(100*NODE_CAP)+"% of network capacity ("+
           str(n.new_bw)+"->"+str(int(tot_net_bw*NODE_CAP))+") "+
           " pid_error="+str(n.pid_error)+
           " pid_error_sum="+str(n.pid_error_sum))
      n.new_bw = int(tot_net_bw*NODE_CAP)

However, tot_net_bw does not seems to be updated when not using pid. This clipping would make faster relays to all have the same value.

bwn_i =& min\left(bwnew_i, \sum_{i=1}^{n}bwnew_i \times 0.05\right) \ &= min\left( \left(bwobs_i \times r_i\right), \sum_{i=1}^{n}\left(bwobs_i \times r_i\right) \times 0.05\right)\ &= min\left( \left(bwobs_i \times max\left(rf_i, rs_i\right)\right), \sum_{i=1}^{n}\left(bwobs_i \times max\left(rf_i, rs_i\right)\right) \times 0.05\right)\ &= min\left( \left(bwobs_i \times max\left(\frac{bwfilt_i}{bwfilt}, \frac{bw_i}{bwstrm}\right)\right), \sum_{i=1}^{n}\left(bwobs_i \times max\left(\frac{bwfilt_i}{bwfilt}, \frac{bw_i}{bwstrm}\right)\right) \times 0.05\right)

5.2. Torflow PID Control feedback

   The bandwidth authorities measure F_node: the filtered stream
   capacity through a given node (filtering is described in Section 1.6).

(=`filt_bw`?)

   The PID Setpoint, or target for each node is F_avg: the average F_node
   value observed across the entire network.

(=`filt_avg`?)

do we really want SP to be F_avg?

   The normalized PID error e(t) for each node is then:

       pid_error = e(t) = (F_node - F_avg)/F_avg. 
bwfilt = filt_bw

e(t) = \frac{bwfilt_i}{\mu_{bwfilt}} - 1

of the entire population or the sample of this measurement?

   In the "Process" step, we take the output of the "controller" and multiply it by
   the current consensus bandwidth for the node, and then add this new
   proportion to the consensus bandwidth, thereby adjusting the
   consensus bandwidth in proportion to the error:

     new_consensus_bw = old_consensus_bw +
                        old_consensus_bw * K_p * e(t) +
                        old_consensus_bw * K_i * \integral{e(t)} +
                        old_consensus_bw * K_d * \derivative{e(t)}
new_consensus_bw = bwwn
old_consensus_bw = cbw_o

bwwn = bwwo + bwwo K_p e(t) + bwwo K_i \int _{0}^{t}{e(t)} + bwwo K_d \frac {d}{dt}{e(t)}

why the 1st term?

   For the case where K_p = 1, K_i=0, and K_d=0, it can be seen that this
       new_bw = old_bw*ratio

what is new_bw and old_bw here?

K_p = 1, K_i=K_d=0: \\ bwn =& bwo + bwo e(t) \\ bwn =& bwo + bwo \left(\frac{bwfilt_i}{\mu_{bwfilt}} - 1\right) \\ bwn =& bwo \frac{bwfilt_i}{\mu_{bwfilt}}

   compute an average of the filt_bw fields over all nodes we have measured.

   These averages are used to produce ratios for each node by dividing the
   measured value for that node by the network average. 

   These ratios are then multiplied by the most recent observed descriptor
   bandwidth we have available for each node, to produce a new value for
   the network status consensus process.
bw_i = strm_bw

In the scaling without PID controller

bwn_i =& max\left( \frac{bw_i}{\mu}, \frac{bwfilt_i}{\mu_{bwfilt}} \right) \times bwobs_i

So the called ratio is the same, but not what is considered the new bw.

   For purposes of calculating the integral and the derivative of the
   error, we assume units of time that correspond to feedback intervals,
   eliminating the need to track time for any other purpose other than
   determining when to report a measurement.

   The integral component, pid_error_sum is subjected to a decay factor
   per each interval, to prevent unbounded growth in cases without
   convergence.

   The differential component is a simple delta, calculating by
   subtracting the previous pid_error from the current pid_error.

For the integral component: how the decay factor works?

K_i \int _{0}^{t}{e(t)} = ?

For the differential component:

K_d \frac {d}{dt}{e(t)} = e(t_{i-1}) - e(t_{i})