Tor Hackweek Project: Prometheus alerts for anti-censorship metrics

Summary: So far, BridgeDB is the only anti-censorship service exporting Prometheus metrics, and we could implement the same for Snowflake. It would be great to get alerts when usage changes, to notify us of possible censorship events. Somewhat related, it would also be nice to get alerts when default bridge usage, or the number of directly connecting Tor users from a given region, drops off suddenly.

Skills Needed: Maybe Go (for changes to Snowflake), maybe Python (for other services), some sysadmin experience to figure out how to do the alerts, and metrics pipeline experience.

Team

anarcat (UTC-4)
cecylia (cohosh) (UTC-4)
tara(?) (UTC+1)
agix(?) (UTC+1)

Main objectives

Documentation!
- documented what the prometheus2 server is doing
- document all of our anti-censorship alerts in one place (where?): not completed?
Expand our prometheus metrics for anti-censorship services
- export existing Snowflake metrics to Prometheus; in general, see those guidelines for adding metrics
- add disk space/RAM/CPU monitoring for anti-censorship services: some of this is already covered by TPA on TPA machines. External services should be monitored explicitly: install the Prometheus node exporter (Debian package) and tell TPA which URL to scrape
- expand the metrics tor exports for Prometheus: not done?
Play around with prometheus alert rules to recognize both outages and trends
- tor exports prometheus data out of the metrics port now!
- we did some work on alerting
- we set up basic alerts on bridgestrap metrics to monitor bridges
Figure out where to send all of our alerts
- emails are sent to our existing anti-censorship alerts mailing list
- make sure we're also noticing logged errors for our services (we currently only use logs for debugging). Advice from anarcat: log analysis is hard and annoying. Instead, export error- or warning-specific counters as metrics and alert on those; you can dig into the logs afterwards to see the exact errors
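
As a rough illustration of anarcat's advice above, here is a minimal Python sketch that counts warning and error log records and renders them in the Prometheus text exposition format. It is not taken from any of our services: the metric name service_log_messages_total, the logger name demo_service, and the class LogLevelCounter are all hypothetical, and it uses only the standard library so that it stays self-contained. A real service would more likely use the prometheus_client library's Counter and serve the result from its metrics endpoint.

```python
import logging
import threading
from collections import Counter


class LogLevelCounter(logging.Handler):
    """Logging handler that counts WARNING-and-above records by level."""

    def __init__(self, metric_name="service_log_messages_total"):
        # Only records at WARNING or above reach emit().
        super().__init__(level=logging.WARNING)
        self.metric_name = metric_name  # hypothetical metric name
        self._counts = Counter()
        self._counts_lock = threading.Lock()

    def emit(self, record):
        # Count the record; the message itself stays in the normal logs.
        with self._counts_lock:
            self._counts[record.levelname.lower()] += 1

    def render(self):
        """Return the counters in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.metric_name} Log records at WARNING or above, by level.",
            f"# TYPE {self.metric_name} counter",
        ]
        with self._counts_lock:
            for level, value in sorted(self._counts.items()):
                lines.append(f'{self.metric_name}{{level="{level}"}} {value}')
        return "\n".join(lines) + "\n"


# Usage: attach the handler to a (hypothetical) service logger.
logger = logging.getLogger("demo_service")
counter = LogLevelCounter()
logger.addHandler(counter)

logger.warning("probe failed")
logger.error("bridge unreachable")
logger.error("bridge unreachable")

print(counter.render())
```

An alert rule could then fire when the rate of the error-level counter rises above some threshold, and the logs remain the place to look up what exactly went wrong.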