DAO-curated node registry - 'Verified Operators'

I don’t know anything about operator node performance. Still, maybe we pick a number that (based on experience) requires either a) a relatively expensive/powerful node or b) a multi-node/load-balanced setup to maintain a high operator score. I don’t know how many validators a single operator node could handle before performance degrades.
This would also prove that the operator can serve a larger number of validators (which will come in after being ‘verified’).

However, I admit that this seems more complex for now. I just wanted to raise this “concern.”

You’d probably need a few hundred validators to start stressing anything on a basic setup.
I think the bigger (initial) challenge is to set up the node and make sure it performs well even with a modest number of validators. Anything after that is scale.

2 Likes

Understood! Makes sense. Thanks for the explanation.

I agree with this proposal. Giving validators some certainty that at least a couple of their operators are verified is a great help!

I think including client choice in the operator score is a great idea as well.

Another way of doing it could be to directly incentivize validators to choose operators with a diverse set of clients through the operator SSV fee.
The base fee could be what the operator wants for their services, and then we could add a “health” fee on top depending on which operators you’ve already chosen.

The first Prysm operator is cheap, the second Prysm operator is more expensive, and so on.

This would in turn incentivize operators into spinning up machines with clients that are “in demand” as they are cheaper to the validator.
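
To make the idea concrete, here is a rough sketch of how such a “health” fee could be computed (purely illustrative: the function name, the surcharge schedule, and the client labels are all made up, not part of the actual SSV fee model):

```python
# Hypothetical sketch of the "health fee" idea: the base fee is whatever the
# operator charges, and a surcharge is added for every operator the validator
# has already picked that runs the same client.

def quote_fee(base_fee: float, client: str, already_chosen: list[str],
              surcharge_per_duplicate: float = 0.25) -> float:
    """Effective fee for an operator running `client`, given the clients
    of the operators the validator has already chosen."""
    duplicates = sum(1 for c in already_chosen if c == client)
    # The first Prysm operator pays no surcharge, the second +25%,
    # the third +50%, and so on.
    return base_fee * (1 + surcharge_per_duplicate * duplicates)

# Example: a third Prysm operator costs 50% more than the first one would.
chosen = ["prysm", "prysm", "lighthouse"]
print(quote_fee(10.0, "prysm", chosen))  # 15.0
print(quote_fee(10.0, "teku", chosen))   # 10.0
```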

Operator diversity is another thing I think could be interesting to discuss, but probably for another time.

1 Like

Ultimately I think it’s the validator’s choice; keep in mind that SSV will probably be used by devs more than by individual stakers.

Another point: it’s not easy to validate what clients operators are actually running. If we create incentives for specific clients, it will be much easier for operators to simply claim they use other clients than to actually run them.

1 Like

Makes sense :+1: Maybe in the future it will be easier to validate what client an operator is running with the help of some type of Blockprint-like signature; otherwise, yeah, operators will surely just spoof what they are running…

Thinking about security and operations, additional operator commitments may make sense.

  • Commitment to following security best practices. It’s a good question whether this should be spelled out here or in a separate document, and maintained by whom. Things such as the following (a small self-audit sketch follows this list):

    • Security updates for all components of the infrastructure (OS, node, execution, consensus) installed within 24 hours (this likely means unattended-upgrades for the OS)
    • SSH auth restricted to key or 2FA, no plain user/password
    • REST/WS/RPC APIs restricted to access from the NO’s own infrastructure, not “public”
    • If consensus/execution are accessed via Internet by SSV node, TLS encryption for that traffic
    • At-rest encryption of the SSV node’s DB storage, if it is with a cloud provider
  • Commitment to deploy hard fork updates to consensus/execution clients at least three (3) days before the hard fork, if sufficient notice was given by the EF / client teams

  • Commitment to deploy maintenance releases of all clients (node/consensus/execution) in a timely manner - 1 week?

  • Commitment to have only one storage provider for the SSV DB and slashing protection DB, and to run only one SSV node at a time. If failover is desired, it must be handled by container orchestration and shared stateful storage. If storage is replicated (for example EFS, OnDat, Ceph), it must prioritize data integrity over data availability - that is, in a split-brain scenario, the storage that “split off” goes offline. The intent is to put guard rails around the risk of running the SSV node twice - so things like “custom replication scripts” should (must?) be avoided, as should failover modes that don’t use an orchestration framework such as k8s or Docker swarm mode.
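
For two of the items above (SSH password auth and OS security updates), here is a minimal self-audit sketch, assuming a Debian/Ubuntu host - the file paths and option names are distro-specific, and this is illustrative only, not an official verification tool:

```python
# Minimal self-audit sketch for two of the commitments above, assuming a
# Debian/Ubuntu host (paths and option names differ on other distros).

from pathlib import Path

def ssh_password_auth_disabled(config: str = "/etc/ssh/sshd_config") -> bool:
    """True if PasswordAuthentication is explicitly set to 'no'."""
    try:
        for line in Path(config).read_text().splitlines():
            stripped = line.strip()
            if stripped.lower().startswith("passwordauthentication"):
                # sshd honours the first occurrence of a keyword.
                return stripped.split()[-1].lower() == "no"
    except OSError:
        pass
    return False

def unattended_upgrades_enabled(config: str = "/etc/apt/apt.conf.d/20auto-upgrades") -> bool:
    """True if APT unattended security upgrades appear to be switched on."""
    try:
        text = Path(config).read_text()
    except OSError:
        return False
    return 'APT::Periodic::Unattended-Upgrade "1"' in text

if __name__ == "__main__":
    print("SSH password auth disabled:", ssh_password_auth_disabled())
    print("Unattended upgrades enabled:", unattended_upgrades_enabled())
```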

Good question how prescriptive the DAO wants to get. Some level of “you need to be this tall” seems prudent.

1 Like

I like the proposal and framework; lots of good comments around here too.

I’ve added some comments and ideas below. Overall I think we can keep it simple and clear for now and improve later on.

Mechanics
2. The community should have at least 3 days for comments and modifications → a bit short in my mind too. I think 1 week to 10 days is more appropriate

Criteria
→ Should we add that the Verified node operator should have at least a website with basic information and contact details?

Appendix - Verified Operator status request template
→ Join the Discord and share your organization’s Discord IDs? Can be helpful for contact later on
→ Add a security/emergency contact email?

2 Likes

Speaking as a DAppNode operator, I want to push the thinking back in the direction of simplicity.

If we require all the verified operators to run complex databases with replicated storage, etc., it’s going to rule out all the smaller distributed operators.

Part of the beauty of the SSV model is that it requires at least 3 of the 4 operators to come to consensus before making an attestation, which means that it should be exceptionally rare for a validator to get slashed if just one of the four operators has a bad database problem.

If the SSV consensus layer works the way I think it should work, it should alleviate a lot of the concerns around a bad operator causing slashing.

But it does suggest to me that we should encourage users to select their operators from 4 totally separate pools. That way they don’t get slashed if one big operator has a database problem that impacts all the SSV operators in their farm.
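
As a rough back-of-the-envelope illustration of the 3-of-4 argument (the per-operator availability number is invented, and real failures are rarely independent):

```python
# Back-of-the-envelope sketch of the 3-of-4 threshold argument above.
# The per-operator availability figure is made up purely for illustration.

from itertools import combinations
from math import prod

def p_at_least_3_of_4(p_up: float) -> float:
    """Probability that at least 3 of 4 independent operators are online."""
    total = 0.0
    for k in (3, 4):
        for up_set in combinations(range(4), k):
            total += prod(p_up if i in up_set else 1 - p_up for i in range(4))
    return total

# With independent 99%-available operators, the committee signs ~99.94% of the time.
print(p_at_least_3_of_4(0.99))
# If all 4 operators sit behind the same provider or storage backend, a single
# outage takes the whole committee down - hence picking from separate pools.
```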

2 Likes

I tend to agree, but you also want to be able to empower users to make optimal choices when it comes to their operators. It’s not only about not getting slashed; it’s also about optimizing performance and uptime and making sure your chosen operators are professional and responsive.

Changed to 7 days
Website - not sure; there are a lot of great operators from DAppNode, for example, which offer a hardware option (pretty unique) but are not a full-blown company
Discord - I think that some sort of means of communication is in order. However, I don’t want to limit operators to using the ssv.network’s Discord. I’ll add the emergency contact part

1 Like

Lose of status

  1. DAO vote to remove your status
  2. Operator score below X(TBD) over a period exceeding 2 weeks

Typo: Needs to be “Loss of status”
How about 90% for the operator score?

Maybe 2. should be “2. Operator score below X% *(TBD) over a period exceeding 2 weeks will trigger a DAO vote to remove status, see 1.”

Should operators on cloud services divulge which service they are on so that you don’t get an AWS outage knocking out 3 of 4 nodes?

Definitely! I think part of being a verified operator is exactly that.

I thought about it a bit. I think there is a difference between testnet and mainnet in that regard.
One of the things I put in my “path to mainnet” blog post was to create a verification framework that should be transparent and comprehensive. Within that, I think there is room to add a parameter that limits the shares assigned to a verified operator - a way to verify operators quickly but cap the number of shares they can get at first, slowly raising the cap the more an operator proves itself. Of course, very well-known operators can start with a higher limit.
This leaves room for smaller operators to be verified and to gradually make their setups more sophisticated as they take on more shares.
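
A toy sketch of what such a graduated share limit could look like (the tiers, numbers, and function name are invented for illustration, not a DAO decision):

```python
# Illustrative sketch of the graduated share-limit idea: a newly verified
# operator starts with a small cap that grows as it proves itself over time.

def share_limit(months_verified: int, well_known: bool = False) -> int:
    """Maximum validator shares a verified operator may be assigned."""
    base = 500 if well_known else 100   # well-known operators start higher
    # Double the cap every 3 months of good standing, up to a ceiling.
    growth = 2 ** (months_verified // 3)
    return min(base * growth, 10_000)

for months in (0, 3, 6, 12):
    print(months, "months:", share_limit(months),
          "/ well-known:", share_limit(months, well_known=True))
```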

For testnet I’d maybe go with a simpler approach.

I think the removal should be automatic; if the DAO needs to vote on every downgrade, it might take time, and more users might get hurt by continuing to choose a faulty operator.

I added 90%; let’s use that benchmark and see if it isn’t contested.
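
For the automatic rule, something like the following check (a sketch only - the data shape and function name are hypothetical) captures “score below 90% over a period exceeding 2 weeks”:

```python
# Sketch of the automatic downgrade rule: verified status is lost when the
# operator score stays below 90% for more than 14 consecutive days.

from datetime import date, timedelta

THRESHOLD = 0.90
GRACE = timedelta(days=14)

def should_lose_status(daily_scores: dict[date, float]) -> bool:
    """True if the score stayed below the threshold for a stretch longer than 2 weeks."""
    streak_start = None
    for day in sorted(daily_scores):
        if daily_scores[day] < THRESHOLD:
            streak_start = streak_start or day
            if day - streak_start > GRACE:
                return True
        else:
            streak_start = None
    return False

# Example: 16 straight days below 90% triggers removal.
scores = {date(2022, 3, 1) + timedelta(days=i): 0.85 for i in range(16)}
print(should_lose_status(scores))  # True
```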

Typo fixed

2 Likes

I’d also point out that AWS is not a monolithic service. For example, recent us-east-1 and us-west-1 outages didn’t touch my infrastructure in us-east-2.

Divulging service and service location is a good idea, I think.

Good point, added in item 4. LMK if the phrasing needs work

I am thinking this should be nuanced: a red phone for Blox staff, not public, and a way for stakers to get in touch, such as a Discord or Telegram channel.

A red phone for Blox implies that we are coordinating the network, and we want it to be decentralized. So stakers or other operators with exposure to an offline node will work to alert the respective operator.

I’ll add the communication channels you suggested.

1 Like