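Introduction

etcd is a strongly consistent, distributed key-value store that provides a reliable way to store data that needs to be accessed by a distributed system or a cluster of machines. It gracefully handles leader elections during network partitions (using Raft).

Simple interface

Read and write values using standard HTTP tools such as curl.

Key-value storage

Store data in hierarchically organized directories, as in a standard filesystem.

Watch for changes

Watch specific keys or directories for changes and react to changes in values.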

Installation

1. Download the binary

`wget https://github.com/etcd-io/etcd/releases/download/v3.5.0/etcd-v3.5.0-linux-arm64.tar.gz`

2. Extract the archive

`tar zxvf etcd-v3.5.0-linux-arm64.tar.gz`

3. Move the binaries to the target directory

mkdir /usr/local/bin/etcd
cd etcd-v3.5.0-linux-arm64
mv etcd etcdctl etcdutl /usr/local/bin/etcd

4. View the help output

Usage:

  etcd [flags]
    Start an etcd server.

  etcd --version
    Show the version of etcd.

  etcd -h | --help
    Show the help information about etcd.

  etcd --config-file
    Path to the server configuration file. Note that if a configuration file is provided, other command line flags and environment variables will be ignored.

  etcd gateway
    Run the stateless pass-through etcd TCP connection forwarding proxy.

  etcd grpc-proxy
    Run the stateless etcd v3 gRPC L7 reverse proxy.


Member:
  --name 'default'
    Human-readable name for this member.
  --data-dir '${name}.etcd'
    Path to the data directory.
  --wal-dir ''
    Path to the dedicated wal directory.
  --snapshot-count '100000'
    Number of committed transactions to trigger a snapshot to disk.
  --heartbeat-interval '100'
    Time (in milliseconds) of a heartbeat interval.
  --election-timeout '1000'
    Time (in milliseconds) for an election to timeout. See tuning documentation for details.
  --initial-election-tick-advance 'true'
    Whether to fast-forward initial election ticks on boot for faster election.
  --listen-peer-urls 'http://localhost:2380'
    List of URLs to listen on for peer traffic.
  --listen-client-urls 'http://localhost:2379'
    List of URLs to listen on for client traffic.
  --max-snapshots '5'
    Maximum number of snapshot files to retain (0 is unlimited).
  --max-wals '5'
    Maximum number of wal files to retain (0 is unlimited).
  --quota-backend-bytes '0'
    Raise alarms when backend size exceeds the given quota (0 defaults to low space quota).
  --backend-bbolt-freelist-type 'map'
    BackendFreelistType specifies the type of freelist that boltdb backend uses(array and map are supported types).
  --backend-batch-interval ''
    BackendBatchInterval is the maximum time before commit the backend transaction.
  --backend-batch-limit '0'
    BackendBatchLimit is the maximum operations before commit the backend transaction.
  --max-txn-ops '128'
    Maximum number of operations permitted in a transaction.
  --max-request-bytes '1572864'
    Maximum client request size in bytes the server will accept.
  --grpc-keepalive-min-time '5s'
    Minimum duration interval that a client should wait before pinging server.
  --grpc-keepalive-interval '2h'
    Frequency duration of server-to-client ping to check if a connection is alive (0 to disable).
  --grpc-keepalive-timeout '20s'
    Additional duration of wait before closing a non-responsive connection (0 to disable).
  --socket-reuse-port 'false'
    Enable to set socket option SO_REUSEPORT on listeners allowing rebinding of a port already in use.
  --socket-reuse-address 'false'
        Enable to set socket option SO_REUSEADDR on listeners allowing binding to an address in TIME_WAIT state.

Clustering:
  --initial-advertise-peer-urls 'http://localhost:2380'
    List of this member's peer URLs to advertise to the rest of the cluster.
  --initial-cluster 'default=http://localhost:2380'
    Initial cluster configuration for bootstrapping.
  --initial-cluster-state 'new'
    Initial cluster state ('new' or 'existing').
  --initial-cluster-token 'etcd-cluster'
    Initial cluster token for the etcd cluster during bootstrap.
    Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
  --advertise-client-urls 'http://localhost:2379'
    List of this member's client URLs to advertise to the public.
    The client URLs advertised should be accessible to machines that talk to etcd cluster. etcd client libraries parse these URLs to connect to the cluster.
  --discovery ''
    Discovery URL used to bootstrap the cluster.
  --discovery-fallback 'proxy'
    Expected behavior ('exit' or 'proxy') when discovery services fails.
    "proxy" supports v2 API only.
  --discovery-proxy ''
    HTTP proxy to use for traffic to discovery service.
  --discovery-srv ''
    DNS srv domain used to bootstrap the cluster.
  --discovery-srv-name ''
    Suffix to the dns srv name queried when bootstrapping.
  --strict-reconfig-check 'true'
    Reject reconfiguration requests that would cause quorum loss.
  --pre-vote 'true'
    Enable to run an additional Raft election phase.
  --auto-compaction-retention '0'
    Auto compaction retention length. 0 means disable auto compaction.
  --auto-compaction-mode 'periodic'
    Interpret 'auto-compaction-retention' one of: periodic|revision. 'periodic' for duration based retention, defaulting to hours if no time unit is provided (e.g. '5m'). 'revision' for revision number based retention.
  --enable-v2 'false'
    Accept etcd V2 client requests. Deprecated and to be decommissioned in v3.6.
  --v2-deprecation 'not-yet'
    Phase of v2store deprecation. Allows to opt-in for higher compatibility mode.
    Supported values:
      'not-yet'                // Issues a warning if v2store have meaningful content (default in v3.5)
      'write-only'             // Custom v2 state is not allowed (planned default in v3.6)
      'write-only-drop-data'   // Custom v2 state will get DELETED !
      'gone'                   // v2store is not maintained any longer. (planned default in v3.7)

Security:
  --cert-file ''
    Path to the client server TLS cert file.
  --key-file ''
    Path to the client server TLS key file.
  --client-cert-auth 'false'
    Enable client cert authentication.
  --client-crl-file ''
    Path to the client certificate revocation list file.
  --client-cert-allowed-hostname ''
    Allowed TLS hostname for client cert authentication.
  --trusted-ca-file ''
    Path to the client server TLS trusted CA cert file.
  --auto-tls 'false'
    Client TLS using generated certificates.
  --peer-cert-file ''
    Path to the peer server TLS cert file.
  --peer-key-file ''
    Path to the peer server TLS key file.
  --peer-client-cert-auth 'false'
    Enable peer client cert authentication.
  --peer-trusted-ca-file ''
    Path to the peer server TLS trusted CA file.
  --peer-cert-allowed-cn ''
    Required CN for client certs connecting to the peer endpoint.
  --peer-cert-allowed-hostname ''
    Allowed TLS hostname for inter peer authentication.
  --peer-auto-tls 'false'
    Peer TLS using self-generated certificates if --peer-key-file and --peer-cert-file are not provided.
  --self-signed-cert-validity '1'
    The validity period of the client and peer certificates that are automatically generated by etcd when you specify ClientAutoTLS and PeerAutoTLS, the unit is year, and the default is 1.
  --peer-crl-file ''
    Path to the peer certificate revocation list file.
  --cipher-suites ''
    Comma-separated list of supported TLS cipher suites between client/server and peers (empty will be auto-populated by Go).
  --cors '*'
    Comma-separated whitelist of origins for CORS, or cross-origin resource sharing, (empty or * means allow all).
  --host-whitelist '*'
    Acceptable hostnames from HTTP client requests, if server is not secure (empty or * means allow all).

Auth:
  --auth-token 'simple'
    Specify a v3 authentication token type and its options ('simple' or 'jwt').
  --bcrypt-cost 10
    Specify the cost / strength of the bcrypt algorithm for hashing auth passwords. Valid values are between 4 and 31.
  --auth-token-ttl 300
    Time (in seconds) of the auth-token-ttl.

Profiling and Monitoring:
  --enable-pprof 'false'
    Enable runtime profiling data via HTTP server. Address is at client URL + "/debug/pprof/"
  --metrics 'basic'
    Set level of detail for exported metrics, specify 'extensive' to include server side grpc histogram metrics.
  --listen-metrics-urls ''
    List of URLs to listen on for the metrics and health endpoints.

Logging:
  --logger 'zap'
    Currently only supports 'zap' for structured logging.
  --log-outputs 'default'
    Specify 'stdout' or 'stderr' to skip journald logging even when running under systemd, or list of comma separated output targets.
  --log-level 'info'
    Configures log level. Only supports debug, info, warn, error, panic, or fatal.
  --enable-log-rotation 'false'
    Enable log rotation of a single log-outputs file target.
  --log-rotation-config-json '{"maxsize": 100, "maxage": 0, "maxbackups": 0, "localtime": false, "compress": false}'
    Configures log rotation if enabled with a JSON logger config. MaxSize(MB), MaxAge(days,0=no limit), MaxBackups(0=no limit), LocalTime(use computers local time), Compress(gzip)". 

Experimental distributed tracing:
  --experimental-enable-distributed-tracing 'false'
    Enable experimental distributed tracing.
  --experimental-distributed-tracing-address 'localhost:4317'
    Distributed tracing collector address.
  --experimental-distributed-tracing-service-name 'etcd'
    Distributed tracing service name, must be same across all etcd instances.
  --experimental-distributed-tracing-instance-id ''
    Distributed tracing instance ID, must be unique per each etcd instance.

v2 Proxy (to be deprecated in v3.6):
  --proxy 'off'
    Proxy mode setting ('off', 'readonly' or 'on').
  --proxy-failure-wait 5000
    Time (in milliseconds) an endpoint will be held in a failed state.
  --proxy-refresh-interval 30000
    Time (in milliseconds) of the endpoints refresh interval.
  --proxy-dial-timeout 1000
    Time (in milliseconds) for a dial to timeout.
  --proxy-write-timeout 5000
    Time (in milliseconds) for a write to timeout.
  --proxy-read-timeout 0
    Time (in milliseconds) for a read to timeout.

Experimental feature:
  --experimental-initial-corrupt-check 'false'
    Enable to check data corruption before serving any client/peer traffic.
  --experimental-corrupt-check-time '0s'
    Duration of time between cluster corruption check passes.
  --experimental-enable-v2v3 ''
    Serve v2 requests through the v3 backend under a given prefix. Deprecated and to be decommissioned in v3.6.
  --experimental-enable-lease-checkpoint 'false'
    ExperimentalEnableLeaseCheckpoint enables primary lessor to persist lease remainingTTL to prevent indefinite auto-renewal of long lived leases.
  --experimental-compaction-batch-limit 1000
    ExperimentalCompactionBatchLimit sets the maximum revisions deleted in each compaction batch.
  --experimental-peer-skip-client-san-verification 'false'
    Skip verification of SAN field in client certificate for peer connections.
  --experimental-watch-progress-notify-interval '10m'
    Duration of periodical watch progress notification.
  --experimental-warning-apply-duration '100ms'
        Warning is generated if requests take more than this duration.
  --experimental-txn-mode-write-with-shared-buffer 'true'
    Enable the write transaction to use a shared buffer in its readonly check operations.
  --experimental-bootstrap-defrag-threshold-megabytes
    Enable the defrag during etcd server bootstrap on condition that it will free at least the provided threshold of disk space. Needs to be set to non-zero value to take effect.

Unsafe feature:
  --force-new-cluster 'false'
    Force to create a new one-member cluster.
  --unsafe-no-fsync 'false'
    Disables fsync, unsafe, will cause data loss.

CAUTIOUS with unsafe flag! It may break the guarantees given by the consensus protocol!

Usage

Start the service

`nohup etcd >/dev/null 2>&1 &`

Interact with etcd through etcdctl

`etcdctl help`

Check node status

[root@k8s-node03 ~]# etcdctl endpoint status --write-out=table
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 127.0.0.1:2379 | 8e9e05c52164694d |   3.5.5 |   20 kB |      true |      false |         4 |          9 |                  9 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
[root@k8s-node03 ~]#

Put and get

[root@k8s-node03 ~]# etcdctl put mk "hello world"
OK
[root@k8s-node03 ~]# etcdctl get mk
mk
hello world

Read by prefix

`etcdctl get gsrde/org/hello-service/ --prefix`

Watch a key for changes

`etcdctl watch key1`

Distributed lock

`etcdctl lock mutex1`

Transactions

Client libraries

https://etcd.io/docs/v3.5/integrations/

1. Install the dependency

`go get go.etcd.io/etcd/client/v3`

2. Start the service

`nohup etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 > /dev/null 2>&1 &`

3. Write test code

package main

import (
	"context"
	"fmt"
	clientv3 "go.etcd.io/etcd/client/v3"
	"log"
	"time"
)

const timeout = 5 * time.Second

var cli *clientv3.Client

func main() {
	initialize()
	// Register the cleanup right after the client is created.
	defer cli.Close()
	Put("mykey", "simple value ")
	Get("mykey")
	Get("mk")
	Do()
}

func initialize() {
	var err error
	cli, err = clientv3.New(clientv3.Config{
		Endpoints:   []string{"192.168.100.24:2379"},
		DialTimeout: timeout,
	})
	if err != nil {
		log.Fatalf("error : %v", err)
	}
}

func Get(key string) (ret []byte) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	resp, err := cli.Get(ctx, key)
	cancel()
	if err != nil {
		log.Fatalf("error : %v", err)
	}
	for _, ev := range resp.Kvs {
		fmt.Printf("%s : %s\n", ev.Key, ev.Value)
		ret = ev.Value
		break
	}
	return
}

func Put(key, val string) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	resp, err := cli.Put(ctx, key, val)
	cancel()
	if err != nil {
		log.Fatalf("error : %v", err)
	}
	fmt.Println(resp)
}

// Do is useful for running arbitrary operations built as clientv3.Op values.
func Do() {
	ops := []clientv3.Op{
		clientv3.OpPut("put-key", "123"),
		clientv3.OpGet("put-key"),
		clientv3.OpGet("put-key"),
		clientv3.OpPut("put-key", "456"),
		clientv3.OpPut("aaa", "bbbbbb"),
		clientv3.OpGet("aaa"),
	}
	for _, op := range ops {
		if resp, err := cli.Do(context.TODO(), op); err != nil {
			log.Fatal(err)
		} else {
			fmt.Println(resp.Get())
		}
	}
}

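Cluster

The Raft election algorithm

Raft reaches consensus by electing a leader.

Terminology

Node roles: leader, follower, and candidate. Under normal conditions there is exactly one leader, which handles all external requests; if a non-leader node receives a request, it forwards the request to the leader.
Heartbeat: the leader sends messages to the other nodes at a fixed interval so that followers know the leader is alive. If a follower receives no message within a certain time, the cluster enters an election.
Term: when the cluster's leader goes down, an election is needed and a new term begins.

Replicated state machine

Multiple nodes that start from the same initial state and execute the same inputs in the same order produce the same results.

In Raft, the leader wraps each client request (command) in a log entry and replicates these log entries to all followers. Every node then applies the commands in the log entries in the same order. By the replicated state machine model, all nodes are guaranteed to end in the same state.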
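The replicated state machine idea can be sketched in a few lines of Go. This is a toy illustration only, not etcd's or Raft's actual implementation: the `Command`, `LogEntry`, and `StateMachine` types are hypothetical names introduced here, and the "state machine" is just a map.

```go
package main

import "fmt"

// Command is a hypothetical client request: set Key to Value.
type Command struct {
	Key   string
	Value string
}

// LogEntry wraps one command together with the term in which it was created.
type LogEntry struct {
	Term    int
	Command Command
}

// StateMachine is a toy key-value state machine.
type StateMachine struct {
	data map[string]string
}

// NewStateMachine returns a state machine in the (empty) initial state.
func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]string)}
}

// Apply executes the commands of the given entries in log order.
func (s *StateMachine) Apply(entries []LogEntry) {
	for _, e := range entries {
		s.data[e.Command.Key] = e.Command.Value
	}
}

// Get returns the current value of key, or "" if it is unset.
func (s *StateMachine) Get(key string) string {
	return s.data[key]
}

func main() {
	entries := []LogEntry{
		{Term: 1, Command: Command{Key: "x", Value: "1"}},
		{Term: 1, Command: Command{Key: "y", Value: "2"}},
		{Term: 2, Command: Command{Key: "x", Value: "3"}},
	}
	// Two nodes start from the same state and apply the same log in order.
	leader, follower := NewStateMachine(), NewStateMachine()
	leader.Apply(entries)
	follower.Apply(entries)
	fmt.Println(leader.Get("x") == follower.Get("x"), leader.Get("y") == follower.Get("y"))
}
```

Because both nodes apply identical entries in identical order, their final states match. Keeping the logs identical is the job of the rest of the protocol.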

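Raft RPCs

Server nodes in Raft communicate via RPC, and Raft has only two main RPCs:

RequestVote RPC: initiated by candidates during an election.
AppendEntries RPC: initiated by the leader to replicate log entries and to serve as a heartbeat.

Leader election

Elections are started by candidates. When the leader's heartbeat times out, a follower increments its term counter, declares its candidacy (becomes a candidate), votes for itself, and requests votes from the other servers. Each server casts only one vote per term, and gives it to the first candidate that asks.

After voting, one of three outcomes can occur

The current node receives votes from a majority: it becomes the leader and sends heartbeats to the other followers.
Another node wins the election: when the current node receives a heartbeat, it checks whether the sender's term is greater than or equal to its own; if so, the current node moves from candidate to follower.
No winner emerges after some time (no candidate gets a majority): each candidate waits for a randomized election timeout (150-300 ms) and then starts the next round of the election.

Each server's timeout in Raft is randomized, which lowers the chance that servers start campaigning at the same time and lowers the chance that an election fails because two candidates each fall short of a majority.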
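The voting rules above can be sketched as follows. This is a simplified illustration under stated assumptions: the `Node` type, its field names, and the convention that `-1` means "no vote yet" are all hypothetical, and the log freshness check that real Raft performs before granting a vote is left out.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Node holds only the election-related state; names are illustrative.
type Node struct {
	CurrentTerm int
	VotedFor    int // candidate ID voted for in CurrentTerm, or -1 for none
}

// HandleRequestVote grants at most one vote per term, first come, first served.
// The log freshness check is intentionally omitted in this sketch.
func (n *Node) HandleRequestVote(candidateID, candidateTerm int) bool {
	if candidateTerm < n.CurrentTerm {
		return false // stale candidate
	}
	if candidateTerm > n.CurrentTerm {
		n.CurrentTerm = candidateTerm // a newer term resets the vote
		n.VotedFor = -1
	}
	if n.VotedFor == -1 || n.VotedFor == candidateID {
		n.VotedFor = candidateID
		return true
	}
	return false
}

// HasMajority reports whether votes is more than half of clusterSize.
func HasMajority(votes, clusterSize int) bool {
	return votes > clusterSize/2
}

// RandomElectionTimeout picks a timeout in [150ms, 300ms).
func RandomElectionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	follower := &Node{CurrentTerm: 1, VotedFor: -1}
	fmt.Println(follower.HandleRequestVote(2, 2)) // first request in term 2
	fmt.Println(follower.HandleRequestVote(3, 2)) // second candidate, same term
	fmt.Println(HasMajority(3, 5), HasMajority(2, 5))
	fmt.Println(RandomElectionTimeout())
}
```

Since every candidate draws its own timeout, one of them usually times out first and collects a majority before the others start campaigning.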

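Log replication

(Figure: each term number is shown in its own color.)
The leader sends AppendEntries RPCs to the followers in parallel so that they replicate the entry. Once the entry has been replicated on a majority of the servers, the leader can execute the command locally and return the result to the client. Executing the command locally, that is, the leader applying the log entry to its state machine, is what we call committing.

During log replication, the leader or a follower may crash or respond slowly, so Raft needs a set of mechanisms that support replication and guarantee that logs are replicated in a consistent order.

Follower crashes or responds slowly: the leader keeps resending AppendEntries RPCs to the follower, even after it has already responded to the client.
Follower recovers from a crash: the Raft consistency check runs, which makes sure the follower catches up on the log entries it missed while it was down.

Raft consistency check: in every AppendEntries RPC it sends to a follower, the leader includes the index and term of the entry that precedes the new ones. If the follower cannot find that preceding entry in its log, it rejects the new entries. On receiving the rejection, the leader retries with the entry before that, stepping backward until it locates the first entry the follower is missing.

You might object that locating the entry one step at a time is inefficient: why doesn't the follower simply return the index of its last log entry, so the leader can go straight to the entry after it? That approach is feasible, but in practice the optimization is considered unnecessary, because failures are infrequent and there are unlikely to be many inconsistent entries.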
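The consistency check and the leader's step-by-step backtracking can be sketched like this. It is a simplified model, not etcd's implementation: `Entry`, `Follower`, and `Sync` are hypothetical names, indices are 1-based with 0 meaning "before the first entry", and on a successful match the follower simply replaces everything after the matching entry with the leader's entries.

```go
package main

import "fmt"

// Entry is one log entry; only the term matters for the consistency check.
type Entry struct {
	Term    int
	Command string
}

// Follower stores its log with 1-based indices: Log[0] is index 1.
type Follower struct {
	Log []Entry
}

// AppendEntries applies the consistency check. prevIndex 0 means "start of log".
// On a match, everything after prevIndex is replaced by entries (a simplification).
func (f *Follower) AppendEntries(prevIndex, prevTerm int, entries []Entry) bool {
	if prevIndex > len(f.Log) {
		return false // the preceding entry is missing
	}
	if prevIndex > 0 && f.Log[prevIndex-1].Term != prevTerm {
		return false // same index, different term
	}
	f.Log = append(f.Log[:prevIndex], entries...)
	return true
}

// Sync walks backward one entry per rejection until the follower accepts,
// and returns the number of AppendEntries attempts that were needed.
func Sync(leaderLog []Entry, f *Follower) int {
	nextIndex := len(leaderLog) + 1
	attempts := 0
	for {
		attempts++
		prevIndex := nextIndex - 1
		prevTerm := 0
		if prevIndex > 0 {
			prevTerm = leaderLog[prevIndex-1].Term
		}
		if f.AppendEntries(prevIndex, prevTerm, leaderLog[prevIndex:]) {
			return attempts
		}
		nextIndex--
	}
}

func main() {
	leaderLog := []Entry{{Term: 1, Command: "a"}, {Term: 1, Command: "b"}, {Term: 2, Command: "c"}, {Term: 3, Command: "d"}}
	// The follower was down and is missing the last two entries.
	f := &Follower{Log: []Entry{{Term: 1, Command: "a"}, {Term: 1, Command: "b"}}}
	attempts := Sync(leaderLog, f)
	fmt.Println(attempts, len(f.Log))
}
```

The loop always terminates, because a request with `prevIndex` 0 carries the whole log and is accepted unconditionally.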

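Leader crashes: the leader may have sent a log entry to some followers but crashed before committing it, and the newly elected leader may not have that entry, so some followers' logs differ from the new leader's log. In this case Raft resolves the conflict by forcing the followers to duplicate the new leader's log, which means the followers' conflicting entries are overwritten by the new leader's entries. Because those entries were never committed, external consistency is not violated. A follower that receives an entry from the leader cannot commit it right away, because at that point it is not yet known whether the entry has been replicated to a majority of nodes. The follower has to wait for the next AppendEntries RPC (a heartbeat or new entries) before it can complete the commit.

Note: a leader never overwrites or deletes its own log entries (append-only).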

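Safety

Guaranteeing that every state machine executes the same commands in the same order still leaves a number of edge cases, so Raft adds the following restrictions.

Handling leader crashes: the election restriction

If the last entries of two logs have different terms, the log whose last entry has the larger term is more up-to-date.
If the last entries of two logs have the same term, the longer log is more up-to-date.
A voter only grants its vote to a candidate whose term is not smaller than its own and whose log is at least as up-to-date as its own (a larger last term, or the same last term with an index that is at least as large).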
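The "more up-to-date" comparison is small enough to write down directly. The sketch below follows the comparison as the Raft paper states it (a vote is granted when the candidate's log is at least as up-to-date as the voter's); the function and parameter names are made up for this example.

```go
package main

import "fmt"

// IsAtLeastAsUpToDate reports whether the candidate's log, described by the
// term and index of its last entry, is at least as up-to-date as the voter's.
func IsAtLeastAsUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex int) bool {
	if candLastTerm != voterLastTerm {
		// Different last terms: the larger term wins regardless of length.
		return candLastTerm > voterLastTerm
	}
	// Same last term: the longer (or equally long) log wins.
	return candLastIndex >= voterLastIndex
}

func main() {
	// A shorter log whose last entry has a higher term is still more up-to-date.
	fmt.Println(IsAtLeastAsUpToDate(3, 2, 2, 5))
	// Same last term, but the candidate's log is shorter: the vote is refused.
	fmt.Println(IsAtLeastAsUpToDate(2, 4, 2, 5))
}
```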

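Handling leader crashes: does a new leader commit entries from previous terms?

If a leader crashes before committing a log entry, later leaders will try to finish replicating that entry.

Why a leader cannot use entries from older terms to decide what is committed

(a) S1 is the leader and only partially replicates the log entry at index 2 (it reaches S2).
(b) S1 crashes. S5 is elected leader for term 3 with votes from S3, S4, and itself. It creates a log entry but crashes before it can send it out.
(c) S5 crashes without having replicated any entries. S1 restarts, is elected leader, and continues replicating its log. At this point the entry from term 2 has been replicated on a majority of the servers, but it is not committed.
(d) S1 crashes. S5 can be elected leader (with votes from S2, S3, and S4, because S5 has the highest term) and replicate its own entry from term 3. Because S1 crashed before committing the term 2 entry, S5 overwrites that entry during replication even though a majority of the servers already hold it. An entry that should have been committed is overwritten.
(e) To solve this problem, an extra restriction is added: the leader must have at least one entry from its current term committed, that is, persisted on more than half of the nodes. In (e), S1 as leader replicates the entry at index 3 (term 4) on a majority of the nodes before crashing. Even if S1 then crashes, S5 cannot win an election, because the latest entries of S2 and S3 have term 4, which is greater than S5's term 3, so S5 cannot gather votes from more than half of the nodes. Since S5 cannot win the election, the entry at index 2 will not be overwritten.

For this reason, before accepting client writes, a newly elected leader needs to commit a no-op (empty command). Only once a log entry carrying the leader's own term has been replicated to a majority of the cluster does the election restriction truly hold.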
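The rule "only count replicas for entries of the current term" can be sketched as a commit index calculation on the leader. This is an illustrative sketch, not etcd's code: `AdvanceCommitIndex` and its parameters are hypothetical, and `matchIndex` is assumed to include the leader's own last index.

```go
package main

import (
	"fmt"
	"sort"
)

// AdvanceCommitIndex returns the leader's new commit index.
// logTerms[i] is the term of the entry at index i+1. matchIndex holds, for
// every server including the leader, the highest index replicated there.
func AdvanceCommitIndex(logTerms, matchIndex []int, currentTerm, commitIndex int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(sorted)))
	// In descending order, the value at position len/2 is held by a majority.
	majorityIndex := sorted[len(sorted)/2]
	// Commit only if that entry was created in the leader's current term;
	// earlier entries then become committed along with it.
	if majorityIndex > commitIndex && logTerms[majorityIndex-1] == currentTerm {
		return majorityIndex
	}
	return commitIndex
}

func main() {
	logTerms := []int{1, 2, 4} // S1's log in term 4: entries from terms 1, 2, 4

	// Case (c): the term 2 entry is on a majority, the term 4 entry is not.
	fmt.Println(AdvanceCommitIndex(logTerms, []int{3, 2, 2, 1, 1}, 4, 1))

	// Case (e): the term 4 entry has reached a majority.
	fmt.Println(AdvanceCommitIndex(logTerms, []int{3, 3, 3, 1, 1}, 4, 1))
}
```

In the first call the commit index stays where it was, because the entry held by a majority comes from an older term. In the second call it jumps to index 3, which commits the term 2 entry as well. A no-op entry at the start of a term gives the leader such a current-term entry right away.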

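Handling follower and candidate crashes

If a follower or candidate crashes, the AppendEntries and RequestVote RPCs sent to it will fail. The leader retries indefinitely, so the node receives them once it restarts.

Timing and availability

broadcast time (broadcastTime) ≪ election timeout (electionTimeout) ≪ mean time between failures (MTBF)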
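As a small worked example of this inequality, the helper below reads "≪" as "smaller by at least a given factor". Both the helper name and the factor of 10 are assumptions made for illustration, and the sample values (a 10 ms round trip, a 1000 ms election timeout, a 30-day MTBF) are hypothetical rather than measured.

```go
package main

import (
	"fmt"
	"time"
)

// TimingOK reports whether broadcast << election << mtbf holds, where "<<"
// is interpreted as "smaller by at least factor" (an assumption of this sketch).
func TimingOK(broadcast, election, mtbf time.Duration, factor int64) bool {
	f := time.Duration(factor)
	return broadcast*f <= election && election*f <= mtbf
}

func main() {
	// Hypothetical values: 10 ms broadcast time, 1000 ms election timeout, 30-day MTBF.
	fmt.Println(TimingOK(10*time.Millisecond, 1000*time.Millisecond, 30*24*time.Hour, 10))
	// An election timeout close to the broadcast time violates the requirement.
	fmt.Println(TimingOK(100*time.Millisecond, 150*time.Millisecond, 30*24*time.Hour, 10))
}
```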

Visual walkthrough

https://raft.github.io/raftscope-replay/

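Cluster membership changes

Split brain when scaling from three nodes to five
While nodes are being added, the old and new configurations may coexist, which can lead to more than one leader being elected: a split-brain problem.

Raft's solution is a two-phase approach

The cluster first switches to a transitional configuration called joint consensus. (This way we only need to focus on how to avoid split brain while in the joint consensus state.)

The configuration itself is wrapped as a log entry and sent to all followers in an ordinary AppendEntries RPC.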

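// AppendEntries RPC request
type AppendEntriesRequest struct {
	term         int    // the leader's current term
	leaderID     int    // ID of the leader (the sender), so the follower knows who it is
	prevLogIndex int    // index of the preceding log entry, used for the consistency check
	prevLogTerm  int    // term of the preceding log entry, used for the consistency check; the follower treats the logs as consistent only if both values match its own
	// If only the index matches, this may be case (f) in the figure above, and the leader still has to step backward.
	entries      []byte // the log payload, that is, the command contents
	leaderCommit int    // index of the leader's last committed entry
}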
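
Coming back to joint consensus: the text above does not spell out how the transitional configuration prevents split brain. In the Raft paper, the rule is that while the joint configuration is in effect, elections and commits need separate majorities from both the old and the new configuration. The sketch below illustrates that rule only; the function names and server IDs are hypothetical.

```go
package main

import "fmt"

// majority reports whether the servers marked in acks form a majority of config.
func majority(config []string, acks map[string]bool) bool {
	n := 0
	for _, id := range config {
		if acks[id] {
			n++
		}
	}
	return n > len(config)/2
}

// JointQuorum requires separate majorities in both the old and the new configuration.
func JointQuorum(oldCfg, newCfg []string, acks map[string]bool) bool {
	return majority(oldCfg, acks) && majority(newCfg, acks)
}

func main() {
	oldCfg := []string{"s1", "s2", "s3"}
	newCfg := []string{"s1", "s2", "s3", "s4", "s5"}

	// s3, s4, s5 are a majority of the new configuration but not of the old one.
	fmt.Println(JointQuorum(oldCfg, newCfg, map[string]bool{"s3": true, "s4": true, "s5": true}))

	// s1, s2, s3 are a majority of both configurations.
	fmt.Println(JointQuorum(oldCfg, newCfg, map[string]bool{"s1": true, "s2": true, "s3": true}))
}
```

Because any two joint quorums must overlap in the old configuration and in the new one, two leaders cannot both be elected in the same term during the transition.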