Designing a Bitcoin node crawler

In this article, some basics of the Bitcoin network protocol are examined through a practical lens by looking at a Bitcoin node crawler I developed. The node crawler, serpent, is implemented in python and is available on github.

Motivation

Despite having been around for a while now, some parts of Bitcoin remain poorly documented. For experts, “the source code is the documentation.” But for people taking their first steps into Bitcoin, finding the corresponding piece of code and interpreting it correctly can be quite challenging, so uncertainties may remain. In many instances, examining a particular aspect in practice can help clear up these uncertainties. For that reason, this article combines the discussion of the protocol specification with its practical application in a node crawler.

Initial node discovery

Most network protocols follow the client-server paradigm, in which multiple clients connect to a central server. Bitcoin, in contrast, uses a so-called peer-to-peer network topology, which means that nodes in the network are directly connected to each other. Typically, a fresh node discovers peers to connect to using hard-coded DNS seed nodes. (see, e.g., here for Bitcoin core). When queried, these seed nodes reply with a list of node IP addresses. The practical application of this initial peer discovery is demonstrated in the bootstrap_nodes function in the node crawler:

def bootstrap_nodelist(self):
  for seed in DNS_SEEDS:
    hosts = socket.getaddrinfo(seed, 53, proto=socket.IPPROTO_TCP)
    for host in hosts:
      self.pending.add(Address(host[4][0], 8333))

The loop in the second line iterates over all hard-coded seed nodes. The getaddrinfo function returns a list of hosts for each of the seeds. Each list item returned by the getaddrinfo function is actually a five-tuple, containing different pieces of information (see here for details). For now, it is sufficient to know that the first element in the fourth tuple (i.e., host[4][0]) corresponds to a host's IP address. Each host's IP address is paired with the default port of the Bitcoin protocol (8333), and added to the pending set. This is the set of nodes that will later be visited by the node crawler.

Connecting to a Bitcoin node

Establishing a connection to a Bitcoin node involves a four-way handshake. This handshake begins by the initiating node sending a version message to a desired peer. This version message contains several relevant pieces of information, such as the node's supported protocol version, its current time, IP addresses, current height of the node's blockchain, and so on (see here for a complete list). The peer then replies with a verack message to acknowledge the receipt of the version message. Next, the process is repeated in the other direction: the peer sends the requesting node a version message and the initiating node replies with a verack message. Once those four messages have been exchanged, the handshake is complete and the connection to the node is established.

Regular node operation

At this point, a regular Bitcoin node examine the version messages from its peers to determine the height of their blockchains. In addition, nodes also exchange getblocks messages, where each node communicates the hash of the highest block in its blockchains. The node with the longer blockchain will recognize the other node's hash because it is already in its blockchain. The node with the longer chain then sends one or more inv (inventory) messages. Each inv messages advertises the hashes of 500 missing blocks to the node with the shorter chain. The node receiving the inv messages can request missing blocks by sending getdata messages for the hashes it received.

Once the node is up to date, it can start its duty: receiving, validating, and relaying transactions and blocks. None of this, however, is relevant for a node crawler, so the details are not covered here.

Discovering additional nodes

Once a connection has been established, a node can be queried for a history of its peers using the getaddr message. The queried node will reply with one or more addr messages, each of which typically contains around 1,000 entries. Each entry represents a previously seen node and includes a timestamp, indicating when the node was last seen, and the corresponding IP address and port. Once a queried node has sent its history of nodes, it simply stops sending addr messages. The fact that no more data will be sent is thus implicit, so at any given point in time the requesting node can never know whether the queried node will send more addr messages in the future. A simple heuristic for the requesting node to decide whether the queried node is done sending addr messages is to use some sort of timeout. The node crawler implements the following strategy:

async def get_peers(self):
  self.send(GetAddrMessage())
  new_nodes = set()
  start = time.time()
  while time.time() - start < 30:
    try:
      message = await self.wait_for(AddrMessage, timeout=10)
      except asyncio.TimeoutError:
        break
      new_nodes |= set(message.addresses)
  return new_nodes

The crawler begins by sending a getaddr message to a peer. The while loop makes sure the node does not receive addr messages for more than thirty seconds. This turned out to be a good heuristic, for most peers finish sending their peer history in less than thirty seconds. In addition, the maximum allowed interval between consecutive addr messages is set to ten seconds (cf. the timeout=10 parameter in the call to wait_for), so nodes that have finished sending messages or went offline are detected in ten rather than thirty seconds.

Finding active nodes

With mechanisms of initial node discovery, connecting to nodes, and requesting peers' node histories covered, determining the set of active Bitcoin nodes is simple. The following is a (deliberately shortened) version of the function that implements the crawling logic.

async def crawler(self):
  while True:
    if not self.pending:
      return
    node = random.sample(self.pending, 1)[0]
    self.pending.remove(node)
    try:
      connection = await self.connect(node)
    except (ConnectionRefusedError, ..., asyncio.TimeoutError):
      self.unreachable.add(node)
      continue
    self.active.add(node)
    peers = await connection.get_peers()
    await connection.close()
    for peer in peers:
      if peer in (self.active | self.unreachable | self.pending):
        continue
      if peer.timestamp < (time.time() - (24*60*60)):
        continue
      self.pending.add(peer)

Before crawler is executed, the bootstrap_nodelist function is run to populate self.pending with a set of nodes from the seed nodes. The crawler keeps executing in a loop until no more nodes are left in the set of pending nodes. Each loop iteration starts with picking a random node from the set of pending nodes. The crawler then tries to establish a connection to the selected node using the previously discussed four-way handshake. If a connection can be established, the node is considered active, so it added to the set of active nodes; if the connection fails, it is instead added to the set of unreachable nodes. In the next step, the node is asked for its peer history using the previously discussed get_peers function. Finally, the connection is closed. Instead of naively adding the peers received by the node to the set of pending nodes, some filtering occurs: nodes that have been seen before (and, therefore, must be in either of the active, unreachable, or pending node sets) are discarded. The same applies to nodes the peer has not seen in more than a day because there is a good chance the node has since changed its IP address. All remaining nodes from the peer's history are added to the set of pending nodes. After this is done, the next loop iteration is executed.

Performance optimization

It turns out that most of the node crawler's execution time is spent waiting for other node's replies over the network. Instead of letting this time spent waiting go to waste, it can be used to speed up the processing of the set of pending nodes by running multiple crawlers in parallel. To this end, the async and await keywords from python's asynchronous I/O library (asyncio) can be used to make a crawler yield the CPU whenever it performs a blocking operating (such as, e.g., waiting for a peer's reply). This way, whenever a particular crawler yields the CPU, it enables another crawler to fill the waiting time with useful work.

Another way to improve the crawler's execution time is to not ask every node for its peer history. The rationale behind this optimization is that there might be a certain degree of overlap in peers' node histories. By only asking some percentage, p, of nodes for their peer history, runtime can be reduced significantly because most of it is spent waiting for addr messages. The figure below compares the number of discovered nodes to the crawler's runtime for different values of p using five runs for each setting.

Some graph

While very small values of p results in a very low number of discovered active nodes (e.g., p = 2%), the data indicates that a p of around 10% is sufficient to discover all active nodes. Compared to the naive implementation that asked all nodes for its peer history, this optimizations decreases the runtime by a factor of more than three.

If you found the information in this article useful, feel free to contribute: 16pGpaoAhzoneLdRdxPSo9xAAPhzWnP2dA. If you have scientific, Bitcoin-related freelance work, let me know.