In this article, some basics of the Bitcoin network protocol are examined through a practical lens by looking at a Bitcoin node crawler I developed. The node crawler, serpent, is implemented in python and is available on github.
Despite having been around for a while now, some parts of Bitcoin remain poorly documented. For experts, “the source code is the documentation.” But for people taking their first steps into Bitcoin, finding the corresponding piece of code and interpreting it correctly can be quite challenging, so uncertainties may remain. In many instances, examining a particular aspect in practice can help clear up these uncertainties. For that reason, this article combines the discussion of the protocol specification with its practical application in a node crawler.
Initial node discovery
Most network protocols follow the client-server paradigm, in
which multiple clients connect to a central server. Bitcoin, in
contrast, uses a so-called peer-to-peer network topology,
which means that nodes in the network are directly connected to
each other. Typically, a fresh node discovers peers to connect
to using hard-coded DNS seed nodes.
(see, e.g., here for Bitcoin core).
When queried, these seed nodes reply with a list of node IP
addresses. The practical application of this initial peer
discovery is demonstrated in the
function in the node crawler:
def bootstrap_nodelist(self): for seed in DNS_SEEDS: hosts = socket.getaddrinfo(seed, 53, proto=socket.IPPROTO_TCP) for host in hosts: self.pending.add(Address(host, 8333))
The loop in the second line iterates over all hard-coded seed
getaddrinfo function returns a list of
hosts for each of the seeds. Each list item returned by the
getaddrinfo function is actually a five-tuple,
containing different pieces of information
(see here for details).
For now, it is sufficient to know that the first
element in the fourth tuple (i.e.,
corresponds to a host's IP address. Each host's IP address is
paired with the default port of the Bitcoin protocol (8333),
and added to the
pending set. This is the set of
nodes that will later be visited by the node crawler.
Connecting to a Bitcoin node
Establishing a connection to a Bitcoin node involves a
four-way handshake. This handshake begins by the initiating
node sending a
version message to a desired peer.
version message contains several relevant
pieces of information, such as the node's supported protocol
version, its current time, IP addresses, current height of the
node's blockchain, and so on (see here
for a complete list). The peer then replies with a
verack message to acknowledge the receipt of the
version message. Next, the process is repeated in
the other direction: the peer sends the requesting node a
version message and the initiating node replies
verack message. Once those four messages
have been exchanged, the handshake is complete and the
connection to the node is established.
Regular node operation
At this point, a regular Bitcoin node examine the
version messages from its peers to determine the
height of their blockchains. In addition, nodes also exchange
getblocks messages, where each node communicates
the hash of the highest block in its blockchains. The node with
the longer blockchain will recognize the other node's hash
because it is already in its blockchain. The node with the
longer chain then sends one or more
(inventory) messages. Each
inv messages advertises
the hashes of 500 missing blocks to the node with the shorter
chain. The node receiving the
inv messages can
request missing blocks by sending
for the hashes it received.
Once the node is up to date, it can start its duty: receiving, validating, and relaying transactions and blocks. None of this, however, is relevant for a node crawler, so the details are not covered here.
Discovering additional nodes
Once a connection has been established, a node can be
queried for a history of its peers using the
getaddr message. The queried node will reply with
one or more
addr messages, each of which typically
contains around 1,000 entries. Each entry represents a
previously seen node and includes a timestamp, indicating when
the node was last seen, and the corresponding IP address and
port. Once a queried node has sent its history of nodes, it
simply stops sending
addr messages. The fact that
no more data will be sent is thus implicit, so at any given
point in time the requesting node can never know whether the
queried node will send more
addr messages in the
future. A simple heuristic for the requesting node to decide
whether the queried node is done sending
messages is to use some sort of timeout. The node crawler
implements the following strategy:
async def get_peers(self): self.send(GetAddrMessage()) new_nodes = set() start = time.time() while time.time() - start < 30: try: message = await self.wait_for(AddrMessage, timeout=10) except asyncio.TimeoutError: break new_nodes |= set(message.addresses) return new_nodes
The crawler begins by sending a
to a peer. The
while loop makes sure the node
does not receive
addr messages for more than
thirty seconds. This turned out to be a good heuristic, for
most peers finish sending their peer history in less than thirty
seconds. In addition, the maximum allowed interval between
addr messages is set to ten seconds
timeout=10 parameter in the call to
wait_for), so nodes that have finished sending
messages or went offline are detected in ten rather than thirty
Finding active nodes
With mechanisms of initial node discovery, connecting to nodes, and requesting peers' node histories covered, determining the set of active Bitcoin nodes is simple. The following is a (deliberately shortened) version of the function that implements the crawling logic.
async def crawler(self): while True: if not self.pending: return node = random.sample(self.pending, 1) self.pending.remove(node) try: connection = await self.connect(node) except (ConnectionRefusedError, ..., asyncio.TimeoutError): self.unreachable.add(node) continue self.active.add(node) peers = await connection.get_peers() await connection.close() for peer in peers: if peer in (self.active | self.unreachable | self.pending): continue if peer.timestamp < (time.time() - (24*60*60)): continue self.pending.add(peer)
crawler is executed, the
bootstrap_nodelist function is run to populate
self.pending with a set of nodes from the seed
nodes. The crawler keeps executing in a loop until no more
nodes are left in the set of pending nodes. Each loop iteration
starts with picking a random node from the set of pending
nodes. The crawler then tries to establish a connection to the
selected node using the previously discussed four-way handshake.
If a connection can be established, the node is considered
active, so it added to the set of active nodes; if the
connection fails, it is instead added to the set of unreachable
nodes. In the next step, the node is asked for its peer history
using the previously discussed
Finally, the connection is closed. Instead of naively adding
the peers received by the node to the set of pending nodes,
some filtering occurs: nodes that have been seen before
(and, therefore, must be in either of the active, unreachable,
or pending node sets) are discarded. The same applies to nodes
the peer has not seen in more than a day because there is a
good chance the node has since changed its IP address. All
remaining nodes from the peer's history are added to the set of
pending nodes. After this is done, the next loop iteration is
It turns out that most of the node crawler's execution time
is spent waiting for other node's replies over the network.
Instead of letting this time spent waiting go to waste, it can
be used to speed up the processing of the set of pending nodes
by running multiple crawlers in parallel. To this end, the
await keywords from
python's asynchronous I/O library (
asyncio) can be
used to make a crawler yield the CPU whenever it performs a
blocking operating (such as, e.g., waiting for a peer's reply).
This way, whenever a particular crawler yields the CPU, it enables another
crawler to fill the waiting time with useful work.
Another way to improve the crawler's execution time is to
not ask every node for its peer history. The rationale behind
this optimization is that there might be a certain degree of
overlap in peers' node histories. By only asking some
percentage, p, of nodes for their peer history,
runtime can be reduced significantly because most of it is
spent waiting for
addr messages. The figure below
compares the number of discovered nodes to the crawler's
runtime for different values of p using five runs
for each setting.
While very small values of p results in a very low number of discovered active nodes (e.g., p = 2%), the data indicates that a p of around 10% is sufficient to discover all active nodes. Compared to the naive implementation that asked all nodes for its peer history, this optimizations decreases the runtime by a factor of more than three.
If you found the information in this article useful, feel free to
If you have scientific, Bitcoin-related freelance work,
let me know.