FreelistManager for BlueStore Source Analysis


BlueStore manages bare devices directly and needs to manage space allocation and release on its own.The results of Stupid and Bitmap allocators are stored in memory, and the persistence of allocation results is done through FreelistManager.

A block can be in either occupied or idle state. Only one state needs to be recorded for persistence, and another state can be deduced. BlueStore records an idle block.There are two main reasons: one is to facilitate the merging of free space when space is reclaimed; the other is that the allocated space is already recorded in Object.

FreelistManager originally had two implementations, extension and bitmap, and now defaults to a bitmap implementation, which is obsolete.Free space persisted to disk is also written through RocksDB's Batch.FreelistManager makes up blocks by a certain number of segments, one k/v key-value pair for each segment, the key is the offset of the first block in the physical address space on disk, and the value is the state of each block in the segment, that is, the bitmap consisting of 0/1, 1 is idle, and 0 is used. This unifies the allocation and recycling of space operations by XOR with 1.


<a name="chapter1"></a>Common Interface

The primary interfaces for FreelistManager are allocator and release.

virtual void allocate(
	uint64_t offset, uint64_t length,
	KeyValueDB::Transaction txn) = 0;

virtual void release(
    uint64_t offset, uint64_t length,
    KeyValueDB::Transaction txn) = 0;

<a name="chapter2"></a>Data structure

class BitmapFreelistManager : public FreelistManager {
	// rocksdb key prefix: meta_prefix is B, bitmap_prefix is b
	std::string meta_prefix, bitmap_prefix;
	// Pointer to key-value DB encapsulating the operation of rocksdb
	KeyValueDB *kvdb;
	// merge operation for rocksdb: bitwise exclusive or (xor)
	ceph::shared_ptr<KeyValueDB::MergeOperator> merge_op;
	// Lock during enumerate operation
	std::mutex lock;

	// Total device size
	uint64_t size;
	// Total number of block s on device
	uint64_t blocks;

	// Block size: bdev_block_size, default min_alloc_size
	uint64_t bytes_per_block;
	// How many block s per key, default 128
	uint64_t blocks_per_key;
	// Space size for each key
	uint64_t bytes_per_key;

	// block Mask
	uint64_t block_mask;
	// key mask
	uint64_t key_mask;

	bufferlist all_set_bl;

	// Traversing rocksdb key related members
	KeyValueDB::Iterator enumerate_p;
	uint64_t enumerate_offset;
	bufferlist enumerate_bl;
	int enumerate_bl_pos; 

<a name="chapter3"></a>Initialization

When BlueStore initializes osd, it executes mkfs, initializes FreelistManager(create/init), and subsequently mount s if the process is restarted, only inits FreelistManager.

int BlueStore::mkfs()
	r = _open_fm(true);

int BlueStore::_open_fm(bool create)
	fm = FreelistManager::create(cct, freelist_type, db, PREFIX_ALLOC);

	// First initialization, curing meta-parameters required
	if (create) {
		fm->create(bdev->get_size(), min_alloc_size, t);

	int r = fm->init(bdev->get_size());

// create cures some meta-parameters into kvdb and reads them from kvdb when init
int BitmapFreelistManager::create(uint64_t new_size, uint64_t min_alloc_size,
		KeyValueDB::Transaction txn)
	txn->set(meta_prefix, "bytes_per_block", bl); // min_alloc_size
	txn->set(meta_prefix, "blocks_per_key", bl); // 128
	txn->set(meta_prefix, "blocks", bl);
	txn->set(meta_prefix, "size", bl);

// create/init calls the following function to initialize the mask of the block/key
void BitmapFreelistManager::_init_misc()
	// 128 >> 3 = 16, one bit for each block.
	// That is, the value of a key corresponds to 128 block s, requiring 16 bytes.
	bufferptr z(blocks_per_key >> 3); 
	memset(z.c_str(), 0xff, z.length());
	// 0x FFFF FFFF FFFF F000
	block_mask = ~(bytes_per_block - 1); 
	bytes_per_key = bytes_per_block * blocks_per_key;
	// 0xFFFF FFFF FFF8 0000
	key_mask = ~(bytes_per_key - 1);

<a name="chapter4"></a>Merge

XOR Merge interface implementation:

// Inherit rocksdb merge interface: XOR operation (xor)
struct XorMergeOperator : public KeyValueDB::MergeOperator {
	 // If old_value does not exist, the new_value is assigned to rdata directly.
    void merge_nonexistent(const char *rdata, size_t rlen,
                           std::string *new_value) override {
        *new_value = std::string(rdata, rlen);
    // If old_value exists, it is bit by bit exclusive from rdata or xor.
    void merge(const char *ldata, size_t llen, const char *rdata, size_t rlen,
               std::string *new_value) override {
        assert(llen == rlen);
        *new_value = std::string(ldata, llen);
        for (size_t i = 0; i < rlen; ++i) {
            (*new_value)[i] ^= rdata[i];
    // We use each operator name and each prefix to construct the
    // overall RocksDB operator name for consistency check at open time.
    string name() const override { return "bitwise_xor"; }

XOR Merge interface applications:

bool Merge(const rocksdb::Slice& key,
	     const rocksdb::Slice* existing_value,
	     const rocksdb::Slice& value,
	     std::string* new_value,
	     rocksdb::Logger* logger) const override 
	// for default column family
	// extract prefix from key and compare against each registered merge op;
	// even though merge operator for explicit CF is included in merge_ops,
	// it won't be picked up, since it won't match.
	for (auto& p : store.merge_ops) {
		if (, p.first.length(),, p.first.length()) == 0 &&[p.first.length()] == 0) {
			// If old_value exists, merge directly, otherwise replace directly.
			if (existing_value) {
				p.second->merge(existing_value->data(), existing_value->size(),, value.size(),
			} else {
				p.second->merge_nonexistent(, value.size(), new_value);
	return true;

The Mage method of Rocksdb's Batch is finally called.Batch implements atomic operations for simple write and conditional write.

<a name="chapter5"></a>Allocate

Allocation and release of space are exactly the same operations, both calling XOR operations, and we focus on the _xor function.

void BitmapFreelistManager::allocate(uint64_t offset, uint64_t length, KeyValueDB::Transaction txn)
	_xor(offset, length, txn);

void BitmapFreelistManager::release(uint64_t offset, uint64_t length, KeyValueDB::Transaction txn)
	_xor(offset, length, txn);

void BitmapFreelistManager::_xor(uint64_t offset, uint64_t length, KeyValueDB::Transaction txn)
	// Note that offset and length are aligned at the block boundary
	uint64_t first_key = offset & key_mask;
	uint64_t last_key = (offset + length - 1) & key_mask;

	if (first_key == last_key) { // The simplest case, which corresponds to a segment
		bufferptr p(blocks_per_key >> 3); // 16 byte buffer; // Set to All 0
		unsigned s = (offset & ~key_mask) / bytes_per_block; // Number of start block in paragraph
		unsigned e = ((offset + length - 1) & ~key_mask) / bytes_per_block; // Number of end block in paragraph

		for (unsigned i = s; i <= e; ++i) { // Generate mask for this operation
			p[i >> 3] ^= 1ull << (i & 7); // I >> 3 locates the bytes of the corresponding bit for the block, 1ull<< (i&7) locates the bit, and then XOR sets the bit to bit 1

		string k;
		make_offset_key(first_key, &k); // Convert memory content to 16-digit characters
		bufferlist bl;
		bl.hexdump(*_dout, false);
		txn->merge(bitmap_prefix, k, bl); // XOR with current value

	} else { // For multiple segments, process the first segment, the middle segment, and the last segment, the first and last segments as before

		// First paragraph
			// Similar to the above

			// Add key to position next paragraph
			first_key += bytes_per_key;

		// Middle segment, where the mask is all 1, so use all_set_bl
		while (first_key < last_key) {
			string k;
			make_offset_key(first_key, &k);
			all_set_bl.hexdump(*_dout, false);

			txn->merge(bitmap_prefix, k, all_set_bl); // XOR with current value

			// Add key to position next paragraph
			first_key += bytes_per_key;

		// Last paragraph
			// Similar to previous operations

The xor function seems complex and is full of bitwise operations. After careful analysis, allocation and release operations are the same as xor of segment bits and current values.A segment corresponds to a set of blocks, 128 by default, and a set of values in k/v.For example, when disk space is completely free, the k/v status is as follows: (b00000000, 0x00), (b00001000, 0x00), (b00002000, 0x00)...B is the prefix of key and represents bitmap.

<a name="chapter6"></a>Release

Releasing space is the same as allocating space.

void BitmapFreelistManager::release(uint64_t offset, uint64_t length, KeyValueDB::Transaction txn)
	_xor(offset, length, txn);

Author: Sminga

Tags: Ceph github osd

Posted on Mon, 10 Feb 2020 18:57:51 -0800 by banjax