What exactly is a Bloom filter? This article makes it clear

I don't know when it started, but the once-obscure Bloom filter has built up quite a reputation. When an interviewer asks how to avoid cache penetration, your first reaction is probably "Bloom filter". "Cache penetration = Bloom filter" has become the standard pairing, yet many people can't say what a Bloom filter actually is or how to use one. Let's clear that up today.

Cache Penetration


Look at this picture: a user issues a query with a bad condition. The key does not exist in redis, so following the normal flow we go to the database, but since the condition is bad, the database doesn't have it either. No value is written back to redis, and an empty result is returned to the user. Once or twice is fine, but if it happens many times, redis, which we put in front precisely to shield the database and relieve its load, becomes a fiction: every request goes straight through to the database. This is called cache penetration; it is as if redis did not exist and had been punched through. To handle this case, we can cache an empty string, or a special sentinel string such as "&&", in redis. The next time we query redis and get an empty value or "&&", we know the value is not in the database and skip the database query. ps: when caching these "does not exist" keys you must set an expiration time, otherwise once the record is later added to the database, the cache and the database will be inconsistent.
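As a rough sketch of the cache-empty-value approach, here is a minimal illustration. Plain HashMaps stand in for redis and the database, and the class, field, and method names (`NullCacheDemo`, `CACHE_NULL`, `query`) are all hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class NullCacheDemo {
    // Sentinel stored in the cache when the database has no row for the key
    static final String CACHE_NULL = "&&";

    static Map<String, String> cache = new HashMap<>(); // stands in for redis
    static Map<String, String> db = new HashMap<>();    // stands in for the database

    static String query(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            // Sentinel hit: we already know the database has nothing, skip it
            return CACHE_NULL.equals(cached) ? null : cached;
        }
        String fromDb = db.get(key);
        // Cache the sentinel on a miss (in real redis, set it with a short TTL,
        // e.g. SET key && EX 60, so a later insert into the database is
        // eventually picked up and cache/db stay consistent)
        cache.put(key, fromDb == null ? CACHE_NULL : fromDb);
        return fromDb;
    }

    public static void main(String[] args) {
        db.put("111", "alice");
        System.out.println(query("111"));  // prints "alice": db hit, then cached
        System.out.println(query("-100")); // prints "null": miss, sentinel cached
        System.out.println(query("-100")); // prints "null": answered from sentinel, db untouched
    }
}
```

The expiration on the sentinel is the important part; without it, the cached "&&" would shadow a row added to the database later.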

That covers the case where the same non-existent value is queried repeatedly. What if each query uses a different non-existent value? Caching a special string for each one is useless, because the key is different every time. For example, our database user ids are 111, 112, 113, 114, increasing in turn, but someone maliciously queries with garbage keys like -100, -936, -545. Neither redis nor the database has these values, and since the attacker changes the key on every request, caching doesn't help. The load on the database becomes very high, which is far scarier than the previous case. What can we do? This is where our protagonist, the Bloom filter, takes the stage.

Start with an interview question

Question: how can we quickly tell whether an element exists among a huge number of elements (say, 1 billion unordered, variable-length, non-repeating elements)? The easiest answer is to put all the data into a structure like a List, Map, or Tree and look it up, e.g. with map.get(). But assuming on the order of 1 KB per element, 1 billion elements would take about 900 GB of memory, which is unacceptable for an ordinary server, and it is not the answer the interviewer wants to hear. We need a cleverer, more economical way to solve this. That introduces a space-saving data structure: the bitmap, an ordered array holding only two values, 0 and 1, where 0 means absent and 1 means present.
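To make the bitmap concrete, here is a minimal sketch with java.util.BitSet (the sizes and ids below are illustrative assumptions, not from the article):

```java
import java.util.BitSet;

public class BitmapDemo {
    public static void main(String[] args) {
        // A BitSet stores one bit per position: n bits cost about n/8 bytes,
        // so even 1 billion positions would be roughly 120 MB, not hundreds of GB
        BitSet bitmap = new BitSet(1_000_000);

        // Mark ids 111..114 as present (their bits become 1)
        for (int id : new int[]{111, 112, 113, 114}) {
            bitmap.set(id);
        }

        System.out.println(bitmap.get(113)); // true  -> present
        System.out.println(bitmap.get(999)); // false -> absent
    }
}
```

This only works directly when elements are small integers; the next step is mapping arbitrary elements to positions.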


Now we need a mapping relationship: for any element we must know which position to look at, and then check whether that position is 0 or 1. This is solved with a hash function. A hash function has two useful properties: first, whatever the length of the input, the length of the output is fixed; second, its output is evenly distributed. MD5 and SHA-1 are common examples of hash algorithms.
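A minimal sketch of that mapping step, hashing an arbitrary-length key into a fixed-size bitmap slot (String.hashCode is used only for illustration; real filters use stronger, better-distributed hashes such as murmur, and the names and sizes here are hypothetical):

```java
import java.util.BitSet;

public class HashToBitmap {
    static final int SIZE = 1 << 16;        // bitmap capacity (illustrative)
    static final BitSet bitmap = new BitSet(SIZE);

    // Map an input of any length to one slot in [0, SIZE)
    static int slot(String key) {
        return (key.hashCode() & 0x7fffffff) % SIZE; // mask off the sign bit
    }

    static void add(String key) {
        bitmap.set(slot(key));
    }

    static boolean mightContain(String key) {
        return bitmap.get(slot(key));
    }

    public static void main(String[] args) {
        add("user:111");
        System.out.println(mightContain("user:111")); // true
        System.out.println(mightContain("user:-100")); // false, unless it happens to collide
    }
}
```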


By computing the hash function we can check whether an element exists. Look at the red line: the hash values produced for 24 and 147 are the same. This is called a hash conflict, or hash collision. Hash collisions are unavoidable; all we can do is reduce their probability. The first option is to lengthen the bit array, i.e. increase the bitmap capacity. Because our hash functions distribute values evenly, the larger the bitmap capacity, the lower the probability of a collision at the same position. But a bigger bitmap means more memory consumption, so we want another approach too. The second option is to compute with more hash functions. Sure, 24 and 147 collide after one computation, but if they still coincide after 5, 10, or 100 different computations, that is real fate. However, more hash functions is not always better: they fill up the bitmap quickly and each one takes time to compute, so we need to find a balance between time and space.
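The k-hash-function idea is often implemented by deriving k indexes from a single 64-bit hash via `h1 + i * h2`, the same double-hashing trick used by the redis helper later in this article. A sketch, with a simplified mixing function and illustrative sizes as assumptions:

```java
import java.util.BitSet;

public class KHashBitmap {
    static final int BITS = 1 << 20;  // bitmap capacity (illustrative)
    static final int K = 3;           // number of hash functions (illustrative)
    static final BitSet bitmap = new BitSet(BITS);

    // Derive K indexes from two 32-bit halves of one 64-bit hash:
    // index_i = h1 + i * h2 behaves close to K independent hash functions
    static int[] indexes(String key) {
        long hash64 = mix(key);
        int hash1 = (int) hash64;
        int hash2 = (int) (hash64 >>> 32);
        int[] idx = new int[K];
        for (int i = 1; i <= K; i++) {
            int next = hash1 + i * hash2;
            if (next < 0) {
                next = ~next;             // keep the index non-negative
            }
            idx[i - 1] = next % BITS;
        }
        return idx;
    }

    // Cheap 64-bit mix of the key, a stand-in for murmur3 128
    static long mix(String key) {
        long h = 1125899906842597L;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);
        }
        return h;
    }

    static void add(String key) {
        for (int i : indexes(key)) bitmap.set(i);
    }

    static boolean mightContain(String key) {
        for (int i : indexes(key)) {
            if (!bitmap.get(i)) return false; // one 0 bit rules the key out
        }
        return true;
    }

    public static void main(String[] args) {
        add("24");
        System.out.println(mightContain("24")); // true: all K bits are set
        // a key colliding on just one hash no longer passes: all K bits must match
    }
}
```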

Bloom filter

Of course, this has been studied before. In 1970, Burton Howard Bloom studied exactly this problem of membership testing over large sets: how much bitmap capacity is needed, and how many hash functions. He published a paper, and the container he proposed is called the Bloom filter.
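For reference, the sizing that came out of that analysis: given n elements and a target false positive probability p, the optimal bitmap size m and hash function count k are given below (these are the same formulas implemented by optimalNumOfBits and optimalNumOfHashFunctions in the helper class later in this article):

```latex
m = -\frac{n \ln p}{(\ln 2)^2}
\qquad
k = \frac{m}{n} \ln 2
\qquad
p \approx \left(1 - e^{-kn/m}\right)^{k}
```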


Let's take a look at this diagram. The set contains three elements. To store element a, we run it through three hash functions, f1(a), f2(a), f3(a), and set the three corresponding positions to 1; elements b and c are likewise stored at the positions computed by the three functions. To look up a, we compute f1(a) and find that position is 1; the second position is 1, and the third position is also 1, so we say a exists in the Bloom filter. No problem. Similarly, look at element d below: its three computed positions are also all 1. Can we say d exists in the Bloom filter? Obviously not. Looking carefully, those three 1s were actually set by f1(a), f1(b), and f2(c), not by storing d itself. This is the result of hash collisions. When an element not in the Bloom filter is mistaken for present, we call that a false positive, measured by the false positive probability (FPP).

Now look at another element, e. To decide whether it is in the container, we again compute the three functions. The first position is 1, the second position is 1, but the third position is 0. Can we draw a conclusion about e? Yes: e definitely does not exist. If e had been stored, all three positions would have been set to 1 when it was saved; finding even one 0 proves it was never stored. From the illustration above, we draw two important conclusions.

From the container's point of view:

If the Bloom filter determines that an element exists in the set, it does not necessarily exist

If the Bloom filter judges that an element does not exist, it definitely does not exist

From an element perspective:

If the element actually exists, the Bloom filter must determine it exists

If the element does not actually exist, the Bloom filter may determine it exists

Remember, guys

Guava Implements Bloom Filter

Why does Java have so many developers and such a large base? Because it is open source, embraces open source, and has many frameworks; there is more than one wheel for any given job. Take serialization: fastjson, jackson, gson, pick whichever you like. The Bloom filter wheel is guava, provided by Google. Let's see how to use it in code.

First, introduce the dependency

      <dependency>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
          <version>21.0</version>
      </dependency>

Here we store 1 million elements in the Bloom filter, then test 100 elements that really exist and 9,900 that do not, checking correct hits and false positives respectively.

    //How much data to insert
    private static final int insertions = 1000000;

    //Expected false positive probability
    private static double fpp = 0.02;

    public static void main(String[] args) {

        //Initialize a BloomFilter storing string data; if fpp is not given, the default error rate is 0.03
        BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), insertions, fpp);

        //Stores all the actual keys, used to check whether a key really exists
        Set<String> sets = new HashSet<String>(insertions);

        //Stores all the actual keys, used to fetch them by index
        List<String> lists = new ArrayList<String>(insertions);

        //Insert random strings
        for (int i = 0; i < insertions; i++) {
            String uuid = UUID.randomUUID().toString();
            bf.put(uuid);
            sets.add(uuid);
            lists.add(uuid);
        }

        int rightNum = 0;
        int wrongNum = 0;

        for (int i = 0; i < 10000; i++) {
            //Between 0 and 10000 there are 100 numbers divisible by 100 (the multiples of 100)
            String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();

            //The filter may be wrong, so when it says a key exists, double-check against the sets
            if (bf.mightContain(data)) {
                if (sets.contains(data)) {
                    rightNum++;
                    continue;
                }
                wrongNum++;
            }
        }

        BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
        System.out.println("Of the 100 elements that really are among the 1,000,000, the Bloom filter judged " + rightNum + " to exist");
        System.out.println("Of the 9,900 elements that are not among the 1,000,000, " + wrongNum + " were mistaken for existing; hit ratio: " + bingo + ", false positive rate: " + percent);
    }

The final result


We can see this confirms the conclusions above: the 100 real elements are all judged present by the Bloom filter, while of the other 9,900 non-existent elements, the Bloom filter still judged 216 to exist. Those are false positives, for the reason explained earlier. So the Bloom filter is not omnipotent, but it can block the vast majority of non-existent keys for us, which takes a great deal of pressure off the database. Note also that the false positive rate of 0.02 is what we set when initializing the Bloom filter; if not set, it defaults to 0.03, and when you set it yourself you must not set it to 0!

Redis Implements Bloom Filters

The guava implementation above keeps its data in local memory, but our projects are often distributed. We can instead keep the data in redis and use redis to implement the Bloom filter. That requires designing the mapping functions and sizing the bit vector ourselves. The code is below; it has been tested and you can use it directly.

/**
 * Bloom Filter Core Class
 *
 * @param <T>
 * @author jack xu
 */
public class BloomFilterHelper<T> {
    private int numHashFunctions;
    private int bitSize;
    private Funnel<T> funnel;

    public BloomFilterHelper(int expectedInsertions) {
        this.funnel = (Funnel<T>) Funnels.stringFunnel(Charset.defaultCharset());
        bitSize = optimalNumOfBits(expectedInsertions, 0.03);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
    }

    public BloomFilterHelper(Funnel<T> funnel, int expectedInsertions, double fpp) {
        this.funnel = funnel;
        bitSize = optimalNumOfBits(expectedInsertions, fpp);
        numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, bitSize);
    }

    public int[] murmurHashOffset(T value) {
        int[] offset = new int[numHashFunctions];

        long hash64 = Hashing.murmur3_128().hashObject(value, funnel).asLong();
        int hash1 = (int) hash64;
        int hash2 = (int) (hash64 >>> 32);
        for (int i = 1; i <= numHashFunctions; i++) {
            int nextHash = hash1 + i * hash2;
            if (nextHash < 0) {
                nextHash = ~nextHash;
            }
            offset[i - 1] = nextHash % bitSize;
        }

        return offset;
    }

    /**
     * Calculate the length of the bit array
     */
    private int optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (int) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /**
     * Calculate hash method execution times
     */
    private int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }
}

This works with the bitmap of redis. You probably know only the five classic redis data types: string, list, hash, set, zset, and have never heard of bitmap. That's fine; you can call it a new data type, or not, because under the hood it is still a string. I will write a separate article later on the redis data types and their real-world use cases.

/**
 * redis Operating Bloom Filters
 *
 * @param <T>
 * @author xhj
 */
public class RedisBloomFilter<T> {
    @Autowired
    private RedisTemplate redisTemplate;

    /**
     * Delete Cached KEY
     *
     * @param key KEY
     */
    public void delete(String key) {
        redisTemplate.delete(key);
    }

    /**
     * Add one value to the given Bloom filter. Performance is poor for batch additions
     *
     * @param bloomFilterHelper Bloom Filter Object
     * @param key               KEY
     * @param value             value
     * @param <T>               Generic, can pass in any type of value
     */
    public <T> void add(BloomFilterHelper<T> bloomFilterHelper, String key, T value) {
        int[] offset = bloomFilterHelper.murmurHashOffset(value);
        for (int i : offset) {
            redisTemplate.opsForValue().setBit(key, i, true);
        }
    }

    /**
     * Add a batch of values to the given Bloom filter. Batch performance is good because pipeline mode is used (in a cluster, use an optimized RedisPipeline operation)
     *
     * @param bloomFilterHelper Bloom Filter Object
     * @param key               KEY
     * @param valueList         Value, list
     * @param <T>               Generic, can pass in any type of value
     */
    public <T> void addList(BloomFilterHelper<T> bloomFilterHelper, String key, List<T> valueList) {
        redisTemplate.executePipelined(new RedisCallback<Long>() {
            @Override
            public Long doInRedis(RedisConnection connection) throws DataAccessException {
                connection.openPipeline();
                for (T value : valueList) {
                    int[] offset = bloomFilterHelper.murmurHashOffset(value);
                    for (int i : offset) {
                        connection.setBit(key.getBytes(), i, true);
                    }
                }
                return null;
            }
        });
    }

    /**
     * Determine whether a value exists based on a given Bloom filter
     *
     * @param bloomFilterHelper Bloom Filter Object
     * @param key               KEY
     * @param value             value
     * @param <T>               Generic, can pass in any type of value
     * @return whether the value may exist
     */
    public <T> boolean contains(BloomFilterHelper<T> bloomFilterHelper, String key, T value) {
        int[] offset = bloomFilterHelper.murmurHashOffset(value);
        for (int i : offset) {
            if (!redisTemplate.opsForValue().getBit(key, i)) {
                return false;
            }
        }
        return true;
    }
}

Finally, the test class

    public static void main(String[] args) {
        RedisBloomFilter redisBloomFilter = new RedisBloomFilter();
        int expectedInsertions = 1000;
        double fpp = 0.1;
        redisBloomFilter.delete("bloom");
        BloomFilterHelper<CharSequence> bloomFilterHelper = new BloomFilterHelper<>(Funnels.stringFunnel(Charset.defaultCharset()), expectedInsertions, fpp);
        int j = 0;
        //Add 100 elements
        List<String> valueList = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            valueList.add(i + "");
        }
        long beginTime = System.currentTimeMillis();
        redisBloomFilter.addList(bloomFilterHelper, "bloom", valueList);
        long costMs = System.currentTimeMillis() - beginTime;
        log.info("Bloom filter added {} values, time taken: {}ms", 100, costMs);
        //Reset the timer so the verification time does not include the add time
        beginTime = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            boolean result = redisBloomFilter.contains(bloomFilterHelper, "bloom", i + "");
            if (!result) {
                j++;
            }
        }
        log.info("{} values missed; verification took: {}ms", j, System.currentTimeMillis() - beginTime);
    }

Note that addList is used here, which uses the pipelined pipeline underneath. The add method, with one setBit per iteration of a for loop underneath, is slower, but it returns a value so you know whether each insert succeeded; with pipelining you don't. Which method to choose depends on your business scenario and how fast inserts need to be.

Where the Bloom filter sits in the workflow

Step one: load all the data from the database into the Bloom filter. Step two: when a request arrives, query the Bloom filter first. Step three: if the filter says the key does not exist, return directly; if it says the key may exist, continue with the normal flow. ps: note that guava's filter only provides a put method for loading data. Think about how deletion and modification of data would be handled in a Bloom filter, and why there is no delete method.
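The three steps above can be sketched as follows. This is a minimal sketch: a HashSet stands in for the real Bloom filter, HashMaps stand in for redis and the database, and all names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BloomGuardedLookup {
    static final Set<String> bloomFilter = new HashSet<>(); // stand-in for a real Bloom filter
    static final Map<String, String> redis = new HashMap<>();
    static final Map<String, String> db = new HashMap<>();

    static String query(String key) {
        // Steps two and three: if the filter says "no", the key is definitely not in the db
        if (!bloomFilter.contains(key)) {
            return null;                  // return directly, no redis or db hit
        }
        // Filter says "maybe": fall through to the normal cache-then-db flow
        String cached = redis.get(key);
        if (cached != null) {
            return cached;
        }
        String fromDb = db.get(key);
        if (fromDb != null) {
            redis.put(key, fromDb);
        }
        return fromDb;
    }

    public static void main(String[] args) {
        // Step one: preload every key the database actually holds
        db.put("111", "alice");
        bloomFilter.add("111");

        System.out.println(query("111"));  // prints "alice": found via db, then cached
        System.out.println(query("-100")); // prints "null": rejected by the filter
    }
}
```

With a real Bloom filter in place of the HashSet, the "no" answer is still exact, while the "maybe" answer occasionally lets a non-existent key through, which is exactly the false-positive trade-off discussed above.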


Other scenarios for Bloom filters

  • Web crawlers deduplicate URLs to avoid crawling the same address twice;

  • Anti-spam: checking whether a mailbox appears in a spam list of billions of addresses;

  • Google Chrome uses Bloom filters to identify malicious URLs;

  • Medium uses a Bloom filter to avoid recommending articles a user has already read;

  • Google BigTable, Apache HBase, and Apache Cassandra use Bloom filters to reduce lookups for non-existent rows and columns.

Okay, this is the end of the Bloom filter. Next time an interviewer asks how to deal with cache penetration, I believe you will be able to answer. It's easy to understand, and it can be used at work too; for example, in an authentication service, when a user logs in you can check the Bloom filter first instead of going straight to redis and the database.




Posted on Wed, 13 May 2020 09:59:45 -0700 by broann