Last updated: 17/8/2022

Spooky Memchr

super fast memchr implementation

the original implementation of memchr doesn't fully utilise the cpu, to find a match of the character in the given memory range, therefore there is a new more powerfull version that make use of the long register size to handle 8 bytes each time instead of character by character, the code is depicted from linux kernel, written by Yu-Jen Chang.


#if defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER) && BITS_PER_LONG == 64

#define MEMCHR_MASK_GEN(mask) (mask *= 0x0101010101010101ULL)

#elif defined(CONFIG_ARCH_HAS_FAST_MULTIPLIER)

#define MEMCHR_MASK_GEN(mask)                                                	\
	do {                                                                   		\
		mask *= 0x01010101;                                                     \
		mask |= mask << 32;                                                     \
	} while (0)

#else

#define MEMCHR_MASK_GEN(mask)                                                  \
	do {                                                                       \
		mask |= mask << 8;                                                     \
		mask |= mask << 16;                                                    \
		mask |= mask << 32;                                                    \
	} while (0)

#endif

/**
  * memchr - Find a character in an area of memory.
  * @p: The memory area
  * @c: The byte to search for
  * @length: The size of the area.
  *
  * returns the address of the first occurrence of @c, or %NULL
  * if @c is not found
  */

void *memchr(const void *p, int c, unsigned long length)
{
	u64 mask, val;
	const void *end = p + length;

	c &= 0xff;
	if (p <= end - 8) {
		mask = c;
		MEMCHR_MASK_GEN(mask);

		for (; p <= end - 8; p += 8) {
			val = *(u64 *)p ^ mask;
			if ((val + 0xfefefefefefefeffu) &
			    (~val & 0x8080808080808080u))
				break;
 		}
 	}

	for (; p < end; p++)
		if (*(unsigned char *)p == c)
			return (void *)p;

 	return NULL;
 }

1. Preliminaries

the first condition checks the size of the memory buffer, the algorithm is effective only if the buffer is bigger than 8, otherwise we just look for the character in the conventional way.

then we process the buffer word(8 bytes) by word.

2. Generate the mask

what MEMCHR_MASK_GEN macro basically does is take a character and duplicate it on all bytes in the word, for example if the character is 0x42, the generated mask will be 0x4242424242424242.

3. Apply the mask

after generating the mask, we apply it on each word of the buffer val = *(u64 *)p ^ mask;, the intention behind this is to zero out each byte that contains the target character, because as we know if you xor same values you get 0. then store the result in val.

4. Magic

the punch line is: if ((val + 0xfefefefefefefeffu) & (~val & 0x8080808080808080u)) break;, let's break it down:

val + 0xfefefefefefefeffu: what it does is basically looking for bytes that contains 0's. if a byte is 0 in val after the addition it will have 0xff, other values will all have values less than that.
~val & 0x8080808080808080u: this is the second half of the equation, here we basically zero out any byte that contains a number bigger than 0x7f, so we will be lefted with the range [0, 0x7f].

combining the two, the byte that will be kept are those having top bit set(1, [0, 0x7f]) and [0x81, 0xff] + 0 (2), so the only number that belongs to the 2 ranges is 0 in summary the result will be truthy if the target character is in the word(val), and falsy otherwise.

this algorithm is not designed to find where character is located exactly, but instead just tells if there is one or not. so it's basically makes the search a lot faster by scaning the memory word by word instead of character by character, the effect becomes apparent when the character is located in the end of a long buffer or if it's not there at all, the algorithm delivers ~4 times the speed of the old one in this case.

5. Find the match

if it turns out there is a match in the word, we break from the loop, and do the manual scaning afterward. this apparently is the same as the old algorithm, but it gets executed only after we make sure there is match.

Resources:

the patch: https://lore.kernel.org/lkml/[email protected]/T/