Daniel Lemire's blog

Fast Buffer-to-String conversion in JavaScript with a Lookup Table

, 7 min read

When programming in a JavaScript environment such as Node.js, you might recover raw data from the network and need to convert the bytes into strings. In a system such as Node.js, you may represent such raw bytes using a Buffer instance. You can conveniently convert a Buffer instance into a…

How fast can you validate UTF-8 strings in JavaScript?

, 1 min read

When you recover textual content from the disk or from the network, you may expect it to be a Unicode string in UTF-8. It is the most common format. Unfortunately, not all sequences of bytes are valid UTF-8 and accepting invalid UTF-8 without validating it is a security risk. How might you validate…

A simple WebSocket benchmark in Python

, 2 min read

Modern web applications often use the http/https protocols. However, when the server and client needs to talk to each other in a symmetrical fashion, the WebSocket protocol might be preferable. For example, if you want to program a multiplayer video game, the WebSocket protocol is almost surely…

Parsing 8-bit integers quickly

, 8 min read

Suppose that you want to parse quickly 8-bit integers (0, 1, 2, …, 254, 255) from an ASCII/UTF-8 string. The problem comes up in the simdzone project lead by Jeroen Koekkoek (NLnet Labs). You are given a string and its length: e.g., ’22’ and length is 2. The naive approach in C might be: int…

A simple WebSocket benchmark in JavaScript: Node.js versus Bun

, 3 min read

Conventional web applications use the http protocol (or the https variant). The http protocol is essentially asymmetrical: a client application such as a browser issues requests and the server responds. It is not possible for the server to initiate communication with the client. Certain types of…

Science and Technology links (November 12 2023)

, 2 min read

Vitamin K2 supplements might reduce the risk of myocardial infarction (heart attacks) and of all-cause death (Hasific et al. 2022). You find vitamin K2 in some Gouda cheeses and in eggs. Most of the water on Earth is salinated (in the oceans) and cannot be consumed. Fresh water is often scarce.…

Generating arrays at compile-time in C++ with lambdas

, 2 min read

Suppose that you want to check whether a character in C++ belongs to a fixed set, such as ‘\0’, ‘\x09’, ‘\x0a’,’\x0d’, ‘ ‘, ‘#’, ‘/’, ‘:’, ‘<‘, ‘>’, ‘?’, ‘@’, ‘[‘, ‘\’, ‘]’, ‘^’, ‘|’. A simple way is to generate a 256-byte…

Appending to an std::string character-by-character: how does the capacity grow?

, 3 min read

In C++, suppose that you append to a string one character at a time: while(my_string.size() <= 10'000'000) { my_string += "a"; } In theory, it might be possible for the C++ runtime library to implement this routine as the creation of a new string with each append: it could allocate…

For processing strings, streams in C++ can be slow

, 2 min read

The C++ library has long been organized around stream classes, at least when it comes to reading and parsing strings. But streams can be surprisingly slow. For example, if you want to parse numbers, then this C++ routine is close to being the worst possible choice for performance: std::stringstream…

How many billions of transistors in your iPhone processor?

, 1 min read

In about 10 years, Apple has multiplied by 19 the number of transistors in its mobile processors. It corresponds roughly to a steady rate of improvement of 34% per year on the number of transistors, or a doubling every 2.5 years. In real dollars, an iPhone has roughly a constant price: the price…

Randomness in programming (with Go code)

, 24 min read

Computer software is typically deterministic on paper: if you run twice the same program with the same inputs, you should get the same outputs. In practice, the complexity of modern computing makes it unlikely that you could ever run twice the same program and get exactly the same result, down to…

Web server `hello world´ benchmark : Go vs Node.js vs Nim vs Bun

, 5 min read

The Web is a convenient interface to your software. Many times, if you have an existing application, you may want to allow Web access to it using HTTP. Or you may want to build a small specialized Web application. In such instances, you do not want to use an actual Web server (e.g., Apache or…

Parsing integers quickly with AVX-512

, 3 min read

If I give a programmer a string such as "9223372036854775808" and I ask them to convert it to an integer, they might do the following in C++: std::string s = .... uint64_t val; auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), val); if (ec != std::errc()) {} // I have an…

Transcoding Unicode strings at crazy speeds with AVX-512

, 5 min read

In software, we store strings of text as arrays of bytes in memory using one of the Unicode Transformation Formats (UTF), the most popular being UTF-8 and UTF-16. Windows, Java, C# and other systems common languages and systems default on UTF-16, whereas other systems and most of the web relies on…

Locating `identifiers´ quickly (ARM NEON edition)

, 3 min read

A common problem in parsing is that you want to find all identifiers (e.g., variable names, function names) in a document quickly. There are typically some fixed rules. For example, it is common to allow ASCII letters and digits as well as characters like ‘_’ in the identifier, but to forbid…

Science and Technology links (September 2 2023)

, 4 min read

Physicists have a published a paper with 5154 authors. The list of authors takes 24 pages out of the 33 pages. The lesson is that if someone tell you that they have published an important paper, you should ask how many authors there were and what their exact role was. Vegatarians are at higher…

Transcoding Latin 1 strings to UTF-8 strings at 18 GB/s using AVX-512

, 5 min read

Though most strings online today follow the Unicode standard (e.g., using UTF-8), the Latin 1 standard is still in widespread inside some systems (such as browsers) as JavaScript strings are often stored as either Latin 1, UTF-8 or UTF-16 internally. Latin 1 captures the first 256 characters from…

How accurate is the birthday´s paradox formula?

, 1 min read

Given a set of r random values from a large set (of size N), I have been using the formula 1-exp(-r**2/(2N)) to approximate the probability of a collision. It assumes that r is much smaller than N. The formula suggests that if you have hundreds of millions of random 64-bit numbers, you will start…