I recently authored a library to parse massive amounts of logs from strace.

One of the key functions in the library was a function to parse the ASCII data strace generates and output specific syscall types. A syscall type could be:

struct syscall_rename_t {
    std::string_view oldpath;
    std::string_view newpath;
    int result;
}

In general, it would have an interface like this:

auto parse(std::string_view strace_line);

parse would output a parsed strace line each time the parser correctly parsed a line of strace output.

The question was, what type it should return. My first attempt was to make it a variant of all possible syscalls:

using syscall_t = std::variant<syscall_rename_t, syscall_open_t, ...>;
syscall_t parse(std::string_view strace_line);

In the parser, I would have a std::unordered_map, mapping a syscall name to a handler:

const std::unordered_map<std::string, std::function<syscall_t(std::string_view)>> syscall_handlers {
    {"rename", parse_rename},
    {"open", parse_open},
    ...
}

syscall_rename_t parse_rename(std::string_view strace_line) {
   // parsing strace rename syscall ascii code to a syscall_rename_t;
   return syscall_rename;
}
...

This function could be called by the parse function.

void parse(std::string_view strace_line, const handler_t& handler) {
    const std::string syscall_name = get_syscall_name(strace_line);
    return syscall_handlers[syscall_name](strace_line);
}

In the handler, you would use get_if, or some visitor to determine which sort of syscall had been parsed.

An example of this approach can be seen here

However, there are several shortcomings to this approach.

  1. The variant for containing all syscalls becomes large
  2. The use of std::function results in an indirect function call

Templated solution

To avoid these two issues, we can make parse a template:

template<typename HandlerT>
void parse(std::string_view strace_line, HandlerT handler);

handler would be called if the data was correctly parsed. Also, a specialized handler would be instantiated for each type of syscall supported by the library. This could be handled relatively nicely by constexpr if.

The issue is now, how we call handler with the right type.

We could chain lots of ifs:

if (syscall_name == "rename") {
    handler(parse_rename(strace_line));
} else if (syscall_name == "open") {
    handler(parse_open(strace_line));
} ...

However, for more than a handful of syscalls, this becomes slow, even though compilers will try to optimize the number of comparisons needed.

Compile-time hash-map

A better solution (albeit less maintainable) is to implement a compile-time hash-map. The idea is to calculate the hash of the string, an then make a switch-case statement to jump to the correct handler for the given string:

  const auto hash = hashfunc(syscall_name);
  switch (hash) {
    case hashfunc("rename"): if (syscall_name == "rename") handler(parse_syscall_rename(strace_line)); break;
    case hashfunc("open"):   if (syscall_name == "open")   handler(parse_syscall_open  (strace_line)); break;
    ...
  }

If the range of values hash can have is limited, most compilers will compile this into something similar to computed gotos

  const auto hash = hashfunc(syscall_name);
  const auto bucket_num = hash % num_buckets;
  switch (bucket_num) {
    case hashfunc("rename") % num_buckets: if (syscall_name == "rename") handler(parse_syscall_rename(strace_line)); break;
    case hashfunc("open") % num_buckets:   if (syscall_name == "open")   handler(parse_syscall_open  (strace_line)); break;
    ...
  }

Or with some preprocessor “magic”1:

  const auto hash = hashfunc(syscall_name);
  const auto bucket_num = hash % num_buckets;
  switch (bucket_num) {
  #define STRACE_PARSER_DEFINE_BUCKET(name) case hashfunc(#name) % num_buckets: if (syscall_name == #name) handler(parse_syscall_##name(strace_line)); break
    STRACE_PARSER_DEFINE_BUCKET("rename");
    STRACE_PARSER_DEFINE_BUCKET("open");
    ...
  }

The only thing left to do now, is to find an appropriate constexpr hash function that has no collisions. The compiler will actually help us to find one, as it will complain about a duplicate case label. The hash function can be as simple as possible, as long as it doesn’t produce collisions. A simple addition of all the chars in the string might be sufficient.

The biggest issue with this implementation is that finding a hashing function that doesn’t give collisions might be tricky if the number of entries is big.

An example of this implementation can be found here.

  1. I really don’t like macros, but they do have their place in a very limited scope.