BIG: C++ Strategies, Data Structures, and Algorithms aimed at Scalability

Abstract

Current trends suggest that horizontally-scaled, highly distributed, service-oriented architectures—of the kind used by Google, Facebook, Yahoo, and others—will become increasingly prevalent. Today’s advanced knowledge of scalability of distributed systems will most likely become tomorrow’s elementary topics. Such knowledge is likely of particular interest to C++ developers because—unlike in recent history—use of an efficient system-level language with strong modeling capabilities will translate directly to a good scaling factor (e.g. users per server and functionality per user) or, conversely, dramatic savings (in servers count, power consumed, maintenance, and more). The free lunch is over, so we better figure what’s for dinner.

Life in a (server) farm is harsh. You need to use quite a few non-intuitive strategies a C++ programming style because minute choices may cause large visible differences. For example, under certain circumstances, if you need a hash, you can’t get away with a map without ruining your performance. Several less known data structures and algorithms are necessary for coping.

This talk explores a dictionaries, exotic-yet-useful fast set intersection algorithms (“galloping search” is not only about Black Friday!), inverted indexes, and more.

Highlights

  • Discussion of dictionaries as a key data structure in large-scale systems
  • Comparison and contrast of std::map, std::unordered_map, and nonstandard dictionaries
  • An overview of large set intersection algorithms, which are at the core of many distributed database, searching, and machine learning systems
  • A foray into a distributed inverted index query engine modeled after Google 

Attendee Profile

C++ knowledge is assumed. Understanding of set-oriented algorithms (such as the selection problem, set intersection, or heap organization) is helpful but not assumed.

Outline

  • Dictionaries: std::map, std::unordered_map, and others
    • Introduction
    • Common knowledge
    • Search strategy
    • Integral vs. string keys
    • Small vs. large loads
    • Key distribution
    • Alternative structures
    • std::map affix scrambling tip
    • Takeaways
  • Large set intersection
    • Motivation
    • Basics
    • Implementation
    • Attempt at improvement
    • Solution
    • Scaling up
    • SvS
  • And now for something completely similar
    • Answering queries
    • Documents and terms
    • Distributing data
    • Document- vs. term-partitioned index
    • Local vs. distributed
  • Conclusion