
On the Path to Personalization

NOVEMBER 15, 2013, 1:39 PM


This is the first in a series of posts exploring the underlying technology of our new recommendations engine.
Part of the web experience in the Age of the Smartphone is the expectation that digital interactions will be tailored to personal tastes and preferences. There is an emerging effort to personalize presentation on the web in order to maximize relevance to users in ways that aren’t invasive. eBay’s “Feed” and the Prismatic reader app (both launched in 2012) are exemplars of applications that strike a nice balance between personalization and consistent presentation. Larger efforts, such as the Netflix Prize, have popularized machine learning and broadened interest in its technical elements beyond engineers.
We have long wanted to expand our efforts on this front, so I’m happy to introduce our new recommendations platform, which we rolled out over the past few weeks. This release has focused on three main themes:
  • Real-time: New recommendations are created almost instantly while a user browses the news and reads articles.
  • Sensible, fast data model: We now store data in a manner that can be queried quickly for each user.
  • Flexibility: Algorithmic processing is separated from functional logic so that new algorithms can be incorporated on the fly.
Let’s explore some of the new technologies that allowed us to achieve these goals.
Real-time
From the very start, the web has been changing and adjusting in near real-time. A user’s click or form submission has always contributed to a massive flow of real-time data, even in the early days.
What’s different today is the pace at which this flow can be used. Aggregation technology has improved significantly, allowing the computation of massive amounts of streaming information to fuel the modern web apps we all enjoy. What once was just banal clicks has become a plethora of tweets, shares, likes and a laundry list of other cutely named features that grows by the day. We’re all familiar with these kinds of applications.
Until real-time processing became practical, much of this kind of work had to be performed offline in batches using MapReduce. This was an integral part of our previous recommendations engine. A list of the articles users had read was batched together every 15 minutes, chunked into a consumable format and then processed against a set of articles that were candidates for the recommendations list. The process was computationally expensive. It was also somewhat brittle: we relied on the job running at regular intervals and had to monitor it closely. If, for any reason, a scheduled run failed, an extensive backfilling process was required. The result looked pretty dated alongside the up-to-the-minute flow of news on our homepage.
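To make that old batch flow concrete, here is a deliberately simplified MapReduce sketch that counts reads per user and article over one 15-minute batch. The input layout and class names are hypothetical, and our actual job did considerably more than count, but the shape of the work is the same.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ArticleReadCounts {

  // Input lines are assumed to look like: userIdHash \t articleId \t timestamp
  public static class ReadMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) {
        return; // skip malformed lines
      }
      // Emit (user, article) so the reducer can count reads per pair.
      outKey.set(fields[0] + ":" + fields[1]);
      context.write(outKey, ONE);
    }
  }

  public static class ReadReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "article-read-counts");
    job.setJarByClass(ArticleReadCounts.class);
    job.setMapperClass(ReadMapper.class);
    job.setCombinerClass(ReadReducer.class);
    job.setReducerClass(ReadReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the 15-minute batch of reads
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // aggregates to post-process
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}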
Achieving real-time
Real-time computation leaves very tight margins on data latency. Performing a synchronous action every time an article was read was simply out of the question: emitting signals for immediate processing would never scale to the traffic levels we typically expect, let alone the occasional spikes when news breaks. We needed a solution built on top of a messaging queue that supported “fire and forget” backfilling. This way, activity could flow in at its typical pace and be scheduled for processing as soon as possible, keeping our output close to the pulse of the news.
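As a rough illustration of the “fire and forget” idea, here is a minimal sketch that uses an in-process queue from the JDK in place of our actual message broker; the ReadEvent and ActivityPublisher names are hypothetical. The point is that the request path hands the event off and never waits on processing.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch: in production this hand-off targets a real message queue,
// not an in-process buffer; what matters is that the caller never blocks.
public class ActivityPublisher {

  private final BlockingQueue<ReadEvent> buffer = new ArrayBlockingQueue<>(100_000);

  /** Called from the request path when an article is read. Returns immediately. */
  public void publish(ReadEvent event) {
    // offer() never blocks; if the buffer is full we drop the event rather than
    // slow down the page, and downstream aggregation catches up later.
    buffer.offer(event);
  }

  /** Drained by a background consumer thread. */
  public ReadEvent take() throws InterruptedException {
    return buffer.take();
  }

  public static class ReadEvent {
    public final String userIdHash;
    public final String assetId;
    public final long timestampMillis;

    public ReadEvent(String userIdHash, String assetId, long timestampMillis) {
      this.userIdHash = userIdHash;
      this.assetId = assetId;
      this.timestampMillis = timestampMillis;
    }
  }
}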
Our efforts spread across two teams in separate departments. A great deal of what I’ll detail was implemented by our friends in the Customer Insight Group, a team at The Times that specializes in using statistical models to communicate with and understand our customers. Together, we built and retooled some existing reporting and modeling architecture into a system that could record and compute various common recommendation algorithms using a custom, in-house, domain-specific language. We relied on our Java codebase to schedule reads from the queue, handle back-off situations in which the queue experienced higher traffic and store output into our databases.
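The consuming side, again sketched with hypothetical names, is a worker that drains those events and backs off when the database struggles or the queue runs hot; it is a simplified stand-in for the scheduling and back-off handling described above, not our actual code.

import java.util.concurrent.TimeUnit;

// Sketch of a queue consumer with exponential back-off.
public class ActivityConsumer implements Runnable {

  private final ActivityPublisher queue;    // source of ReadEvents (see the previous sketch)
  private final RecommendationStore store;  // hypothetical wrapper around our database writes

  private static final long INITIAL_BACKOFF_MS = 100;
  private static final long MAX_BACKOFF_MS = 30_000;

  public ActivityConsumer(ActivityPublisher queue, RecommendationStore store) {
    this.queue = queue;
    this.store = store;
  }

  @Override
  public void run() {
    long backoffMs = INITIAL_BACKOFF_MS;
    while (!Thread.currentThread().isInterrupted()) {
      try {
        ActivityPublisher.ReadEvent event = queue.take();
        store.aggregate(event);         // update per-user aggregates
        backoffMs = INITIAL_BACKOFF_MS; // success: reset the back-off window
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } catch (Exception e) {
        // Storage is struggling (or the queue is spiking): wait, then double the delay.
        try {
          TimeUnit.MILLISECONDS.sleep(backoffMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
        backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
      }
    }
  }

  /** Hypothetical interface standing in for the real persistence layer. */
  public interface RecommendationStore {
    void aggregate(ActivityPublisher.ReadEvent event) throws Exception;
  }
}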
The domain-specific language, by comparison, represents all of the algorithmic computation in one expressive language and is responsible for data modeling. Atop this consumption pipeline, we created a read-side grammar for fetch operations. The same language constructs also take the serialized keys out of our database and perform post-processing before presenting the results through a REST API.
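Conceptually, the read side boils down to something like the following sketch, written against the JDK’s built-in HTTP server with hypothetical names rather than our DSL-driven implementation: fetch the stored records for a user, post-process them and serve the result as JSON.

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical read-side sketch: in our system the DSL generates the fetch and
// post-processing, but a request still resolves to "look up this user's stored
// keys, rank the results, respond with JSON".
public class RecommendationsEndpoint {

  /** Stands in for the grammar-driven fetch and post-processing step. */
  public interface RecommendationReader {
    List<String> recommendationsFor(String userIdHash);
  }

  public static void main(String[] args) throws Exception {
    RecommendationReader reader = userIdHash -> List.of("story-123", "story-456"); // stub data
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/svc/recommendations", exchange -> {
      String query = exchange.getRequestURI().getQuery();  // e.g. "user=007f00000000000000"
      String userIdHash = query == null ? "" : query.replaceFirst("^user=", "");
      String json = reader.recommendationsFor(userIdHash).stream()
          .map(id -> "\"" + id + "\"")
          .collect(Collectors.joining(",", "{\"recommendations\":[", "]}"));
      byte[] body = json.getBytes(StandardCharsets.UTF_8);
      exchange.getResponseHeaders().set("Content-Type", "application/json");
      exchange.sendResponseHeaders(200, body.length);
      try (OutputStream out = exchange.getResponseBody()) {
        out.write(body);
      }
    });
    server.start();
  }
}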
The key to speed — Data modeling
In any system like this, where response time is a mission-critical goal and the core functionality is fetching some kind of aggregate information from persistent storage, the data model should be given special consideration. In traditional RDBMS structures, this is generally achieved by selecting smart indexes. When utilizing NoSQL, we had to be quite a bit craftier.
Our persistence layer had to combine fast lookup with transactional integrity. We made forays into a number of potential databases. Early implementations utilized an in-house Cassandra cluster, leveraging the write optimization Cassandra is known for to accommodate the swift pace of real-time aggregation. As the project moved into the production stage, we decided to move to Amazon’s DynamoDB, which features a somewhat similar column-oriented storage engine with all the advantages of the cloud (mainly the ability to scale horizontally). Not only did its range/hash key lookup suit the way we planned to index our data, but we also saw the price drop midway through our implementation.
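To give a sense of how the hash/range layout serves these lookups, here is a rough sketch using the DynamoDB document API from the AWS SDK for Java. The table name, attribute names and key split are hypothetical rather than our production schema: the idea is simply that the user ID hash acts as the hash key and the composite metadata string (shown below) acts as the range key, so a prefix condition pulls back only the records a given algorithm needs.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.RangeKeyCondition;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.dynamodbv2.document.spec.QuerySpec;

// Hypothetical table layout: hash key "user_id" (the user ID hash) and
// range key "record_key" (the composite metadata string).
public class AggregateLookup {

  public static void main(String[] args) {
    DynamoDB dynamo = new DynamoDB(new AmazonDynamoDBClient());
    Table table = dynamo.getTable("user_activity_aggregates");

    // Fetch only the story-aggregation records for one user by querying on a
    // range-key prefix; other aggregations for the same user are skipped.
    QuerySpec spec = new QuerySpec()
        .withHashKey("user_id", "007f00000000000000")
        .withRangeKeyCondition(
            new RangeKeyCondition("record_key").beginsWith("s@asset_click_activity.agg:s@story_agg"));

    for (Item item : table.query(spec)) {
      // "read_count" is a hypothetical stored attribute, used here for illustration.
      System.out.println(item.getString("record_key") + " -> " + item.getNumber("read_count"));
    }
  }
}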
We wanted to make sure that our data encoded not only the relevant action a user had taken, but also a roadmap of how the record was constructed. This would allow us greater flexibility when new algorithms demanded additional metadata: new records could incorporate their required fields by storing them in the key.
The keys we’re storing might look a little funny at first, but if you look closely you’ll see that all the metadata required by our recommendation algorithms is burned directly into the record (and is used to query it on the read side). Here, for example, is one representing a user’s read:
s@UID:s@007f00000000000000:s@asset_click_activity.agg:s@story_agg:s@YEAR_MONTH_DAY:s@2013_08_12
From that one key, we can extract:
  1. The user this record was stored for (in the form of a user ID hash)
  2. The name of the aggregation that created the record
  3. The algorithm that will require this information downstream
  4. The date (and format/scope of the date) for which this record is relevant
  5. Data types for all these entries, in the form of the s (string) preceding each
There’s some flexibility here. Say a future algorithm needed aggregates for the scope of an entire month: all that would be required is that the key for that entry use a YEAR_MONTH format rather than including the day. All of these corner cases are addressed in our grammar.
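Here is a small sketch of how such a key could be assembled and pulled apart (a hypothetical helper for illustration; in practice this handling lives in our grammar). Switching an algorithm to monthly aggregates amounts to swapping the date scope and value segments.

// Hypothetical helper illustrating the key layout; the real work is done by
// our grammar, not hand-rolled string handling like this.
public final class AggregateKey {

  public final String userIdHash;
  public final String aggregationName;
  public final String algorithm;
  public final String dateScope;   // e.g. "YEAR_MONTH_DAY" or "YEAR_MONTH"
  public final String dateValue;   // e.g. "2013_08_12" or "2013_08"

  public AggregateKey(String userIdHash, String aggregationName, String algorithm,
                      String dateScope, String dateValue) {
    this.userIdHash = userIdHash;
    this.aggregationName = aggregationName;
    this.algorithm = algorithm;
    this.dateScope = dateScope;
    this.dateValue = dateValue;
  }

  /** Serialize with the s@ (string) type tag in front of each segment. */
  public String serialize() {
    return String.join(":",
        "s@UID", "s@" + userIdHash,
        "s@" + aggregationName,
        "s@" + algorithm,
        "s@" + dateScope, "s@" + dateValue);
  }

  /** Parse a stored key back into its parts by stripping the type tags. */
  public static AggregateKey parse(String key) {
    String[] parts = key.split(":");
    // parts[0] is the s@UID marker; every other segment carries an s@ type tag.
    return new AggregateKey(
        strip(parts[1]), strip(parts[2]), strip(parts[3]), strip(parts[4]), strip(parts[5]));
  }

  private static String strip(String segment) {
    return segment.substring(segment.indexOf('@') + 1);
  }
}

For instance, new AggregateKey("007f00000000000000", "asset_click_activity.agg", "story_agg", "YEAR_MONTH", "2013_08").serialize() would yield a monthly variant of the key shown above.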
In the next article in the series, we’ll take a look at how this information ends up in DynamoDB and what blueprint it follows in our aggregation pipeline.
