Collaborative Filtering with Temporal Dynamics (2010)


doi:10.1145/1721654.1721677

Collaborative Filtering with Temporal Dynamics

By Yehuda Koren

(A previous version of this paper appeared in the Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2009), 447–456.)

Abstract

Customer preferences for products are drifting over time. Product perception and popularity are constantly changing as new selection emerges. Similarly, customer inclinations are evolving, leading them to ever redefine their taste. Thus, modeling temporal dynamics is essential for designing recommender systems or general customer preference models. However, this raises unique challenges. Within the ecosystem intersecting multiple products and customers, many different characteristics are shifting simultaneously, while many of them influence each other and often those shifts are delicate and associated with a few data instances. This distinguishes the problem from concept drift explorations, where mostly a single concept is tracked. Classical time-window or instance decay approaches cannot work, as they lose too much signal when discarding data instances. A more sensitive approach is required, which can make better distinctions between transient effects and long-term patterns. We show how to model the time changing behavior throughout the life span of the data. Such a model allows us to exploit the relevant components of all data instances, while discarding only what is modeled as being irrelevant. Accordingly, we revamp two leading collaborative filtering recommendation approaches. Evaluation is made on a large movie-rating dataset underlying the Netflix Prize contest. Results are encouraging and better than those previously reported on this dataset. In particular, methods described in this paper play a significant role in the solution that won the Netflix contest.

1. INTRODUCTION

Modeling time drifting data is a central problem in data mining. Often, data is changing over time, and models should be continuously updated to reflect its present nature. The analysis of such data needs to find the right balance between discounting temporary effects that have very low impact on future behavior, while capturing longer term trends that reflect the inherent nature of the data. This led to many works on the problem, which is also widely known as concept drift; see, e.g., Schlimmer and Granger15 and Widmer and Kubat.20

Temporal changes in customer preferences bring unique modeling challenges. One kind of concept drift in this setup is the emergence of new products or services that change the focus of customers. Related to this are seasonal changes, or specific holidays, which lead to characteristic shopping patterns. All those changes influence the whole population, and are within the realm of traditional studies on concept drift. However, many of the changes in user behavior are driven by localized factors. For example, a change in the family structure can drastically change shopping patterns. Likewise, individuals gradually change their taste in movies and music. Such changes cannot be captured by methods that seek a global concept drift. Instead, for each customer we are looking at different types of concept drifts, each occurring at a distinct time frame and driven toward a different direction.

The need to model time changes at the level of each individual significantly reduces the amount of available data for detecting such changes. Thus we should resort to more accurate techniques than those that suffice for modeling global changes. For example, it would no longer be adequate to abandon or simply underweight user transactions that lie far in the past.
The signal that can be extracted from those past actions might be invaluable for understanding the customer herself or be indirectly useful for modeling other customers. Yet, we need to distill long-term patterns while discounting transient noise. These considerations require a more sensitive methodology for addressing drifting customer preferences. It would not be adequate to concentrate on identifying and modeling just what is relevant to the present or the near future. Instead, we require an accurate modeling of each point in the past, which will allow us to distinguish between persistent signal that should be captured and noise that should be isolated from the longer term parts of the model.

Modeling user preferences is relevant to multiple applications ranging from spam filtering to market-basket analysis. Our main focus in the paper is on modeling user preferences for building a recommender system, but we believe that the general lessons we learn would apply to other applications as well.

Automated recommendations are a very active research field.12 Such systems analyze patterns of user interest in items or products to provide personalized recommendations of items that will suit a user's taste. We expect user preferences to change over time. The change may stem from multiple factors; some of these factors are fundamental while others are more circumstantial. For example, in a movie recommender system, users may change their preferred genre or adopt a new viewpoint on an actor or director. In addition, they may alter the appearance of their feedback. For example, in a system
where users provide star ratings to products, a user who used to indicate a neutral preference by a "3 stars" input may now indicate dissatisfaction by the same "3 stars" feedback. Similarly, it is known that user feedback is influenced by anchoring, where current ratings should be taken as relative to other ratings given in the same short period. Finally, in many instances, systems cannot separate different household members accessing the same account, even though each member has a different taste and deserves a separate model. This creates a de facto multifaceted meta-user associated with the account. A way to distinguish between different persons is by assuming that time-adjacent accesses are made by the same member (sometimes on behalf of other members), which can be naturally captured by a temporal model that assumes a drifting nature of a customer.

All these patterns and the like should have made temporal modeling a predominant factor in building recommender systems. Nonetheless, with very few exceptions (e.g., Ding and Li4 and Sugiyama et al.16), the recommender literature does not address temporal changes in user behavior. Perhaps this is because user behavior is composed of many different concept drifts, acting in different timeframes and directions, which makes common methodologies for dealing with concept drift and temporal data less successful. We show that capturing time drifting patterns in user behavior is essential for improving the accuracy of recommenders. Our findings also give us hope that the insights from successful time modeling for recommenders will be useful in other data mining applications.

Our test bed is a large movie-rating dataset released by Netflix as the basis of a well-publicized competition.3 This dataset combines several merits for the task at hand. First, it is not a synthetic dataset, but contains user–movie ratings by real paying Netflix subscribers. In addition, its relatively large size—above 100 million date-stamped ratings—makes it a better proxy for real-life large-scale datasets, while putting a premium on computational efficiency. Finally, unlike some other dominant datasets, time effects are natural and are not introduced artificially.

Two interesting (if not surprising) temporal effects that emerge within this dataset are shown in Figure 1. One effect is an abrupt shift of rating scale that happened in early 2004. At that time, the mean rating value jumped from around 3.4 stars to above 3.6 stars. Another significant effect is that ratings given to movies tend to increase with the movie age. That is, older movies receive higher ratings than newer ones. In Koren,8 we shed some light on the origins of these effects.

[Figure 1. Two temporal effects emerging within the Netflix movie-rating dataset. Top panel ("Rating by date"): the average movie rating (mean score) made a sudden jump in early 2004, about 1,500 days since the first rating in the dataset. Bottom panel ("Rating by movie age"): ratings tend to increase with the movie age at the time of the rating, where movie age is measured by the time span since the movie's first rating event within the dataset. In both charts, each point averages 100,000 rating instances.]
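The curves in Figure 1 are simple binned averages, so they are easy to recompute from any date-stamped rating table. The following is a minimal sketch only, assuming the ratings sit in a pandas DataFrame named `ratings` with columns `movie`, `rating`, and `date`; these names are illustrative and not defined by the paper.

```python
import pandas as pd

def figure1_curves(ratings: pd.DataFrame, points_per_bin: int = 100_000):
    """Binned mean-rating curves in the spirit of Figure 1.

    Assumes columns 'movie', 'rating', 'date' (datetime64); the column
    names are illustrative, not taken from the paper.
    """
    df = ratings.copy()

    # Days elapsed since the first rating in the dataset ("Rating by date").
    t0 = df["date"].min()
    df["days"] = (df["date"] - t0).dt.days

    # Movie age at rating time: days since the movie's own first rating
    # ("Rating by movie age").
    first_seen = df.groupby("movie")["date"].transform("min")
    df["movie_age"] = (df["date"] - first_seen).dt.days

    def binned_mean(frame, key):
        # Sort by the key and average fixed-size chunks of ratings,
        # mirroring "each point averages 100,000 rating instances".
        s = frame.sort_values(key).reset_index(drop=True)
        groups = s.groupby(s.index // points_per_bin)
        return groups[key].mean(), groups["rating"].mean()

    return binned_mean(df, "days"), binned_mean(df, "movie_age")
```

With the full Netflix Prize data, plotting the two returned series should reproduce the effects described above: the jump of the mean score around day 1,500 (early 2004) and the steady increase of the mean score with movie age.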
The major contribution of this work is presenting a methodology and specific techniques for modeling time drifting user preferences in the context of recommender systems. The proposed approaches are applied to the aforementioned, extensively analyzed movie-ratings dataset, enabling us to firmly compare our methods with those reported recently. We show that by incorporating temporal information we achieve the best results reported so far, indicating the significance of uncovering temporal effects.

The rest of the paper is organized as follows. In the next section we describe basic notions and notation. Then, in Section 3, our principles for addressing time changing user preferences are developed. Those principles are then incorporated, in quite different ways, into two leading recommender techniques: factor modeling (Section 4) and item–item neighborhood modeling (Section 5).

2. PRELIMINARIES

2.1. Notation

We are given ratings for $m$ users (aka customers) and $n$ items (aka products). We reserve special indexing letters to distinguish users from items: $u$, $v$ for users, and $i$, $j$ for items. A rating $r_{ui}$ indicates the preference by user $u$ of item $i$, where high values mean stronger preference. For example, values can be integers ranging from 1 (star), indicating no interest, to 5 (stars), indicating a strong interest. We distinguish predicted ratings from known ones by using the notation $\hat{r}_{ui}$ for the predicted value of $r_{ui}$. The scalar $t_{ui}$ denotes the time of rating $r_{ui}$. One can use different time units, based on what is appropriate for the
application at hand. For example, when time is measured in days, then $t_{ui}$ counts the number of days elapsed since some early time point.

Usually the vast majority of ratings are unknown. For example, in the Netflix data 99% of the possible ratings are missing because a user typically rates only a small portion of the movies. The $(u, i)$ pairs for which $r_{ui}$ is known are stored in the set $K = \{(u, i) \mid r_{ui}\ \text{is known}\}$, which is known as the training set. Models for the rating data are learned by fitting the previously observed ratings. However, our goal is to generalize those in a way that allows us to predict future, unknown ratings. Thus, caution should be exercised to avoid overfitting the observed data. We achieve this by using a technique called regularization. Regularization restricts the complexity of the models, thereby preventing them from being too specialized to the observed data. We employ L2-regularization, which penalizes the magnitude of the learned parameters. The extent of regularization is controlled by constants denoted as $\lambda_1, \lambda_2, \ldots$

2.2. The Netflix data

We evaluated our algorithms on a movie-rating dataset of more than 100 million date-stamped ratings performed by about 480,000 anonymous Netflix customers on 17,770 movies between 31 December 1999 and 31 December 2005.3 Ratings are integers ranging between 1 and 5. On average, a movie receives 5,600 ratings, while a user rates 208 movies, with substantial variation around each of these averages. To maintain compatibility with results published by others, we adopted some common standards. We evaluated our methods on two comparable sets designed by Netflix: a holdout set ("Probe set") and a test set ("Quiz set"), each of which contains over 1.4 million ratings. Reported results are on the test set, while experiments on the holdout set show the same findings. In our time-modeling context, it is important to note that the test instances of each user come later in time than his/her training instances. The quality of the results is measured by their root mean squared error (RMSE), $\sqrt{\sum_{(u,i) \in \text{TestSet}} (r_{ui} - \hat{r}_{ui})^2 / |\text{TestSet}|}$.

The Netflix data is part of the Netflix Prize contest, whose target is improving the accuracy of Netflix movie recommendations by 10%. The benchmark is Netflix's proprietary system, Cinematch, which achieved an RMSE of 0.9514 on the test set. The grand prize was awarded to a team that managed to drive this RMSE down to 0.8554 after almost 3 years of extensive efforts. Achievable RMSE values on the test set lie in a quite compressed range, as evidenced by the difficulty of winning the grand prize. Nonetheless, there is evidence that small improvements in RMSE terms can have a significant impact on the quality of the top few presented recommendations.7 The algorithms described in this work played a central role in reaching the grand prize.

2.3. Collaborative filtering

Recommender systems are often based on collaborative filtering (CF), a term coined by the developers of the first recommender system—Tapestry.5 This technique relies only on past user behavior—e.g., their previous transactions or product ratings—without requiring the creation of explicit profiles. CF analyzes relationships between users and interdependencies among products, in order to identify new user–item associations. A major appeal of CF is that it is domain-free and avoids the need for extensive data collection. In addition, relying directly on user behavior allows uncovering complex and unexpected patterns that would be difficult or impossible to profile using known data attributes.
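As a brief aside on the evaluation measure of Section 2.2, here is a minimal RMSE sketch; the function name, array names, and toy numbers are illustrative and not taken from the paper or the Netflix data.

```python
import numpy as np

def rmse(true_ratings: np.ndarray, predicted_ratings: np.ndarray) -> float:
    """Root mean squared error over a held-out set of ratings."""
    err = true_ratings - predicted_ratings
    return float(np.sqrt(np.mean(err ** 2)))

# Toy example: a constant 3.6-star predictor on five held-out ratings.
truth = np.array([4.0, 3.0, 5.0, 2.0, 4.0])
pred = np.full_like(truth, 3.6)
print(rmse(truth, pred))  # about 1.02 on this toy data
```

For scale, the Cinematch baseline and the grand-prize solution mentioned above correspond to RMSE values of 0.9514 and 0.8554 on the Quiz set.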
As a consequence, CF attracted much attention in the past decade, resulting in significant progress and being adopted by some successful commercial systems, including Amazon,10 TiVo,1 and Netflix.

The two primary areas of CF are the neighborhood methods and latent factor models. The neighborhood methods are centered on computing the relationships between items or, alternatively, between users. The item-oriented approach evaluates the preference of a user for an item based on ratings of "neighboring" items by the same user. A product's neighbors are other products that tend to be scored similarly when rated by the same user. For example, consider the movie "Saving Private Ryan." Its neighbors might include other war movies, Spielberg movies, and Tom Hanks movies, among others. To predict a particular user's rating for "Saving Private Ryan," we would look for the movie's nearest neighbors that were actually rated by that user. A dual to the item-oriented approach is the user-oriented approach, which identifies like-minded users who can complement each other's missing ratings.

Latent factor models comprise an alternative approach that tries to explain the ratings by characterizing both items and users on, say, 20–200 factors inferred from the pattern of ratings. For movies, the factors discovered by the decomposition might measure obvious dimensions such as comedy vs. drama, amount of action, or orientation to children; less well-defined dimensions such as depth of character development or "quirkiness"; or completely uninterpretable dimensions. For users, each factor measures how much the user likes movies that score high on the corresponding movie factor. One of the most successful realizations of latent factor models is based on matrix factorization; see, e.g., Koren et al.9

3. TRACKING DRIFTING CUSTOMER PREFERENCES

One of the frequently mentioned examples of concept drift is changing customer preferences over time, e.g., "customer preferences change as new products and services become available."6 This aspect of drifting customer preferences highlights a common paradigm in the literature of having global drifting concepts influencing the data as a whole. However, in many applications, including our focus application of recommender systems, we also face a more complicated form of concept drift, where interconnected preferences of many users are drifting in different ways at different time points. This requires the learning algorithm to keep track of multiple changing concepts. In addition, the typically low amount of data instances associated with individual customers calls for more concise and efficient learning methods, which maximize the utilization of the signal in the data.
In a survey on the problem of concept drift, Tsymbal19 argues that three approaches can be distinguished in the literature. The instance selection approach discards instances that are less relevant to the current state of the system. A common variant is the time-window approach, where only recent instances are considered. A possible disadvantage of this simple model is that it gives the same significance to all instances within the considered time window, while completely discarding all other instances. Equal significance might be reasonable when the time shift is abrupt, but less so when the time shift is gradual. Thus, a refinement is instance weighting, where instances are weighted based on their estimated relevance. Frequently, a time decay function is used, underweighting instances as they occur deeper in the past. The third approach is based on ensemble learning, which maintains a family of predictors that together produce the final outcome. Those predictors are weighted by their perceived relevance to the present time point; e.g., predictors that were more successful on recent instances get higher weights.

We performed extensive experiments with instance weighting schemes, trying different exponential time decay rates on both neighborhood and factor models. The consistent finding was that prediction quality improves as we moderate the time decay, reaching the best quality when there is no decay at all. This finding holds despite the fact that users do change their taste and rating scale over the years, as we show later. However, much of the old preferences still persist or, more importantly, help in establishing useful cross-user or cross-product patterns in the data. Thus, simply underweighting past actions loses too much signal along with the discarded noise, which is detrimental given the scarcity of data per user.

As for ensemble learning, having multiple models, each of which considers only a fraction of the total behavior, may miss those global patterns that can be identified only when considering the full scope of user behavior. What makes ensembles even less appealing in our case is the need to keep track of the independent drifting behaviors of many customers. This, in turn, would require building a separate ensemble for each user. Such a separation would significantly complicate our ability to integrate information across users along multiple time points, which is the cornerstone of collaborative filtering. For example, an interesting relation between products can be established by related actions of many users, each of them at a totally different point of time. Capturing such a collective signal requires building a single model encompassing all users and items together.

All those considerations led us to the following guidelines, which we adopt for modeling drifting user preferences.

• We seek models that explain user behavior along the full extent of the time period, not only the present behavior (while subject to performance limitations). Such modeling is key to being able to extract signal from each time point, while neglecting only the noise.

• Multiple changing concepts should be captured. Some are user-dependent and some are item-dependent. Similarly, some are gradual while others are sudden.

• While we need to model separate drifting "concepts" or preferences per user and/or item, it is essential to combine all those concepts within a single framework.
This combination allows modeling interactions crossing users and items, thereby identifying higher level patterns.

• In general, we do not try to extrapolate future temporal dynamics, e.g., estimating future changes in a user's preferences. Extrapolation could be very helpful but is seemingly too difficult, especially given a limited amount of known data. Rather, our goal is to capture past temporal patterns in order to isolate persistent signal from transient noise. The result, indeed, helps in predicting future behavior.

Now we turn to how these desirable principles are incorporated into two leading approaches to CF—matrix factorization and neighborhood methods.

4. TIME-AWARE FACTOR MODEL

4.1. The anatomy of a factor model

Matrix factorization is a well-recognized approach to CF.9, 11, 17 This approach lends itself well to an adequate modeling of temporal effects. Before we deal with those temporal effects, we would like to establish the foundations of a static factor model.

In its basic form, matrix factorization characterizes both items and users by vectors of factors inferred from patterns of item ratings. High correspondence between item and user factors leads to recommendation of an item to a user. More specifically, both users and items are mapped to a joint latent factor space of dimensionality $f$, such that ratings are modeled as inner products in that space. Accordingly, each user $u$ is associated with a vector $p_u \in \mathbb{R}^f$ and each item $i$ is associated with a vector $q_i \in \mathbb{R}^f$. A rating is predicted by the rule

$$\hat{r}_{ui} = q_i^T p_u \qquad (1)$$

The major challenge is computing the mapping of each item and user to factor vectors $q_i, p_u \in \mathbb{R}^f$. After this mapping is accomplished, we can easily compute the rating a user will give to any item by using Equation 1.

Such a model is closely related to singular value decomposition (SVD), a well-established technique for identifying latent semantic factors in information retrieval. Applying SVD in the CF domain would require factoring the user–item rating matrix. Such a factorization raises difficulties due to the high portion of missing values caused by the sparseness of the user–item ratings matrix. Conventional SVD is undefined when knowledge about the matrix is incomplete. Moreover, carelessly addressing only the relatively few known entries is highly prone to overfitting. Earlier works13 relied on imputation to fill in missing ratings and make the rating matrix dense. However, imputation can be very expensive, as it significantly increases the amount of data. In addition, the data may be considerably distorted due to inaccurate imputation. Hence, more recent
works (e.g., Koren,7 Paterek,11 and Takacs et al.17) suggested modeling directly only the observed ratings, while avoiding overfitting through an adequately regularized model. In order to learn the factor vectors ($p_u$ and $q_i$), we minimize the regularized squared error on the set of known ratings:

$$\min_{p_*, q_*} \sum_{(u,i) \in K} (r_{ui} - q_i^T p_u)^2 + \lambda_1 (\|p_u\|^2 + \|q_i\|^2) \qquad (2)$$

Minimization is typically performed by stochastic gradient descent.

Model (1) tries to capture the interactions between users and items that produce the different rating values. However, much of the observed variation in rating values is due to effects associated with either users or items, independently of their interaction; these are known as biases. A prime example is that typical CF data exhibits large systematic tendencies for some users to give higher ratings than others, and for some items to receive higher ratings than others. After all, some products are widely perceived as better (or worse) than others. Thus, it would be unwise to explain the full rating value by an interaction of the form $q_i^T p_u$. Instead, we will try to identify the portion of these values that can be explained by individual user or item effects (biases). The separation of interaction and biases will allow us to subject only the true interaction portion of the data to factor modeling.

We will encapsulate those effects, which do not involve user–item interaction, within the baseline predictors. These baseline predictors tend to capture much of the observed signal, in particular much of the temporal dynamics within the data. Hence, it is vital to model them accurately, which enables better identification of the part of the signal that truly represents user–item interaction and should be subject to factorization.

A suitable way to construct a static baseline predictor is as follows. Denote by $\mu$ the overall average rating. A baseline predictor for an unknown rating $r_{ui}$ is denoted by $b_{ui}$ and accounts for the user and item main effects:

$$b_{ui} = \mu + b_u + b_i$$

where the parameters $b_u$ and $b_i$ indicate the observed deviations of user $u$ and item $i$, respectively, from the average.
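To make the static model of Equations (1) and (2) and the baseline predictor $b_{ui} = \mu + b_u + b_i$ concrete, here is a minimal stochastic gradient descent sketch that learns biases and factors jointly and predicts $\mu + b_u + b_i + q_i^T p_u$. It is only an illustration of the technique under assumed hyperparameters (factor dimensionality, learning rate, regularization constants); it is not the tuned procedure or code used in the paper.

```python
import numpy as np

def train_biased_mf(ratings, n_users, n_items, f=40,
                    lrate=0.005, reg=0.02, n_epochs=20, seed=0):
    """SGD training of a static factor model with baseline predictors.

    Predicts r_ui as mu + b_u + b_i + q_i^T p_u and minimizes the
    L2-regularized squared error over the known ratings, in the spirit
    of Equation (2). `ratings` is a list of (u, i, r) triples with
    0-based user/item indices; hyperparameter values are illustrative.
    """
    rng = np.random.default_rng(seed)
    mu = np.mean([r for _, _, r in ratings])       # overall average rating
    b_u = np.zeros(n_users)                        # user biases
    b_i = np.zeros(n_items)                        # item biases
    p = 0.1 * rng.standard_normal((n_users, f))    # user factors p_u
    q = 0.1 * rng.standard_normal((n_items, f))    # item factors q_i

    for _ in range(n_epochs):
        for idx in rng.permutation(len(ratings)):  # visit ratings in random order
            u, i, r = ratings[idx]
            err = r - (mu + b_u[u] + b_i[i] + q[i] @ p[u])
            # Gradient steps with L2 shrinkage on every learned parameter.
            b_u[u] += lrate * (err - reg * b_u[u])
            b_i[i] += lrate * (err - reg * b_i[i])
            pu_old = p[u].copy()
            p[u] += lrate * (err * q[i] - reg * p[u])
            q[i] += lrate * (err * pu_old - reg * q[i])
    return mu, b_u, b_i, p, q

def predict(u, i, mu, b_u, b_i, p, q):
    """Predicted rating: mu + b_u + b_i + q_i^T p_u."""
    return mu + b_u[u] + b_i[i] + q[i] @ p[u]
```

The remainder of Section 4 in the paper builds on such a static model by letting the biases and the user factors change over time.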