Get on with it!
Recommender system industry
challenges move towards real-world,
online evaluation
Padova – March 23rd, 2016
Andreas Lommatzsch - TU Berlin, Berlin, Germany
Jonas Seiler - plista, Berlin, Germany
Daniel Kohlsdorf - XING, Hamburg, Germany
CrowdRec - www.crowdrec.eu
Jonas Seiler
http://www.plista.com
Where are recommender
system challenges headed?
Direction 1: Use info beyond the user-item matrix.
Direction 2: Online evaluation + multiple metrics.
Moving towards real-world evaluation
Flickr credit: rodneycampbell
Why evaluate?
<Images showing “our” use cases>
● plista
● Improve the results of the algorithms
● Handle technical constraints
● User satisfaction
• Evaluation is crucial for the success of real-life systems
• How should we evaluate?
● Improve user satisfaction
● Increase sales, earnings
● Optimize the technical platform for providing the
service
<Diagram: dimensions along which a recommender can be evaluated>
● Precision and recall
● Technical complexity
● Influence on sales
● Required hardware resources
● Business models
● Scalability
● Diversity of the presented results
● User satisfaction
Evaluation Settings
• A static collection of documents
• A set of queries
• A list of relevant documents defined by
experts for each query
Traditional Evaluation in IR
The Cranfield paradigm was designed in the early 1960s when
information access was via Boolean queries against manually indexed
documents and there was (virtually) no text online. Cyril Cleverdon,
Librarian of the College of Aeronautics, Cranfield, England, built a test
collection that modeled university researchers, including abstracts of
aeronautical papers, one-line queries based on questions gathered
from the researchers, and complete relevance judgments for each
query submitted by these users. The idea of carefully modeling some
user application continued with Prof. Gerard Salton and the SMART
collections, such as searching MEDLINE abstracts using real questions
submitted to MEDLINE, or searching full text TIME articles with real
questions from several sources, etc. A 1969 paper by Michael Lesk
and Salton used experiments on the ISPRA collection to show that
relevance judgments made by a person who was not the user would
still allow valid system comparison, a precursor to the paper by Ellen
Voorhees in SIGIR 1998.
IR based on static
collections
A set of queries. For each
query there is a list of
relevant documents
defined by experts
Reproducible setting
All researchers have exactly the same information
“The Cranfield paradigm”
Advantages
• Reproducible setting
• All researchers have exactly the same
information
• Optimized for measuring precision
<Diagram: Query0 with its expert-judged relevant documents>
Traditional Evaluation in IR
Weaknesses of traditional IR evaluation
• High costs of creating datasets
• Datasets are not up-to-date
• Domain-specific documents
• The expert-defined ground truth does not
consider individual user preferences
• Context-awareness is not considered
• Technical aspects are ignored
Context is
everything
Industry and recsys challenges
• Challenges benefit both industry and academic research.
• We look at how industry challenges have evolved since
the Netflix Prize (2009).
Traditional Evaluation in RecSys
Rating prediction
Cross-validation
Individual user preferences /
personalization
Large, sparse datasets
Evaluation Settings
• Rating prediction on user-item matrices
• Large, sparse dataset
• Predict personalized ratings
• Cross-validation, RMSE
Advantages
• Reproducible setting
• Personalization
• Dataset is based on
real user ratings
“The Netflix paradigm”
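To make the RMSE criterion above concrete, here is a minimal sketch; the ratings are made-up examples, not Netflix data:

```python
# A minimal illustration of the RMSE metric used in the Netflix paradigm:
# the root of the mean squared error between predicted and held-out ratings.
import math

def rmse(predicted, actual):
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Made-up example: three predicted ratings vs. the true held-out ratings.
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))  # ~0.645
```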
Traditional Evaluation in RecSys
Weaknesses of traditional Recommender evaluation
• Static data
• Only one type of data - only user ratings
• User ratings are noisy
• Temporal aspects tend to be ignored
• Context-awareness is not considered
• Technical aspects are ignored
Static data
Context is not taken into account
Cross-validation does not match real-life settings
Why Netflix did not implement the winning algorithm:
https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml
Challenges of Developing Applications
Challenges
• Data streams - continuous changes
• Big data
• Combine knowledge from different sources
• Context-Awareness
• Users expect personally relevant results
• Heterogeneous devices
• Technical complexity, real-time requirements
How to address these challenges in the Evaluation?
• Realistic evaluation setting
– Heterogeneous data sources
– Streams
– Dynamic user feedback
• Appropriate metrics
– Precision and User satisfaction
– Technical complexity
– Sales and Business models
• Online and Offline Evaluation
How to Set Up a Better Evaluation?
● Online Evaluation
● Consider the context
● Data streams
● Business model-oriented metrics
Approaches for a better Evaluation
• News recommendations
@ plista
• Job recommendations
The plista Recommendation Scenario
Setting
● 250 ms response time
● 350 million ad impressions (AI) per day
● In 10 countries
Challenges
● News change continuously
● Users do not log in explicitly
● Seasonality, context-dependent user preferences
Offline
• Cross-validation
– Metric Optimization Engine
(https://github.com/Yelp/MOE)
– Integration into Spark
• How well does it correlate with
Online Evaluation?
• Time Complexity
Evaluation @ plista
Online
• A/B tests
– Limited by caching memory and
computational resources
– MOE*
Offline
• Mean and variance estimation of the parameter space with a
Gaussian Process
• Evaluate the parameter with the highest Expected Improvement (EI),
Upper Confidence Bound (UCB), …
• REST API
Evaluation using MOE
Online
• A/B Tests are expensive
• Model non-stationarity
• Integrate out non-stationarity
to get mean EI
Evaluation using MOE
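For intuition, a minimal sketch of the Gaussian-Process-plus-EI loop that MOE automates is shown below. The RBF kernel, noise level, and CTR numbers are illustrative assumptions, not plista's actual configuration or MOE's API:

```python
# Sketch of Bayesian optimization with Expected Improvement (EI):
# fit a Gaussian Process to observed (parameter, CTR) pairs, then pick
# the candidate parameter with the highest EI as the next one to test.
import numpy as np
from scipy.stats import norm

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_cand, noise=1e-6):
    # Posterior mean and variance of the GP at the candidate points.
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf_kernel(x_train, x_cand)
    k_inv = np.linalg.inv(k)
    mu = k_s.T @ k_inv @ y_train
    var = 1.0 - np.sum(k_s.T @ k_inv * k_s.T, axis=1)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    # EI for maximization: expected amount by which each candidate beats y_best.
    sigma = np.sqrt(var)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Observed CTRs for three tried parameter values (illustrative numbers).
x_train = np.array([0.1, 0.4, 0.9])
y_train = np.array([0.021, 0.034, 0.025])
x_cand = np.linspace(0.0, 1.0, 101)

mu, var = gp_posterior(x_train, y_train, x_cand)
ei = expected_improvement(mu, var, y_train.max())
print("next parameter to try:", x_cand[np.argmax(ei)])
```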
Provide an API enabling researchers to test their own ideas
• The CLEF-NewsREEL challenge
• A Challenge in CLEF (Conferences and Labs of the Evaluation Forum)
• 2 Tasks: Online and Offline Evaluation
The CLEF-NewsREEL challenge
How does the challenge work?
• Live streams consisting of impressions, requests, and
clicks; 5 publishers; approx. 6 million messages per day
• Technical requirements: 100 ms per request
• Live evaluation
based on CTR
CLEF-NewsREEL
Online Task
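As a toy illustration of CTR-based live evaluation, the sketch below counts clicks and impressions per recommender; the message format is an assumption for illustration, not the actual ORP protocol:

```python
# Toy CTR evaluation: aggregate clicks and impressions per recommender
# from a message stream and report the click-through rate.
from collections import defaultdict

def ctr_by_recommender(messages):
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for msg in messages:
        if msg["type"] == "impression":
            impressions[msg["recommender"]] += 1
        elif msg["type"] == "click":
            clicks[msg["recommender"]] += 1
    return {r: clicks[r] / impressions[r] for r in impressions}

# Illustrative stream: two impressions and one click for recommender "A".
stream = [
    {"type": "impression", "recommender": "A"},
    {"type": "impression", "recommender": "A"},
    {"type": "click", "recommender": "A"},
]
print(ctr_by_recommender(stream))  # {'A': 0.5}
```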
Online vs. Offline Evaluation
• Technical aspects can be evaluated without user feedback
• Analyze the required resources and the response time
• Simulate the online evaluation by replaying a recorded
stream
CLEF-NewsREEL
Offline Task
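A minimal sketch of such a stream replay is shown below. The log format (newline-delimited JSON with a millisecond "timestamp" field) and the local endpoint URL are assumptions for illustration, not the actual Idomaar protocol:

```python
# Sketch of offline stream replay: read a recorded message log and POST
# each message to a locally running recommender, preserving the original
# inter-arrival times (optionally sped up).
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8080/recommend"  # hypothetical local recommender

def replay(log_path, speedup=10.0):
    prev_ts = None
    with open(log_path) as log:
        for line in log:
            msg = json.loads(line)
            ts = msg["timestamp"]
            if prev_ts is not None:
                # Sleep to reproduce the recorded inter-arrival gap.
                time.sleep(max(0.0, (ts - prev_ts) / 1000.0) / speedup)
            prev_ts = ts
            req = urllib.request.Request(
                ENDPOINT,
                data=json.dumps(msg).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            start = time.time()
            urllib.request.urlopen(req, timeout=1.0)
            # Flag responses violating the 100 ms constraint from the live task.
            if time.time() - start > 0.1:
                print("slow response for message at", ts)

replay("recorded_stream.jsonl")
```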
Challenge
• Realistic simulation of streams
• Reproducible setup of computing environments
Solution
• A framework simplifying
the setup of the evaluation
environment
• The Idomaar framework developed in the CrowdRec project
CLEF-NewsREEL
Offline Task
http://rf.crowdrec.eu
More Information
• SIGIR Forum, Dec. 2015 (Vol. 49, No. 2)
http://sigir.org/files/forum/2015D/p129.pdf
Evaluate your algorithm online and offline in NewsREEL
• Register for the challenge!
http://crowdrec.eu/2015/11/clef-newsreel-2016/
(register by April 22)
• Tutorials and Templates are provided at orp.plista.com
CLEF-NewsREEL
https://recsys.xing.com/
XING - RecSys Challenge
Job Recommendations @ XING
XING - Evaluation based on interaction
● On XING, users can give explicit feedback on recommendations.
● The volume of explicit user feedback is far lower than that of implicit signals.
● A/B tests focus on click-through rate.
XING - RecSys Challenge, Scoring,
Space on Page
● Predict 30 items for each user.
● Score: weighted combination of precision at several cut-offs:
○ precisionAt(2)
○ precisionAt(4)
○ precisionAt(6)
○ precisionAt(20)
<Screenshot: only the top 6 recommendations are visible on the page>
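A minimal sketch of this scoring scheme follows; the weights are placeholders for illustration, since the slides only state that the score is a weighted combination of precision@{2, 4, 6, 20}:

```python
# Sketch of precision-at-k scoring for a 30-item recommendation list.
def precision_at(k, recommended, relevant):
    # Fraction of the top-k recommended items the user actually interacted with.
    return sum(1 for item in recommended[:k] if item in relevant) / k

def score(recommended, relevant, weights=None):
    # Assumed equal weights; the actual challenge weights are not given here.
    weights = weights or {2: 1.0, 4: 1.0, 6: 1.0, 20: 1.0}
    relevant = set(relevant)
    return sum(w * precision_at(k, recommended, relevant)
               for k, w in weights.items())

# Example: a 30-item recommendation list scored against clicked job postings.
recommendations = list(range(30))
clicked = {0, 3, 5, 21}
print(score(recommendations, clicked))
```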
XING - RecSys Challenge, User Data
• User ID
• Job Title
• Educational Degree
• Field of Study
• Location
XING - RecSys Challenge, User Data
• Number of past jobs
• Years of Experience
• Current career level
• Current discipline
• Current industry
XING - RecSys Challenge, Item Data
• Job title
• Desired career level
• Desired discipline
• Desired industry
XING - RecSys Challenge, Interaction Data
• Timestamp
• User
• Job
• Type:
– Deletion
– Click
– Bookmark
XING - RecSys Challenge, Anonymization
XING - RecSys Challenge, Future
• Live Challenge
– Users submit predicted future interactions
– The solution is recommended on the platform
– Participants get points for actual user clicks
<Cycle diagram: Release to Challenge → Work on Predictions → Collect Clicks → Score>
How to Set Up a Better Evaluation
• Consider different quality criteria
(prediction, technical, business models)
• Aggregate heterogeneous information sources
• Consider user feedback
• Use online and offline analyses
to understand users and their
requirements
Concluding ...
Participate in challenges based on real-life scenarios
• NewsREEL challenge
Concluding ...
• RecSys 2016 challenge
=> Organize a challenge. Focus on real-life data.
http://orp.plista.com
http://2016.recsyschallenge.com/
More Information
• http://www.crowdrec.eu
• http://www.clef-newsreel.org
• http://orp.plista.com
• http://2016.recsyschallenge.com
• http://www.xing.com
Thank You