Get on with it!
Recommender system industry
challenges move towards real-world,
online evaluation
Padova – March 23rd, 2016
Andreas Lommatzsch - TU Berlin, Berlin, Germany
Jonas Seiler - plista, Berlin, Germany
Daniel Kohlsdorf - XING, Hamburg, Germany
CrowdRec - www.crowdrec.eu
Jonas Seiler
http://www.plista.com
Where are recommender
system challenges headed?
Direction 1: Use info beyond the user-item matrix.
Direction 2: Online evaluation + multiple metrics.
Moving towards real-world evaluation
Flickr credit: rodneycampbell
Why evaluate?
<Images showing “our” use cases>
● plista
● Improve the results of the algorithms
● Handle technical constraints
● User satisfaction
• Evaluation is crucial for the success of real-life systems
• How should we evaluate?
● Improve user satisfaction
● Increase sales, earnings
● Optimize the technical platform for providing the
service
<Diagram: dimensions along which a recommender can be evaluated>
● Precision and recall
● Technical complexity
● Influence on sales
● Required hardware resources
● Business models
● Scalability
● Diversity of the presented results
● User satisfaction
Evaluation Settings
• A static collection of documents
• A set of queries
• A list of relevant documents defined by
experts for each query
Traditional Evaluation in IR
The Cranfield paradigm was designed in the early 1960s when
information access was via Boolean queries against manually indexed
documents and there was (virtually) no text online. Cyril Cleverdon,
Librarian of the College of Aeronautics, Cranfield, England, built a test
collection that modeled university researchers, including abstracts of
aeronautical papers, one-line queries based on questions gathered
from the researchers, and complete relevance judgments for each
query submitted by these users. The idea of carefully modeling some
user application continued with Prof. Gerard Salton and the SMART
collections, such as searching MEDLINE abstracts using real questions
submitted to MEDLINE, or searching full text TIME articles with real
questions from several sources, etc. A 1969 paper by Michael Lesk
and Salton used experiments on the ISPRA collection to show that
relevance judgments made by a person who was not the user would
still allow valid system comparison, a precursor to the paper by Ellen
Voorhees in SIGIR 1998.
IR based on static
collections
A set of queries. For each
query there is a list of
relevant documents
defined by experts
Reproducible setting
All researchers have exactly the same information
“The Cranfield paradigm”
Advantages
• Reproducible setting
• All researchers have exactly the same
information
• Optimized for measuring precision
<Diagram: Query0 with its expert-judged relevant documents>
Traditional Evaluation in IR
Weaknesses of traditional IR evaluation
• High costs of creating datasets
• Datasets are not up-to-date
• Domain-specific documents
• The expert-defined ground truth does not
consider individual user preferences
• Context-awareness is not considered
• Technical aspects are ignored
Context is
everything
Industry and recsys challenges
• Challenges benefit both industry and academic research.
• We look at how industry challenges have evolved since
the Netflix Prize (2009).
Traditional Evaluation in RecSys
Rating prediction
Cross-validation
Individual user preferences /
personalization
Large, sparse datasets
Evaluation Settings
• Rating prediction on user-item matrices
• Large, sparse dataset
• Predict personalized ratings
• Cross-validation, RMSE
Advantages
• Reproducible setting
• Personalization
• Dataset is based on
real user ratings
“The Netflix paradigm”
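To make the RMSE criterion above concrete, here is a minimal sketch; the ratings are made-up examples, not Netflix data:

```python
# A minimal illustration of the RMSE metric used in the Netflix paradigm:
# the root of the mean squared error between predicted and held-out ratings.
import math

def rmse(predicted, actual):
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Made-up example: three predicted ratings vs. the true held-out ratings.
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))  # ~0.645
```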
Traditional Evaluation in RecSys
Weaknesses of traditional Recommender evaluation
• Static data
• Only one type of data - only user ratings
• User ratings are noisy
• Temporal aspects tend to be ignored
• Context-awareness is not considered
• Technical aspects are ignored
Static data
Context is not taken into account
Cross-validation does not match real-life settings
Why Netflix did not implement the winning algorithm:
https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml
Challenges of Developing Applications
Challenges
• Data streams - continuous changes
• Big data
• Combine knowledge from different sources
• Context-Awareness
• Users expect personally relevant results
• Heterogeneous devices
• Technical complexity, real-time requirements
How to address these challenges in the Evaluation?
• Realistic evaluation setting
– Heterogeneous data sources
– Streams
– Dynamic user feedback
• Appropriate metrics
– Precision and User satisfaction
– Technical complexity
– Sales and Business models
• Online and Offline Evaluation
How to Set Up a Better Evaluation?
● Online Evaluation
● Consider the context
● Data streams
● Business model-oriented metrics
Approaches for a better Evaluation
• News recommendations
@ plista
• Job recommendations
The plista Recommendation Scenario
Setting
● 250 ms response time
● 350 million ad impressions (AI) per day
● In 10 countries
Challenges
● News change continuously
● Users do not log in explicitly
● Seasonality, context-dependent user preferences
Offline
• Cross-validation
– Metric Optimization Engine
(https://github.com/Yelp/MOE)
– Integration into Spark
• How well does it correlate with
Online Evaluation?
• Time Complexity
Evaluation @ plista
Online
• A/B tests
– Limited by caching memory and
computational resources
– MOE*
Offline
• Mean and variance estimation of the parameter space with a
Gaussian Process
• Evaluate the parameter with the highest Expected Improvement (EI),
Upper Confidence Bound (UCB), …
• REST API
Evaluation using MOE
Online
• A/B Tests are expensive
• Model non-stationarity
• Integrate out non-stationarity
to get mean EI
Evaluation using MOE
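For intuition, a minimal sketch of the Gaussian-Process-plus-EI loop that MOE automates is shown below. The RBF kernel, noise level, and CTR numbers are illustrative assumptions, not plista's actual configuration or MOE's API:

```python
# Sketch of Bayesian optimization with Expected Improvement (EI):
# fit a Gaussian Process to observed (parameter, CTR) pairs, then pick
# the candidate parameter with the highest EI as the next one to test.
import numpy as np
from scipy.stats import norm

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_cand, noise=1e-6):
    # Posterior mean and variance of the GP at the candidate points.
    k = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_s = rbf_kernel(x_train, x_cand)
    k_inv = np.linalg.inv(k)
    mu = k_s.T @ k_inv @ y_train
    var = 1.0 - np.sum(k_s.T @ k_inv * k_s.T, axis=1)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, y_best):
    # EI for maximization: expected amount by which each candidate beats y_best.
    sigma = np.sqrt(var)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Observed CTRs for three tried parameter values (illustrative numbers).
x_train = np.array([0.1, 0.4, 0.9])
y_train = np.array([0.021, 0.034, 0.025])
x_cand = np.linspace(0.0, 1.0, 101)

mu, var = gp_posterior(x_train, y_train, x_cand)
ei = expected_improvement(mu, var, y_train.max())
print("next parameter to try:", x_cand[np.argmax(ei)])
```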
Provide an API enabling researchers to test their own ideas
• The CLEF-NewsREEL challenge
• A Challenge in CLEF (Conferences and Labs of the Evaluation Forum)
• 2 Tasks: Online and Offline Evaluation
The CLEF-NewsREEL challenge
How does the challenge work?
• Live streams consisting of impressions, requests, and
clicks; 5 publishers; approx. 6 million messages per day
• Technical requirements: 100 ms per request
• Live evaluation
based on CTR
CLEF-NewsREEL
Online Task
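As a toy illustration of CTR-based live evaluation, the sketch below counts clicks and impressions per recommender; the message format is an assumption for illustration, not the actual ORP protocol:

```python
# Toy CTR evaluation: aggregate clicks and impressions per recommender
# from a message stream and report the click-through rate.
from collections import defaultdict

def ctr_by_recommender(messages):
    clicks = defaultdict(int)
    impressions = defaultdict(int)
    for msg in messages:
        if msg["type"] == "impression":
            impressions[msg["recommender"]] += 1
        elif msg["type"] == "click":
            clicks[msg["recommender"]] += 1
    return {r: clicks[r] / impressions[r] for r in impressions}

# Illustrative stream: two impressions and one click for recommender "A".
stream = [
    {"type": "impression", "recommender": "A"},
    {"type": "impression", "recommender": "A"},
    {"type": "click", "recommender": "A"},
]
print(ctr_by_recommender(stream))  # {'A': 0.5}
```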
Online vs. Offline Evaluation
• Technical aspects can be evaluated without user feedback
• Analyze the required resources and the response time
• Simulate the online evaluation by replaying a recorded
stream
CLEF-NewsREEL
Offline Task
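A minimal sketch of such a stream replay is shown below. The log format (newline-delimited JSON with a millisecond "timestamp" field) and the local endpoint URL are assumptions for illustration, not the actual Idomaar protocol:

```python
# Sketch of offline stream replay: read a recorded message log and POST
# each message to a locally running recommender, preserving the original
# inter-arrival times (optionally sped up).
import json
import time
import urllib.request

ENDPOINT = "http://localhost:8080/recommend"  # hypothetical local recommender

def replay(log_path, speedup=10.0):
    prev_ts = None
    with open(log_path) as log:
        for line in log:
            msg = json.loads(line)
            ts = msg["timestamp"]
            if prev_ts is not None:
                # Sleep to reproduce the recorded inter-arrival gap.
                time.sleep(max(0.0, (ts - prev_ts) / 1000.0) / speedup)
            prev_ts = ts
            req = urllib.request.Request(
                ENDPOINT,
                data=json.dumps(msg).encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            start = time.time()
            urllib.request.urlopen(req, timeout=1.0)
            # Flag responses violating the 100 ms constraint from the live task.
            if time.time() - start > 0.1:
                print("slow response for message at", ts)

replay("recorded_stream.jsonl")
```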
Challenge
• Realistic simulation of streams
• Reproducible setup of computing environments
Solution
• A framework simplifying
the setup of the evaluation
environment
• The Idomaar framework developed in the CrowdRec project
CLEF-NewsREEL
Offline Task
http://rf.crowdrec.eu
More Information
• SIGIR Forum, Dec. 2015 (Vol. 49, No. 2)
http://sigir.org/files/forum/2015D/p129.pdf
Evaluate your algorithm online and offline in NewsREEL
• Register for the challenge!
http://crowdrec.eu/2015/11/clef-newsreel-2016/
(register by April 22)
• Tutorials and Templates are provided at orp.plista.com
CLEF-NewsREEL
https://recsys.xing.com/
XING - RecSys Challenge
Job Recommendations @ XING
XING - Evaluation based on interaction
● On XING, users can give explicit feedback on recommendations.
● The volume of explicit user feedback is far lower than that of implicit signals.
● A/B tests focus on click-through rate.
XING - RecSys Challenge, Scoring,
Space on Page
● Predict 30 items for each user.
● Score: weighted combination of precision at several cut-offs:
○ precisionAt(2)
○ precisionAt(4)
○ precisionAt(6)
○ precisionAt(20)
<Screenshot: only the top 6 recommendations are visible on the page>
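A minimal sketch of this scoring scheme follows; the weights are placeholders for illustration, since the slides only state that the score is a weighted combination of precision@{2, 4, 6, 20}:

```python
# Sketch of precision-at-k scoring for a 30-item recommendation list.
def precision_at(k, recommended, relevant):
    # Fraction of the top-k recommended items the user actually interacted with.
    return sum(1 for item in recommended[:k] if item in relevant) / k

def score(recommended, relevant, weights=None):
    # Assumed equal weights; the actual challenge weights are not given here.
    weights = weights or {2: 1.0, 4: 1.0, 6: 1.0, 20: 1.0}
    relevant = set(relevant)
    return sum(w * precision_at(k, recommended, relevant)
               for k, w in weights.items())

# Example: a 30-item recommendation list scored against clicked job postings.
recommendations = list(range(30))
clicked = {0, 3, 5, 21}
print(score(recommendations, clicked))
```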
XING - RecSys Challenge, User Data
• User ID
• Job Title
• Educational Degree
• Field of Study
• Location
XING - RecSys Challenge, User Data
• Number of past jobs
• Years of Experience
• Current career level
• Current discipline
• Current industry
XING - RecSys Challenge, Item Data
• Job title
• Desired career level
• Desired discipline
• Desired industry
XING - RecSys Challenge, Interaction Data
• Timestamp
• User
• Job
• Type:
– Deletion
– Click
– Bookmark
XING - RecSys Challenge, Anonymization
XING - RecSys Challenge, Future
• Live Challenge
– Users submit predicted future interactions
– The solution is recommended on the platform
– Participants get points for actual user clicks
<Cycle diagram: Release to Challenge → Work on Predictions → Collect Clicks → Score>
How to Set Up a Better Evaluation
• Consider different quality criteria
(prediction, technical, business models)
• Aggregate heterogeneous information sources
• Consider user feedback
• Use online and offline analyses
to understand users and their
requirements
Concluding ...
Participate in challenges based on real-life scenarios
• NewsREEL challenge
Concluding ...
• RecSys 2016 challenge
=> Organize a challenge. Focus on real-life data.
http://orp.plista.com
http://2016.recsyschallenge.com/
More Information
• http://www.crowdrec.eu
• http://www.clef-newsreel.org
• http://orp.plista.com
• http://2016.recsyschallenge.com
• http://www.xing.com
Thank You