Site Search Benchmarking via Crowdsourcing

Objective:

Our objective is to use a crowdsourcing platform to find the relevancy of search result pages (SRPs) and to benchmark them against competitors. Apart from competitors, we could also gauge search result relevancy of the production setup against planned new releases.

Since site search contributes 60%+ of revenue, it is essential to gauge search relevance and devise a systematic approach for doing so. A crowdsourcing platform could provide us optimal results in a very short span of time.

Query Set Preparation:

A comprehensive query set (2,500 queries) will be prepared, representative of the overall search queries on our website (this set will account for a significant percentage of our key metrics – visits and revenue). It will consist of:

  • Top 450-500 search queries by visits, since they represent 20-25% of the total visit count and are therefore quite important
  • From the next 27k queries (which represent 40% of search visits), randomly chosen samples from all deciles/quartiles (see the sampling sketch after this list)
  • Randomly selected queries from the long tail
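
Below is a minimal sketch of how such a stratified sample could be drawn with pandas. The query-log schema (a query column and a visits column), the file name, and the per-bucket sample sizes are assumptions for illustration, tuned so the total lands near 2,500.

    import pandas as pd

    # Hypothetical query log: one row per query with its visit count (schema assumed).
    queries = pd.read_csv("query_log.csv")  # columns: query, visits
    queries = queries.sort_values("visits", ascending=False).reset_index(drop=True)

    # Bucket 1: top ~500 head queries by visits.
    head = queries.head(500)

    # Bucket 2: the next ~27k queries, sampled evenly across visit deciles.
    torso = queries.iloc[500:27500].copy()
    torso["decile"] = pd.qcut(torso["visits"], 10, labels=False, duplicates="drop")
    torso_sample = torso.groupby("decile", group_keys=False).apply(
        lambda g: g.sample(min(len(g), 150), random_state=42)
    )

    # Bucket 3: random picks from the long tail.
    tail_sample = queries.iloc[27500:].sample(500, random_state=42)

    query_set = pd.concat([head, torso_sample, tail_sample])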

Now, we will categorize this query set into the buckets mentioned below (and all possible combinations of them). The key objective here is that QUL should be able to correctly label these parameters within a search query.
Technical Buckets

  1. Product name / PRODUCT
  2. Brand
  3. Cat/Sub-Cat
  4. Brand + Cat/Sub-Cat
  5. PRODUCT + Cat/Sub-Cat
  6. Cat/Sub-Cat + Highlights/Attributes/Filter values
  7. PRODUCT + Cat/Sub-Cat + Highlights/Attributes/Filter values
  8. Brand + Cat/Sub-Cat + Highlights/Attributes/Filter values
  9. Brand + Highlights/Attributes/Filter values
  10. Highlights/Attributes/Filter values

Each bucket will be represented by at least 150 queries.
Category / Sub-Category Buckets (top in terms of revenue/orders)

  • Mobiles & Tablets
  • Computers & Peripherals
  • Appliances
  • TVs, Audio & Video
  • Men’s Fashion
  • Men’s Footwear
  • Women’s Clothing
  • Watches
  • Bags & Luggage
  • Automotive
  • Kitchenware
  • Home Furnishing
  • Toys & Games
  • Sports & Fitness

Each bucket will be represented by at least 100 queries.

As part of this exercise, we will also find out best practices of site search.

Input to Crowdsourcing Platform

The test user will be provided with a UI where the query and its respective top 10-20 results (exact count to be decided) are shown.

The results will be shown as per our business objective:

  1. New Release Plan: production results and test avatar results will be shown.
  2. Competitor Benchmarking: production results for the host site and for similar sites such as Flipkart, eBay and Amazon will be displayed.

User Interface

[Figure: user interface mock-up]

  1. Most Relevant, Best Result: Totally relevant. The document completely answers the question.
  2. Relevant, Good Result: Partly relevant. The information in the document is relevant to the question but not complete.
  3. Irrelevant, Somewhere Close: Related. The document mentions the subject or holds potentially good hyperlinks to relevant pages, but does not contain any actual information regarding the query itself.
  4. Completely Irrelevant, Useless: Not relevant/Spam. The document is off topic or spam, giving no information about the subject.

Serial No. | Subject | Question Asked | Score Assigned
1 | Most Relevant | Best Result | 3
2 | Relevant | Good Result | 2
3 | Irrelevant | Somewhere Close | 1
4 | Completely Irrelevant | Irrelevant, Useless | 0

Output from CrowdSourcing Platform

The user will be asked to mark each PRODUCT with respect to the query provided, and the scores will be stored per query in the following format:

Query | PRODUCT | Rank | User Scoring
Q1 | PRODUCT1 | 1 | 3
Q1 | PRODUCT2 | 2 | 0
Q1 | PRODUCTn | 4 | 2
Q2 | PRODUCT4 | 1 | 3
Q2 | PRODUCT9 | 2 | 0
Q2 | PRODUCT11 | 4 | 2
Q2 | PRODUCTz | 1 | 3

When multiple users provide feedback for the same query, the weighted average rank and weighted average user scoring will be considered.
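
As a sketch of that aggregation, assuming equal rater weights of 1.0 (the real weighting scheme, for example by rater reliability, is left open) and illustrative data:

    import pandas as pd

    # Raw judgements: one row per (query, PRODUCT, user); values are illustrative.
    judgements = pd.DataFrame({
        "query":   ["Q1", "Q1", "Q1", "Q1"],
        "product": ["PRODUCT1", "PRODUCT1", "PRODUCT2", "PRODUCT2"],
        "rank":    [1, 1, 2, 2],
        "score":   [3, 2, 0, 1],
        "weight":  [1.0, 1.0, 1.0, 1.0],  # rater weight; all 1.0 = simple average
    })

    def weighted_mean(group, col):
        return (group[col] * group["weight"]).sum() / group["weight"].sum()

    # One aggregated (rank, score) pair per query/PRODUCT.
    aggregated = (
        judgements.groupby(["query", "product"])
        .apply(lambda g: pd.Series({
            "rank":  weighted_mean(g, "rank"),
            "score": weighted_mean(g, "score"),
        }))
        .reset_index()
    )
    print(aggregated)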

Post Analysis of Scores

We will calculate the Normalized Discounted Cumulative Gain (NDCG) for each query (explained below). Once we have computed NDCG values for each query, we can average them across thousands of queries. We can then compare two algorithms: we take the mean NDCG value for each and check, using a statistical test (such as a two-sided t-test), whether one algorithm is better than the other, and with what confidence.

Calculation of Cumulative Gains

Cumulative Gain (CG) does not take the position of a result into account when assessing the usefulness of a result set; it is simply the sum of the score values of all PRODUCTs for a query. The CG at a particular rank position p is defined as:

CG_p = \sum_{i=1}^{p} rel_i

where rel_i is the user score of the PRODUCT at rank i.

Calculation for provided data

After computing the weighted average ranks and scores, the CG values are as follows:

PRODUCT | Rank | User Scoring | CG
PRODUCT1 | 1 | 3 | 3
PRODUCT2 | 2 | 0 | 3
PRODUCT3 | 3 | 2 | 5
PRODUCT4 | 4 | 3 | 8
PRODUCTn | 5 | 1 | 9
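
The CG column is simply a running sum of the user scores; a minimal sketch in Python using the scores from the table above:

    # Weighted-average user scores for one query, already ordered by rank.
    scores = [3, 0, 2, 3, 1]

    # CG at rank p is the sum of the scores from rank 1 to p.
    cg, running = [], 0
    for rel in scores:
        running += rel
        cg.append(running)

    print(cg)  # [3, 3, 5, 8, 9] -- matches the CG column above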

Calculation of Discounted Cumulative Gains

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the score value is reduced logarithmically in proportion to the position of the result. The discounted CG accumulated at a particular rank position p is given by:

DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2(i)}

where rel_1 is the score of the PRODUCT in the top position and rel_i is the score of the PRODUCT at position i.

Calculation for provided data

PRODUCT | Rank (i) | User Scoring (rel) | CG | log2(i) | rel / log2(i) | DCG
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323
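
A sketch of the same DCG computation in Python, following the convention used in the table (the rank-1 score is taken as-is, and scores at rank i >= 2 are divided by log2(i)):

    import math

    def dcg(scores):
        """Discounted Cumulative Gain for a list of relevance scores in ranked order."""
        total = 0.0
        for i, rel in enumerate(scores, start=1):
            total += rel if i == 1 else rel / math.log2(i)
        return total

    # User scores ordered by rank, as in the table above.
    print(round(dcg([3, 2, 3, 0, 1]), 3))  # 7.323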


Calculation of Normalized Discounted Cumulative Gains

The Normalized part in NDCG allows us to compare DCG values between different queries.

Search result lists vary in performance depending upon the query. It’s not fair to compare DCG values across queries because some queries are easier than others: for example, maybe it’s easy to get four perfect results for the query samsung s4, and much harder to get four perfect results for short micro usb cable.

This is done by normalizing DCG with respect to the Ideal Discounted Cumulative Gain (IDCG), which is the best possible score given the results we’ve seen so far.
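
In formula form, at rank position p:

NDCG_p = \frac{DCG_p}{IDCG_p}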

Example: if the best possible scores for a query, in order of rank, are 3 3 2 2 0, then IDCG = 8.01.

Our NDCG is the DCG for the given result set divided by the ideal DCG. Now we can compare scores across queries, since we’re comparing percentages of the best possible arrangement rather than raw scores.

PRODUCT | Rank (i) | User Scoring (rel) | CG | log2(i) | rel / log2(i) | DCG | NDCG
PRODUCT1 | 1 | 3 | 3 | 0 | N/A | 3 | 0.37
PRODUCT2 | 2 | 2 | 5 | 1 | 2 | 5 | 0.62
PRODUCT3 | 3 | 3 | 8 | 1.585 | 1.892 | 6.892 | 0.86
PRODUCT4 | 4 | 0 | 8 | 2 | 0 | 6.892 | 0.86
PRODUCTn | 5 | 1 | 9 | 2.322 | 0.431 | 7.323 | 0.91

NDCG = 0.91
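
A minimal NDCG sketch reusing the dcg helper above. Here the ideal ordering is assumed to be the observed scores sorted best-first; with that assumption the final NDCG comes out around 0.94, while the worked table above uses IDCG = 8.01 and therefore reports 0.91.

    def ndcg(scores, ideal_scores=None):
        """NDCG = DCG of the observed ranking divided by DCG of the ideal ranking (IDCG)."""
        if ideal_scores is None:
            # Assumption: the ideal ranking is the observed scores sorted best-first.
            ideal_scores = sorted(scores, reverse=True)
        return dcg(scores) / dcg(ideal_scores)

    observed = [3, 2, 3, 0, 1]       # user scores in ranked order, as in the table above
    print(round(ndcg(observed), 2))  # ~0.94 with this ideal ordering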

Final Score Calculation for All Queries

Once we’ve computed NDCG values for each query, we can average them across thousands of queries.

Testing Across Various Setups

Once the score is calculated for each setup (production, avatar, competitor), the algorithms of any two setups will be compared using a statistical test (such as a two-sided t-test) to determine whether one is better than the other, and with what confidence.
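
A sketch of that comparison with scipy, assuming we already have per-query NDCG lists for two setups on the same query set (the values below are illustrative; a paired two-sided t-test is used since both setups are scored on the same queries):

    from statistics import mean
    from scipy import stats

    # Per-query NDCG values for two setups (e.g. production vs. avatar), aligned on the same queries.
    ndcg_production = [0.91, 0.78, 0.84, 0.66, 0.95]
    ndcg_avatar     = [0.93, 0.81, 0.80, 0.72, 0.97]

    print("mean NDCG production:", mean(ndcg_production))
    print("mean NDCG avatar:    ", mean(ndcg_avatar))

    # Paired two-sided t-test on the per-query NDCG values.
    t_stat, p_value = stats.ttest_rel(ndcg_avatar, ndcg_production)
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}")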

Saugata is a Senior Product Manager at a leading retail ecommerce player and a consultant with Zombie Software. You can connect on LinkedIn.
