Sanity Checking Product Recommendations Before Production
In a previous blog post, Jon spoke about our ventures into the world of personalisation within Depop. He described one of our new features, product recommendations, as an example of one of our machine learning projects. As part of the delivery, we had to answer the following…
How do we test that?
This question should always be at the forefront of your mind when planning a new feature. We need to be certain that anything we push live has no bugs or performance issues; after all, they give our users a bad experience and we want to avoid this at all costs. However, the answer isn’t always that straightforward. Sometimes a feature can be covered by a few simple unit tests and an integration test, but other times it involves a lot more thinking. A machine learning project is no different. Testing a machine learning (ML) algorithm isn’t easy. The size of the output data we are using means validating each output is almost impossible and very often, as in the case of our product recommendations, the outputs of an ML model are subjective and depend on users’ preferences. For example, imagine the output of a model is “teal vintage top”. One person might think it’s green not teal, another high-street not vintage and another may think it’s not even a top. There’s no real way we can test this with a traditional unit or integration test, so we have to think harder.
One of the most widely accepted ideas is to measure not the output of the model itself but the impact it has on related business metrics in production. For our product recommendations, how does a change to the algorithm affect the number of views, likes or purchases of products across the platform? Tracking these values over a period of time gives us a good idea of the performance of our model. This, however, is an unusual workflow in automated software testing. Normally, we develop a feature locally, then test it in our staging environment before pushing it into production. We started to think about whether it was possible to perform worthwhile tests on ML algorithms in staging, before they make it to production.
We took our product recommendations system as our guinea pig. As Jon explained, the algorithm relies on a user’s interactions (like, save, comment, message etc.). We have some criteria based upon these interactions such that if a user satisfies them then we recommend them products. Simple. However, we had no users in staging that would satisfy these criteria.
What are the differences between our staging and production environments?
- The data is made up
Our staging environment is regularly populated with mock data through our testing frameworks. However, a lot of it isn’t useful for testing ML models. For instance, when testing whether a product contains a description, any string like “foo” will suffice. However, this doesn’t help us when we try to make predictions based upon the content of that description. When all of the elements of the mock data are combined, they often don’t build a very realistic picture of a real-life user or product. In particular, any given user doesn’t have an associated style or preference.
- The data isn’t dynamic
Once the users and products have been created in staging, they have a short lifespan. They may be instructed to perform a small number of actions, after which they become dormant. In production, our users come back to the app regularly to interact with products. We need our staging users to be active enough to be included in our recommendations.
- The size of data
The size of our staging environment is minute compared to production. We have about half a million products in staging compared with 100 million in production. As with most ML algorithms, the larger the data set, the more effective it will be.
In order to test our model effectively, we needed to address these differences.
How did we do it?
We need to create mock data that satisfies our inclusion rules for recommendations and mimics production data as closely as possible. Our assumption for recommendations is that similar users should interact with similar products, so if we can group our users in a structured way, we can test that the output of our recommendations is sane. We took the user and product ids to partition our user and product base. For example, we can calculate the value of an id modulo 5 and partition into 5 classes: 0, 1, 2, 3, 4. Below we have two cartoon characters, a goth and a hipster. The goth becomes a U0 user and the hipster a U1 user.
We can think of these buckets of users as users with a similar style. With our assumption that similar users interact with similar products, we can construct a weighted mapping between user classes and product classes in order to represent the preferences of a user. Take user class 0 (U0).
Suppose that the black boots represent product class P0. Then we’d expect that our goth character would be interested in this product because it fits their style, and we assign a probability of interaction, say 95%. We continue assigning probabilities between U0 users and the remaining product classes.
In general, a user of class n should be interested in product class n mod 5 the most, and product class (n + 4) mod 5 the least.
| User class (Un) | Product class (Pn) | Probability of preference (%) |
| U0 | P0 | 95 |
| U0 | P1 | 50 |
| U0 | P2 | 15 |
| U0 | P3 | 5 |
| U0 | P4 | 1 |
| U1 | P0 | 1 |
| U1 | P1 | 95 |
| ... | ... | ... |
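The class partition and the preference table can be sketched in a few lines of Python. This is only an illustration, not our production code: the names `class_of`, `preference_probability` and `BASE_WEIGHTS` are hypothetical, and the sketch assumes that every user class reuses the U0 weights rotated by its class index, which matches the rows shown above but is an assumption for the rows that are elided.

```python
NUM_CLASSES = 5

# Preference weights (in percent) for a U0 user against P0..P4, from the table.
BASE_WEIGHTS = [95, 50, 15, 5, 1]


def class_of(entity_id: int) -> int:
    """Class of a user or product: its id modulo the number of classes."""
    return entity_id % NUM_CLASSES


def preference_probability(user_id: int, product_id: int) -> int:
    """Percent chance that this user interacts with this product.

    Assumption: class Un uses the U0 weights shifted by n, so Un prefers
    Pn the most and P((n + 4) mod 5) the least.
    """
    offset = (class_of(product_id) - class_of(user_id)) % NUM_CLASSES
    return BASE_WEIGHTS[offset]


# User 10 is a U0 and product 7 is a P2, so the table gives 15.
u0_p2 = preference_probability(10, 7)
# User 6 is a U1 and product 5 is a P0, its least preferred class.
u1_p0 = preference_probability(6, 5)
```

Deriving the class from the id keeps the mock data deterministic: anyone can work out a user's expected preferences from the id alone, without a lookup table.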
Next we need to create the interactions between these users and construct an artificial social graph. In order to do so, we created an AWS Lambda function that runs daily.
- First, it creates the users and products as described above, with an even distribution across classes.
- Next, for each user, it randomly selects a product from the pool of created products.
- Then it decides whether to interact based on the preference relationships we designed above.
- To make that decision, we generate a random number between 0 and 100 and use the preference probability between the user and product as a threshold that controls whether a random type of Depop product interaction (like, save…) occurs.
if (randomNumber < preferenceProbability)
Suppose we had the pair of a U0 and a P2. The preference probability between this pair is 15%. If our random number is less than 15, we perform an interaction; if not, nothing happens and we repeat the process. Using this methodology and a suitable number of users and products, we construct a graph large enough for these users to meet the criteria for inclusion in our recommendation algorithm. When our Spark job runs in staging, these users will be eligible and will be recommended products.
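As a rough sketch of that decision step, the snippet below draws a random number and compares it with the preference probability. It is not the code of our Lambda: `maybe_interact`, the record shape and the `INTERACTION_TYPES` list are hypothetical, and a real job would call the Depop APIs instead of returning a dictionary.

```python
import random

# Hypothetical labels for the interaction kinds mentioned in this post.
INTERACTION_TYPES = ["like", "save", "comment", "message"]


def maybe_interact(user_id, product_id, preference_probability, rng):
    """Return an interaction record, or None if the random draw fails.

    `preference_probability` is a percentage between 0 and 100.
    """
    random_number = rng.random() * 100  # uniform in [0, 100)
    if random_number < preference_probability:
        return {
            "user_id": user_id,
            "product_id": product_id,
            "type": rng.choice(INTERACTION_TYPES),
        }
    return None


rng = random.Random(42)  # seeded only so the sketch is repeatable
# A U0 user paired with a P2 product has a 15% preference probability.
draws = [maybe_interact(0, 2, 15, rng) for _ in range(10_000)]
hit_rate = sum(d is not None for d in draws) / len(draws)
```

Over many draws, `hit_rate` should land close to 0.15, which is what gives the generated graph its intended skew towards preferred classes.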
The complete list of users and products is logged to a Slack channel, so members of the engineering team are able to log in to those users’ accounts, allowing them to check other aspects of the recommendations process.
Now that we have predetermined preferences between users and products, we’d hope to see a skewed distribution in the relationship between user and product class in recommendations.
So does it actually work?
We can extract the results that we generated for the users and then count the number of recommendations from each product class.
For example, for U0 users:
| Product Class | Number of Recommendations |
| P0 | 100 |
| P1 | 30 |
| P2 | 10 |
| P3 | 6 |
| P4 | 8 |
By aggregating each user’s distribution of recommended products in order of their personal product class preference, we can check whether our algorithm produces sensible results.
Well, the distribution wasn’t as skewed as we might have expected; it seemed fairly evenly distributed. On reflection, we concluded that the graphs we created weren’t representative of the ones we see in the wild. Our real users belong to much sparser communities, whereas the graph we create in our staging interactions Lambda is highly connected.
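One way to express this aggregation is sketched below. The function name `rank_distribution` and the input shape (a mapping from user id to recommended product ids) are assumptions made for the example, as is the rule that a user of class n ranks product classes n, n + 1, … (mod 5) from most to least preferred.

```python
from collections import Counter

NUM_CLASSES = 5


def rank_distribution(recommendations):
    """Share of recommendations per preference rank, across all users.

    `recommendations` maps a user id to a list of recommended product ids.
    Rank 0 is the user's most preferred product class, rank 4 the least.
    """
    counts = Counter()
    for user_id, product_ids in recommendations.items():
        for product_id in product_ids:
            # Same value as (product class - user class) mod 5.
            rank = (product_id - user_id) % NUM_CLASSES
            counts[rank] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rank: counts[rank] / total for rank in range(NUM_CLASSES)}


# Tiny made-up input: user 0 (U0) and user 1 (U1).
example = rank_distribution({0: [0, 5, 1, 2], 1: [1, 6, 0]})
```

Aggregating by rank rather than by raw product class lets users from different classes be pooled into a single distribution.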
This could result in almost every product being recommended to each user in the cluster.
About 30% of recommended products were from the user’s most preferred product class, and 25% from the second most recommended class.
How can we use it?
Now suppose we wanted to make a change to our algorithm. We can deploy it to staging, run our interactions Lambda and calculate these same metrics. We can then compare the percentages of recommendations for each class and sanity check our assumption: do similar users get recommended similar products?
Suppose we release another version of our recommendations feature. We calculate the distribution of recommendations and see that it is completely different from that of our stable version. Similar users are being recommended products from their least preferred class. This is equivalent to our goth being recommended the Peppa Pig slippers. This would give our users a bad experience and might deter them from revisiting Depop in the future. We should not proceed with testing this version in production.
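A check like this could be automated along the following lines. This is a sketch under stated assumptions rather than our actual tooling: distributions are lists of shares indexed by preference rank (0 is most preferred), the `sanity_check` name and the `tolerance` value are hypothetical, and apart from the 30% and 25% figures quoted above the numbers are made up for illustration.

```python
def sanity_check(stable, candidate, tolerance=0.10):
    """Return a list of problems; an empty list means the check passed.

    `stable` and `candidate` are lists of shares indexed by preference rank.
    """
    problems = []
    # The least preferred class should not be recommended as often as
    # the most preferred one.
    if candidate[0] <= candidate[-1]:
        problems.append("least preferred class is recommended at least "
                        "as often as the most preferred class")
    # Flag any rank whose share moved a lot compared with the stable version.
    for rank, stable_share in enumerate(stable):
        drift = abs(candidate[rank] - stable_share)
        if drift > tolerance:
            problems.append(f"rank {rank} share moved by {drift:.2f}")
    return problems


stable_version = [0.30, 0.25, 0.20, 0.15, 0.10]      # illustrative shares
inverted_version = [0.10, 0.15, 0.20, 0.25, 0.30]    # the "Peppa Pig slippers" case

unchanged = sanity_check(stable_version, stable_version)
broken = sanity_check(stable_version, inverted_version)
```

Returning a list of problems rather than a single boolean makes it easier to see why a candidate version was rejected before it goes anywhere near production.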