Next-Basket Prediction using Bidirectional Transformers

Oct 2025 · Machine Learning · Sabanci University

Transformers · BERT4Rec · Deep Learning · Recommender Systems · E-Commerce

Supervisor: Dr. Yücel Saygın

Overview

Next-basket recommendation (NBR) asks: given a customer’s complete purchase history, what items will appear in their next shopping basket? Unlike single-item next-click prediction, NBR must infer an unordered set of potentially co-dependent items, a substantially harder problem that more closely mirrors actual consumer behavior, where purchases of complementary items (recipe ingredients, fashion ensembles, printer consumables) co-occur within the same transaction.

This project develops a progressive enhancement framework for NBR: beginning from a recurrent neural network baseline, each architectural and data-level modification is independently validated before the next is introduced, producing a clean ablation that isolates the contribution of each component.

Baseline and Enhancements

Baseline: Gated Recurrent Unit (GRU)

The starting architecture is a standard GRU encoder that processes the sequence of historical baskets and treats next-basket prediction as a multi-label classification problem over the item catalogue. GRUs capture temporal dependencies but are inherently unidirectional: the hidden state at each step summarizes only the preceding purchases, so item representations are never conditioned on the full sequential context.
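The multi-label framing can be illustrated without any model at all. The sketch below is not the project's code: the catalogue size, the item ids, and the scores are hypothetical, and the helper names `multi_hot` and `top_k_items` are introduced only for illustration. It shows how a basket becomes a multi-hot target vector and how per-item scores are turned back into a predicted basket.

```python
from typing import List, Sequence


def multi_hot(basket: Sequence[int], num_items: int) -> List[float]:
    """Encode a basket (a set of item ids) as a multi-hot target vector."""
    target = [0.0] * num_items
    for item in basket:
        target[item] = 1.0
    return target


def top_k_items(scores: Sequence[float], k: int) -> List[int]:
    """Return the k item ids with the highest scores (ties go to the lower id)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


# Hypothetical catalogue of 6 items; the true next basket holds items 1 and 4.
target = multi_hot([1, 4], num_items=6)

# Hypothetical per-item scores standing in for the encoder's output layer.
scores = [0.10, 0.80, 0.05, 0.20, 0.65, 0.30]
predicted = top_k_items(scores, k=2)
```

Because every item gets its own independent score, the prediction is a set rather than a single next item, which is what distinguishes this setup from next-click prediction.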

Enhancement 1: Bidirectional Transformer (BERT4Rec)

The GRU encoder is replaced with a bidirectional Transformer following the BERT4Rec architecture, which is trained with a Cloze-style masking objective: randomly masked items within the purchase sequence must be predicted from both left and right context simultaneously. This bidirectionality allows the model to capture co-purchase dependencies that a unidirectional encoder cannot. For example, if a user systematically buys items A, B, and C together, masking B and observing both neighbors A and C lets the model learn to recover B from that structure.
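A minimal sketch of how Cloze-style training examples might be built is shown below. It assumes a flat item sequence, a hypothetical `[MASK]` token, and a hypothetical `mask_prob` parameter; how the project flattens baskets into a sequence and which masking rate it uses are not stated above.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"  # hypothetical mask token, not taken from the project


def cloze_mask(
    sequence: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], Dict[int, str]]:
    """Mask random positions; return the masked sequence and the hidden items."""
    masked = list(sequence)
    targets: Dict[int, str] = {}
    for pos, item in enumerate(sequence):
        if rng.random() < mask_prob:
            masked[pos] = MASK
            targets[pos] = item
    # Guarantee at least one prediction target per non-empty sequence.
    if not targets and sequence:
        pos = rng.randrange(len(sequence))
        masked[pos] = MASK
        targets[pos] = sequence[pos]
    return masked, targets


sequence = ["A", "B", "C", "D", "E"]
masked, targets = cloze_mask(sequence, mask_prob=0.3, rng=random.Random(7))
```

Each entry in `targets` is a position the model would have to reconstruct from the unmasked items on both sides of it, which is the property a forward-only encoder lacks.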

Enhancement 2: Structured Data Augmentation

Four augmentation strategies address data sparsity, a pervasive challenge in behavioral recommendation datasets where many users have short or irregular purchase histories:

Technique	Mechanism
Item masking	Randomly mask items within baskets during training, creating additional masked prediction targets
Sequence cropping	Truncate purchase histories to variable lengths, forcing robustness to incomplete histories
Sequence reversing	Train on reversed purchase sequences, regularizing temporal ordering assumptions
Intra-basket sorting	Reorder items within baskets under different criteria, diversifying co-occurrence signals

These augmentations multiply the effective training signal for sparse users and reduce overfitting to specific sequential patterns.
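The four techniques in the table can be sketched as small functions over a purchase history represented as a list of baskets. This is an illustrative sketch under stated assumptions, not the project's implementation: the reserved `MASK_ID`, the choice to keep the most recent baskets when cropping, and the `key` parameter for sorting are all hypothetical.

```python
import random
from typing import Callable, List, Optional

Basket = List[int]
History = List[Basket]

MASK_ID = 0  # assumed reserved id; real item ids are assumed to start at 1


def mask_items(history: History, mask_prob: float, rng: random.Random) -> History:
    """Item masking: replace each item with MASK_ID with probability mask_prob."""
    return [
        [MASK_ID if rng.random() < mask_prob else item for item in basket]
        for basket in history
    ]


def crop_sequence(history: History, min_len: int, rng: random.Random) -> History:
    """Sequence cropping: keep a random number of the most recent baskets."""
    low = min(min_len, len(history))
    length = rng.randint(low, len(history))
    return [list(b) for b in history[len(history) - length:]]


def reverse_sequence(history: History) -> History:
    """Sequence reversing: reverse basket order, leaving each basket intact."""
    return [list(b) for b in reversed(history)]


def sort_within_baskets(
    history: History,
    key: Optional[Callable[[int], float]] = None,
    descending: bool = False,
) -> History:
    """Intra-basket sorting: reorder items in each basket by the given criterion."""
    return [sorted(b, key=key, reverse=descending) for b in history]


history = [[3, 1, 2], [5, 4], [9, 7, 8]]
rng = random.Random(0)

augmented = [
    mask_items(history, mask_prob=0.2, rng=rng),
    crop_sequence(history, min_len=2, rng=rng),
    reverse_sequence(history),
    sort_within_baskets(history),                    # by item id
    sort_within_baskets(history, descending=True),   # a second criterion
]
```

Every function returns a new history and leaves the input untouched, so the original sequence and any number of augmented variants can be added to the training set side by side.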

Experimental Results

Experiments were conducted on the Retail Rocket dataset, a large-scale e-commerce clickstream and purchase dataset. Performance is measured by Recall@10, the fraction of true next-basket items appearing in the top-10 predictions:

Model	Recall@10
GRU baseline	0.0177
Transformer + Augmentation	0.0658
BERT4Rec standalone	0.0731

Replacing the GRU with the Transformer architecture accounts for the dominant share of the improvement. BERT4Rec standalone achieved the highest Recall@10 (0.0731). The combined Transformer + augmentation system scored lower in aggregate (0.0658) but offered the best robustness-performance balance across user history lengths.
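For reference, Recall@10 as defined above can be computed as follows. The ranked list and basket are made-up examples, and averaging per-user scores with a simple mean is an assumption; the aggregation used in the experiments is not stated.

```python
from typing import Iterable, List, Sequence


def recall_at_k(ranked_items: Sequence[int], true_basket: Iterable[int], k: int = 10) -> float:
    """Fraction of true next-basket items that appear in the top-k predictions."""
    truth = set(true_basket)
    if not truth:
        return 0.0  # convention chosen here for empty baskets
    hits = len(truth.intersection(ranked_items[:k]))
    return hits / len(truth)


def mean_recall_at_k(
    rankings: Sequence[Sequence[int]], baskets: Sequence[Iterable[int]], k: int = 10
) -> float:
    """Average per-user Recall@k (assumed aggregation: unweighted mean)."""
    scores: List[float] = [recall_at_k(r, b, k) for r, b in zip(rankings, baskets)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical ranking of 11 items; item 11 sits at rank 11, outside the top 10.
ranked = [5, 2, 9, 1, 7, 3, 8, 4, 6, 0, 11]
true_basket = {2, 7, 11, 42}

score = recall_at_k(ranked, true_basket, k=10)  # 2 of 4 true items hit: 0.5
```

The metric is normalized by basket size, so a user with a large basket needs more correct items in the top 10 to reach the same score as a user with a small one.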

Key Finding

Bidirectional context modeling, which conditions item predictions on both preceding and subsequent purchases within the masked training objective, substantially outperforms forward-only sequential models at capturing the co-purchase structure of shopping baskets. The roughly 4.1× improvement in Recall@10 over the GRU baseline (0.0177 → 0.0731) indicates that architectural choice is the dominant factor. Data augmentation did not raise aggregate Recall@10 (0.0658 with augmentation); its contribution is improved generalization on sparse histories.

The progressive enhancement methodology itself is a transferable contribution: it provides a rigorous framework for isolating the source of performance gains in recommendation systems, as opposed to reporting aggregate improvements from bundles of simultaneous changes.