Benchmarking TabPFN against XGboost and Catboost on Kaggle datasets
This study benchmarks CatBoost, XGBoost, and TabPFN across multiple Kaggle datasets, including classification, regression, and time series tasks. TabPFN showed strong performance on small datasets, often outperforming the boosting models without any feature engineering or preprocessing. However, it struggles with large datasets over 10,000 rows due to memory limitations, where XGBoost remains more effective, particularly for time series forecasting. Overall, TabPFN is a powerful tool for quick experimentation on smaller tabular datasets but is less practical for large-scale applications.
1. Objective
The objective of this study is to benchmark the performance of three models — CatBoost,
XGBoost, and TabPFN — across multiple Kaggle datasets. The aim is to evaluate how these
models perform on diverse tasks including binary classification, regression, and time series
forecasting. The focus is on comparing predictive accuracy, error metrics, and general
applicability of TabPFN (a transformer-based model designed for tabular data) against
well-established gradient boosting models.
- Note: For comparison purposes, we did not perform feature engineering, missing
value handling, or extensive data cleaning. All models were trained on the raw
datasets to observe their out-of-the-box capabilities.
2. Datasets
The experiments were conducted on four Kaggle datasets covering different predictive modeling
tasks. A summary of each dataset is provided below:
| Dataset Name | Task Type | Target Variable | Number of Samples | Number of Features | Notes |
|---|---|---|---|---|---|
| Titanic - Machine Learning from Disaster | Binary Classification | Survived (0/1) | 891 (train set) | 11 (after cleaning) | Predict passenger survival |
| House Prices - Advanced Regression Techniques | Regression | SalePrice | 1,460 | 79 | Predict house sale prices |
| Binary Prediction with a Rainfall Dataset | Binary Classification | RainTomorrow (Yes/No) | 2,190 | 12 | Predict if it will rain tomorrow |
| Forecasting Sticker Sales | Time Series Forecasting | num_sold | 230,130 | 5 (date, country, store, product, id) | Predict sticker sales; missing values present in num_sold |
3. Results
The models were evaluated based on metrics appropriate to each task. For the House Prices
dataset, RMSLE (Root Mean Squared Log Error) was used in accordance with Kaggle’s official
evaluation method
- Note: Due to TabPFN’s limitation of handling a maximum of 10,000 rows, for large
datasets (Sticker Sales), we performed random sampling of 10,000 entries and
further split them into training and validation sets. CatBoost and XGBoost were also
trained on the same sampled data to ensure a fair comparison.
| Dataset Name | Metric | CatBoost | XGBoost | TabPFN | Notes |
|---|---|---|---|---|---|
| Titanic - Machine Learning from Disaster | Accuracy | 0.77511 | 0.73584 | 0.78947 | |
| House Prices - Advanced Regression Techniques | RMSLE (Root Mean Squared Log Error) | 0.1283 | 0.15554 | 0.11359 | Evaluated on log(predictions)vslog(actual) as per Kaggle |
| Binary Prediction with a Rainfall Dataset | AUC ROC | 0.84231 | 0.8458 | 0.86564 | |
| Forecasting Sticker Sales | RMSE | 102.08 | 89.56 | 99.24 | All models were trained on a sampled dataset of 10,000 entries. |
4. Observations
- Titanic Dataset: TabPFN slightly outperformed CatBoost and XGBoost in terms of
accuracy, showing good performance on smaller classification tasks without feature
engineering. - House Prices: TabPFN achieved the lowest RMSLE, indicating strong generalization
ability for regression problems with log-transformed targets, even when trained on raw
data. - Rainfall Dataset: TabPFN achieved the highest AUC ROC, outperforming both boosting
models despite no additional data preprocessing or feature engineering. - Sticker Sales: XGBoost showed the best performance on time series forecasting.
TabPFN performed reasonably well, but its performance lagged
5. Limitations
During this benchmarking study, the following limitations were encountered:
- TabPFN Row Limitations: TabPFN officially supports datasets with up to 10,000 rows.
- Handling Large Datasets:
- For the Sticker Sales dataset (230,130 samples), attempts to train on the full
dataset by ignoring the 10,000-row limit resulted in failures:- On GPU (3.8 GB VRAM), the training failed due to out-of-memory errors.
- On CPU, running TabPFN on the full dataset caused the kernel to crash.
- For the Sticker Sales dataset (230,130 samples), attempts to train on the full
- Therefore, for Sticker Sales, we trained all models on a 10,000-entry random sample to
ensure comparability.
6. Conclusion
This benchmarking study demonstrates that TabPFN, trained on large-scale synthetic data and
leveraging in-context learning, can outperform or match established models like CatBoost and
XGBoost on small datasets with minimal preprocessing. Its ability to handle raw tabular data
without requiring feature engineering makes it a powerful tool for quick experimentation and
baseline modeling.
However, significant limitations arise when working with larger datasets beyond 10,000 rows, as
noted by the model’s creators. Attempts to run TabPFN on large datasets resulted in memory
overloads and kernel crashes, making it impractical for large-scale training.
In summary:
- TabPFN is highly competitive or even superior on smaller datasets without preprocessing.
- It is not recommended for large datasets due to its scalability limitations and resource
consumption issues. - In domain-specific or structured time series forecasting tasks, models like XGBoost still outperform TabPFN.
- Future improvements in scaling TabPFN and adapting it for specialized tasks could make
it a strong candidate for broader practical use.