Missing Values Handling for Machine Learning Portfolios

Journal: Journal of Financial Economics

Date: 2024-05-01

Author: Chen, Andrew Y.; McCoy, Jack

Abstract:
We characterize the structure and origins of missingness for 159 cross-sectional return predictors and study missing value handling for portfolios constructed using machine learning. Simply imputing with cross-sectional means performs well compared to rigorous expectation-maximization methods. This stems from three facts about predictor data: (1) missingness occurs in large blocks organized by time, (2) cross-sectional correlations are small, and (3) missingness tends to occur in blocks organized by the underlying data source. As a result, observed data provide little information about missing data. Sophisticated imputations introduce estimation noise that can lead to underperformance if machine learning is not carefully applied.

Background and Context

Machine Learning Challenge

When machine learning is applied to stock return prediction with many characteristics at once, researchers face a critical problem: most stocks are missing at least one characteristic, and standard methods cannot use observations with missing values directly.

Current Practice

Simply dropping stocks with missing values is not feasible: it would eliminate about 99% of stocks when 125 common predictors are analyzed simultaneously.

Research Approach

This study examines 159 stock market predictors from 1985 to 2021, comparing simple cross-sectional mean imputation with more sophisticated methods such as expectation-maximization (EM).

Prevalence of Missing Data Increases with Number of Predictors

As more predictors are considered, fewer stocks have complete data
With 125 predictors, only 0.9% of stocks have complete data for all of them
This demonstrates why imputation is necessary for machine learning applications
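
The shrinking share of complete cases can be illustrated with a minimal sketch. It uses a synthetic panel, not the paper's data: the 10,000 stocks, the 5% missing rate, and the function name complete_case_share are all assumptions made for illustration. Missingness here is independent across cells, whereas the paper reports that it occurs in blocks, so the exact shares will differ from Table 2.

```python
import numpy as np

def complete_case_share(X, n_predictors):
    """Share of rows (stocks) with no NaN among the first n_predictors columns."""
    observed = ~np.isnan(X[:, :n_predictors])
    return float(observed.all(axis=1).mean())

# Synthetic data (assumption): 10,000 stocks, 125 predictors,
# each value independently missing with probability 5%.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 125))
X[rng.random(X.shape) < 0.05] = np.nan

shares = {k: complete_case_share(X, k) for k in (1, 25, 50, 125)}
for k, share in shares.items():
    print(f"{k:>3} predictors: {share:.2%} of stocks have complete data")
```

Even a modest per-predictor missing rate compounds quickly: requiring all 125 predictors leaves well under 1% of the synthetic stocks.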

Simple Mean Imputation Performs Similarly to Complex Methods

Portfolios formed on neural network predictions earn nearly identical returns under simple and complex imputation
Equal-weighted portfolios achieve ~66-67% annual returns regardless of imputation method
Value-weighted portfolios achieve ~37-39% annual returns regardless of imputation method
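
Cross-sectional mean imputation can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the array layout (dates, stocks, predictors) and the function name impute_cross_sectional_mean are assumptions. Each missing value is replaced by the mean of that predictor across the stocks observed on the same date.

```python
import numpy as np

def impute_cross_sectional_mean(X):
    """Fill NaNs with the same-date mean of each predictor across stocks.

    X has shape (dates, stocks, predictors); NaN marks a missing value.
    A predictor with no observations on a date is left as NaN.
    """
    observed = ~np.isnan(X)
    count = observed.sum(axis=1, keepdims=True)
    total = np.nansum(X, axis=1, keepdims=True)
    means = np.divide(total, count,
                      out=np.full(total.shape, np.nan), where=count > 0)
    return np.where(observed, X, means)

# Toy example: one date, three stocks, two predictors.
X = np.array([[[1.0, np.nan],
               [3.0, 4.0],
               [np.nan, 8.0]]])
filled = impute_cross_sectional_mean(X)
print(filled[0])
```

If predictors are standardized within each date before imputation, this amounts to filling missing values with zero.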

Imputation Errors Are Larger for Smaller Stocks

Imputation errors decrease as firm size increases
Smallest stocks (decile 1) have ~50% larger errors than largest stocks (decile 10)
This suggests imputation quality varies systematically with firm size
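
One way to tabulate imputation errors by firm size is sketched below. It assumes errors are measured on values whose true realization is known (for example, observed values that were masked before imputing). The data are synthetic, and the noise pattern is built in by assumption so that small stocks have larger errors. The function name rmse_by_size_decile is hypothetical.

```python
import numpy as np

def rmse_by_size_decile(actual, imputed, market_equity):
    """Root mean squared imputation error within each market equity decile."""
    edges = np.quantile(market_equity, np.linspace(0.1, 0.9, 9))
    decile = np.digitize(market_equity, edges) + 1  # 1 = smallest, 10 = largest
    sq_err = (actual - imputed) ** 2
    return {d: float(np.sqrt(sq_err[decile == d].mean())) for d in range(1, 11)}

# Synthetic data (assumption): predictor noise shrinks with the size rank.
rng = np.random.default_rng(0)
n = 5_000
market_equity = rng.lognormal(size=n)
rank = market_equity.argsort().argsort() / (n - 1)   # 0 = smallest, 1 = largest
actual = rng.normal(scale=1.5 - 0.5 * rank)
imputed = np.full(n, actual.mean())                  # cross-sectional mean

rmse = rmse_by_size_decile(actual, imputed, market_equity)
for d, err in rmse.items():
    print(f"decile {d:>2}: RMSE {err:.3f}")
```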

Cross-Sectional Correlations Between Predictors Are Small

Most predictor pairs have correlations between -0.25 and +0.25
Only about 5% of pairs have correlations stronger than ±0.5
Low correlations mean that observed predictors carry little information about missing ones, which helps explain why simple mean imputation works well
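
The distribution of pairwise correlations can be computed along the following lines. This is a sketch on synthetic data, not the paper's procedure: each pair uses only the rows where both predictors are observed, and the min_obs threshold, the 20-predictor panel, and the single strongly correlated pair are assumptions for illustration.

```python
import numpy as np
from itertools import combinations

def pairwise_correlations(X, min_obs=30):
    """Correlation of every predictor pair over rows where both are observed."""
    corrs = []
    for i, j in combinations(range(X.shape[1]), 2):
        both = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
        if both.sum() >= min_obs:
            corrs.append(np.corrcoef(X[both, i], X[both, j])[0, 1])
    return np.array(corrs)

# Synthetic cross-section (assumption): 20 mostly independent predictors,
# with predictor 1 built as a noisy copy of predictor 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 20))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=5_000)
X[rng.random(X.shape) < 0.10] = np.nan

corrs = pairwise_correlations(X)
share_small = float(np.mean(np.abs(corrs) <= 0.25))
share_strong = float(np.mean(np.abs(corrs) > 0.5))
print(f"{corrs.size} pairs; |corr| <= 0.25: {share_small:.1%}; |corr| > 0.5: {share_strong:.1%}")
```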

Fewer Principal Components Needed When Using Scaled PCA

Scaled PCA achieves maximum returns with fewer components
Returns plateau after 30 components with scaled PCA vs. 70+ for standard PCA
This demonstrates more efficient dimension reduction with scaled PCA
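
The contrast between the two approaches can be sketched as follows. The sketch assumes one common definition of scaled PCA: each standardized predictor is multiplied by the slope from a univariate regression of the target on that predictor, and standard PCA is then applied to the scaled predictors. Whether this matches the paper's exact implementation is an assumption. The data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def principal_components(Z, n_components):
    """Scores on the leading principal components of Z (columns are demeaned)."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:n_components].T

def scaled_pca(X, y, n_components):
    """Scale each standardized predictor by its univariate slope on y, then run PCA."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    slopes = Xs.T @ (y - y.mean()) / (Xs ** 2).sum(axis=0)
    return principal_components(Xs * slopes, n_components)

def abs_corr(a, b):
    return float(abs(np.corrcoef(a, b)[0, 1]))

# Synthetic data (assumption): 50 predictors, only the first one predicts y.
rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 50))
y = X[:, 0] + rng.normal(size=2_000)

corr_standard = abs_corr(principal_components(X, 1)[:, 0], y)
corr_scaled = abs_corr(scaled_pca(X, y, 1)[:, 0], y)
print(f"first component vs. target: standard {corr_standard:.2f}, scaled {corr_scaled:.2f}")
```

Because the scaling shrinks predictors that are unrelated to the target, the leading components concentrate on predictive variation, which is one intuition for why fewer components may be needed.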

Contribution and Implications

Simple mean imputation is recommended for machine learning applications in asset pricing, as it performs similarly to complex methods while being more transparent and easier to implement
The dimensionality of expected returns depends on measurement choices: supervised methods such as scaled PCA require fewer components than standard PCA
Separate analysis by firm size improves performance, particularly for value-weighted portfolios

Data Sources

Missing Data Chart: Constructed from Table 2 Panel (b) showing percentage of stocks with complete data
Returns Chart: Based on Table 4 using neural network (NN1) results for different imputation methods
Error Chart: Constructed from Figure 4 showing imputation errors by market equity decile
Correlation Chart: Based on distribution described in Section 4.3 and Figure 3
PCA Chart: Constructed from comparison of results in Figures 2 and 6