Background and Context

Machine Learning Challenge

When machine learning is applied to stock market prediction using many characteristics at once, researchers face a critical problem: missing data values, which standard analysis methods cannot handle.

Current Practice

Simply dropping stocks with missing values is not feasible, as it would eliminate about 99% of stocks when 125 common predictors are analyzed simultaneously.

Research Approach

This study examines 159 stock market predictors over 1985-2021 and compares simple mean imputation with more sophisticated methods such as expectation-maximization (EM).

Prevalence of Missing Data Increases with Number of Predictors

  • As more predictors are considered, fewer stocks have complete data
  • With 125 predictors, only 0.9% of stocks have all data points
  • This demonstrates why imputation is necessary for machine learning applications
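
The shrinking pool of complete cases can be illustrated with a minimal sketch. It uses a synthetic panel in which every value is missing independently with probability 5%; that missingness pattern, the panel size, and the function name `complete_case_share` are illustrative assumptions, not the study's data or code.

```python
import numpy as np

def complete_case_share(X: np.ndarray, k: int) -> float:
    """Share of rows (stocks) with no NaN in the first k columns (predictors)."""
    return float((~np.isnan(X[:, :k])).all(axis=1).mean())

# Synthetic panel: 1,000 stocks x 125 predictors (illustrative only).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 125))
X[rng.random(X.shape) < 0.05] = np.nan  # assumed 5% independent missingness

# The share can only fall (or stay flat) as more predictors are required.
shares = {k: complete_case_share(X, k) for k in (1, 25, 50, 125)}
print(shares)
```

Even a modest per-predictor missing rate compounds quickly: under independence the expected complete share is 0.95 raised to the number of predictors.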

Simple Mean Imputation Performs Similarly to Complex Methods

  • Neural network predictions show nearly identical returns using simple vs. complex imputation
  • Equal-weighted portfolios achieve ~66-67% annual returns regardless of imputation method
  • Value-weighted portfolios achieve ~37-39% annual returns regardless of imputation method
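
For reference, one common reading of "simple mean imputation" is to replace each missing value with the mean of that predictor over the stocks where it is observed. The sketch below implements that reading; the function name `mean_impute` and the toy matrix are hypothetical, and the study's exact preprocessing may differ.

```python
import numpy as np

def mean_impute(X: np.ndarray) -> np.ndarray:
    """Replace NaNs in each column with that column's mean over observed rows."""
    col_means = np.nanmean(X, axis=0)          # per-predictor mean, ignoring NaNs
    return np.where(np.isnan(X), col_means, X)  # broadcast means into the gaps

# Toy cross-section: 3 stocks x 2 predictors.
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [np.nan, 8.0]])
X_imputed = mean_impute(X)
print(X_imputed)  # [[1. 6.] [3. 4.] [2. 8.]]
```

A column with no observed values at all would stay NaN under this sketch and would need separate handling.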

Imputation Errors Are Larger for Smaller Stocks

  • Imputation errors decrease as firm size increases
  • Smallest stocks (decile 1) have ~50% larger errors than largest stocks (decile 10)
  • This suggests imputation quality varies systematically with firm size
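
A comparison of this kind could be computed by grouping imputation errors (imputed minus true value, on entries that were deliberately held out) into size bins. The sketch below assumes root-mean-square error as the metric and uses a hypothetical helper `rmse_by_size_bin`; the study's own error measure may differ.

```python
import numpy as np

def rmse_by_size_bin(errors, size, n_bins=10):
    """Root-mean-square error within each size bin (bin 1 = smallest firms)."""
    errors = np.asarray(errors, dtype=float)
    size = np.asarray(size, dtype=float)
    ranks = np.argsort(np.argsort(size))       # 0 = smallest firm
    bins = ranks * n_bins // len(size) + 1     # equal-count bins 1..n_bins
    return {d: float(np.sqrt(np.mean(errors[bins == d] ** 2)))
            for d in range(1, n_bins + 1)}

# Toy example with two bins; use n_bins=10 for deciles.
size = [10.0, 20.0, 30.0, 40.0]
errors = [3.0, -3.0, 1.0, -1.0]
result = rmse_by_size_bin(errors, size, n_bins=2)
print(result)  # {1: 3.0, 2: 1.0}
```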

Cross-Sectional Correlations Between Predictors Are Small

  • Most predictor pairs have correlations between -0.25 and +0.25
  • Only about 5% of pairs have correlations exceeding 0.5 in absolute value
  • Low correlations help explain why simple mean imputation works well: observed predictors carry little information about missing ones
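
The link between low correlations and mean imputation can be made concrete with a standard result. Assume, purely for illustration, that a stock's predictor vector is jointly normal and split it into observed and missing parts (the subscripts o and m are notation introduced here):

```latex
\begin{aligned}
\begin{pmatrix} x_o \\ x_m \end{pmatrix}
&\sim \mathcal{N}\!\left(
\begin{pmatrix} \mu_o \\ \mu_m \end{pmatrix},
\begin{pmatrix} \Sigma_{oo} & \Sigma_{om} \\ \Sigma_{mo} & \Sigma_{mm} \end{pmatrix}
\right), \\
\mathbb{E}[x_m \mid x_o]
&= \mu_m + \Sigma_{mo}\,\Sigma_{oo}^{-1}\,(x_o - \mu_o).
\end{aligned}
```

When the cross-predictor covariances in \Sigma_{mo} are close to zero, the second term nearly vanishes and the conditional expectation collapses to \mu_m, which is exactly the value mean imputation fills in.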

Fewer Principal Components Needed When Using Scaled PCA

  • Scaled PCA achieves maximum returns with fewer components
  • Returns plateau after 30 components with scaled PCA vs. 70+ for standard PCA
  • This demonstrates more efficient dimension reduction with scaled PCA
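
To show how a supervised variant of PCA can concentrate predictive information in fewer components, here is a minimal sketch. It assumes "scaled PCA" means scaling each standardized predictor by its univariate regression slope on the target before running ordinary PCA; that reading, the function name `scaled_pca`, and the synthetic data are assumptions, not the study's implementation.

```python
import numpy as np

def scaled_pca(X: np.ndarray, y: np.ndarray, n_components: int) -> np.ndarray:
    """Return factor scores from PCA on predictors scaled by predictive slope."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each predictor
    yc = y - y.mean()
    # Univariate OLS slope of y on each standardized predictor.
    betas = Xs.T @ yc / (Xs ** 2).sum(axis=0)
    Z = Xs * betas                               # up-weight predictive columns
    # PCA via SVD; Z is already mean-zero column by column.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

# Synthetic data: only the first of 20 predictors drives the target.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
y = 0.5 * X[:, 0] + rng.standard_normal(500)

factors = scaled_pca(X, y, n_components=1)
corr = abs(np.corrcoef(factors[:, 0], X[:, 0])[0, 1])
print(round(corr, 3))
```

In this toy setup the first scaled component lines up almost entirely with the one informative predictor, whereas standard PCA on uncorrelated predictors would have no reason to favor it.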

Contribution and Implications

  • Simple mean imputation is recommended for machine learning applications in asset pricing, as it performs similarly to complex methods while being more transparent and easier to implement
  • The measured dimensionality of expected returns depends on how it is measured: supervised methods such as scaled PCA require fewer components than standard PCA
  • Separate analysis by firm size improves performance, particularly for value-weighted portfolios

Data Sources

  • Missing Data Chart: Constructed from Table 2 Panel (b) showing percentage of stocks with complete data
  • Returns Chart: Based on Table 4 using neural network (NN1) results for different imputation methods
  • Error Chart: Constructed from Figure 4 showing imputation errors by market equity decile
  • Correlation Chart: Based on distribution described in Section 4.3 and Figure 3
  • PCA Chart: Constructed from comparison of results in Figures 2 and 6