
Background and Context
Machine Learning Challenge
When applying machine learning to stock market prediction with many characteristics at once, researchers face a critical problem: missing values, which prevent methods that require complete data from being applied directly.
Current Practice
Simply dropping stocks with missing values is not feasible: requiring complete data on 125 common predictors at once would eliminate about 99% of stocks.
Research Approach
This study examines 159 stock market predictors over 1985-2021, comparing simple mean imputation with more sophisticated methods such as expectation-maximization (EM).
Prevalence of Missing Data Increases with Number of Predictors
- As more predictors are considered, fewer stocks have complete data
- With 125 predictors, only 0.9% of stocks have complete data
- This demonstrates why imputation is necessary for machine learning applications
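The shrinking share of complete cases can be illustrated with a minimal sketch on synthetic data. The 4% per-cell missing rate, the independence of missingness across cells, and the function name `complete_case_share` are assumptions made for illustration; they are not taken from the study, whose missingness patterns are likely more structured.

```python
import numpy as np

def complete_case_share(X, counts):
    """Share of rows with no NaN among the first k columns, for each k in counts."""
    observed = ~np.isnan(X)
    return {k: float(observed[:, :k].all(axis=1).mean()) for k in counts}

rng = np.random.default_rng(0)
n_stocks, n_predictors = 5000, 125
X = rng.standard_normal((n_stocks, n_predictors))

# Hypothetical missingness: each cell is missing independently with 4% probability.
X[rng.random(X.shape) < 0.04] = np.nan

shares = complete_case_share(X, [1, 25, 50, 100, 125])
for k, share in shares.items():
    print(f"{k:>3} predictors: {share:.1%} of stocks have complete data")
```

Even a small per-predictor missing rate compounds quickly: under independence the complete-case share is roughly 0.96 raised to the number of predictors.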
Simple Mean Imputation Performs Similarly to Complex Methods
- Neural network predictions show nearly identical returns using simple vs. complex imputation
- Equal-weighted portfolios achieve ~66-67% annual returns regardless of imputation method
- Value-weighted portfolios achieve ~37-39% annual returns regardless of imputation method
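For reference, simple mean imputation within a single cross-section can be sketched as follows. This is a minimal illustration, assuming one month of data stored as a stocks-by-predictors array with NaN marking missing values; the function name `impute_column_means` and the toy numbers are hypothetical, and any predictor transformations applied in the study are not reproduced here.

```python
import numpy as np

def impute_column_means(X):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X, axis=0)  # ignores NaN; warns if a column is all NaN
    return np.where(np.isnan(X), col_means, X)

# Toy cross-section: 3 stocks, 3 predictors.
X = np.array([
    [1.0, np.nan, 0.5],
    [np.nan, 0.2, 0.7],
    [3.0, 0.4, np.nan],
])
X_filled = impute_column_means(X)
print(X_filled)
```

In a panel, the same function would be applied month by month so that each missing value is filled with that month's cross-sectional mean.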
Imputation Errors Are Larger for Smaller Stocks
- Imputation errors decrease as firm size increases
- Smallest stocks (decile 1) have ~50% larger errors than largest stocks (decile 10)
- This suggests imputation quality varies systematically with firm size
Cross-Sectional Correlations Between Predictors Are Small
- Most predictor pairs have correlations between -0.25 and +0.25
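- Only about 5% of pairs have correlations exceeding 0.5 in absolute value
- Low correlations help explain why simple mean imputation works well: observed predictors carry little information about the missing ones, so more complex methods have little to improve on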
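The link between low correlations and mean imputation can be made concrete with a small sketch. It assumes a multivariate normal model with known mean and covariance, under which a missing value is filled with its conditional expectation given the observed values (EM-style imputation would estimate the mean and covariance iteratively rather than take them as given). The function name `gaussian_conditional_mean` and the two-predictor example are illustrative assumptions.

```python
import numpy as np

def gaussian_conditional_mean(mu, sigma, x):
    """Fill NaNs in x with E[x_missing | x_observed] under N(mu, sigma)."""
    miss = np.isnan(x)
    obs = ~miss
    filled = x.copy()
    if miss.any() and obs.any():
        # beta = sigma_oo^{-1} sigma_om, the regression of missing on observed entries
        beta = np.linalg.solve(sigma[np.ix_(obs, obs)], sigma[np.ix_(obs, miss)])
        filled[miss] = mu[miss] + beta.T @ (x[obs] - mu[obs])
    elif miss.any():
        filled[miss] = mu[miss]  # nothing observed: fall back to the mean
    return filled

def corr2(rho):
    """2x2 correlation matrix with off-diagonal rho."""
    return np.array([[1.0, rho], [rho, 1.0]])

mu = np.zeros(2)
x = np.array([2.0, np.nan])  # first predictor observed, second missing

weak = gaussian_conditional_mean(mu, corr2(0.1), x)
strong = gaussian_conditional_mean(mu, corr2(0.9), x)
print(weak[1], strong[1])
```

With a correlation of 0.1 the conditional fill stays close to the unconditional mean of zero, while a correlation of 0.9 moves it most of the way toward the observed value. When most pairwise correlations are small, the two approaches therefore produce similar fills.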
Fewer Principal Components Needed When Using Scaled PCA
- Scaled PCA achieves maximum returns with fewer components
- Returns plateau after 30 components with scaled PCA vs. 70+ for standard PCA
- This indicates that scaled PCA reduces dimension more efficiently than standard PCA
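The idea behind scaled PCA can be sketched as below. This follows one common formulation, in which each standardized predictor is multiplied by its univariate regression slope on the target before ordinary PCA is applied; whether this matches the study's exact procedure is an assumption. The function name `scaled_pca` and the synthetic data are hypothetical.

```python
import numpy as np

def scaled_pca(X, y, n_components):
    """Scale each standardized predictor by its univariate slope on y, then run PCA."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    # Univariate OLS slope of y on each standardized (mean-zero) predictor.
    slopes = Xs.T @ yc / (Xs ** 2).sum(axis=0)
    X_scaled = Xs * slopes
    # PCA via SVD; X_scaled is already mean-zero column by column.
    U, S, _ = np.linalg.svd(X_scaled, full_matrices=False)
    factors = U[:, :n_components] * S[:n_components]
    return factors, slopes

# Synthetic example: only the first of 20 predictors is related to the target.
rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * rng.standard_normal(n)

factors, slopes = scaled_pca(X, y, n_components=3)
corr = abs(np.corrcoef(factors[:, 0], X[:, 0])[0, 1])
print(f"|corr(first factor, predictor 0)| = {corr:.3f}")
```

Because the scaling shrinks predictors that are unrelated to the target, the leading components concentrate on predictive variation, which is consistent with needing fewer components than standard PCA.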
Contribution and Implications
- Simple mean imputation is recommended for machine learning applications in asset pricing, as it performs similarly to complex methods while being more transparent and easier to implement
- The dimensionality of expected returns depends on measurement choices, with supervised methods requiring fewer components than traditional approaches
- Separate analysis by firm size improves performance, particularly for value-weighted portfolios
Data Sources
- Missing Data Chart: Constructed from Table 2 Panel (b) showing percentage of stocks with complete data
- Returns Chart: Based on Table 4 using neural network (NN1) results for different imputation methods
- Error Chart: Constructed from Figure 4 showing imputation errors by market equity decile
- Correlation Chart: Based on distribution described in Section 4.3 and Figure 3
- PCA Chart: Constructed from comparison of results in Figures 2 and 6