Many texts are excellent sources of knowledge about individual statistical tools, but the art of data analysis is about choosing and using multiple tools. Instead of presenting isolated techniques, this text emphasizes problem-solving strategies that address the many issues arising when developing multivariable models using real data and not standard textbook examples. It includes imputation methods for dealing with missing data effectively, methods for dealing with nonlinear relationships and for making the estimation of transformations a formal part of the modeling process, methods for dealing with "too many variables to analyze and not enough observations," and powerful model validation techniques based on the bootstrap. This text realistically deals with model uncertainty and its effects on inference to achieve "safe data mining."

from a regression fit, and these diagrams can be used to communicate modeling results as well as to obtain predicted values manually even in the presence of complex variable transformations. Most of the methods in this text apply to all regression models, but special emphasis is given to some of the most popular ones: multiple regression using least squares, the binary logistic model, two logistic models for ordinal responses, parametric survival regression models, and the Cox semiparametric

between the k categories. It is recommended that this test be done before attempting to interpret individual parameter estimates. If the overall test is not significant, it can be dangerous to rely on individual pairwise comparisons because the type I error will be increased. Likewise, for a continuous predictor for which linearity is not assumed, all terms involving the predictor should be tested simultaneously to check whether the factor is associated with the outcome. This test should precede
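The pooled test described above can be sketched for the least squares case as follows. This is a minimal illustration, not the text's own software: the helper names `rss` and `pooled_f` and the toy data are hypothetical, and the statistic shown is the ordinary F statistic comparing an intercept-only model with a model containing all k - 1 indicator variables.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def pooled_f(y, groups):
    """F statistic testing all k - 1 indicator terms simultaneously.

    Hypothetical helper: compares the intercept-only model with the
    model that adds one indicator per non-reference category.
    """
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    levels = np.unique(groups)
    k, n = len(levels), len(y)
    intercept = np.ones((n, 1))
    # One indicator column per category except the reference (first) level
    dummies = np.column_stack([(groups == lev).astype(float) for lev in levels[1:]])
    full = np.hstack([intercept, dummies])
    df1, df2 = k - 1, n - k
    f_stat = ((rss(intercept, y) - rss(full, y)) / df1) / (rss(full, y) / df2)
    return f_stat, df1, df2

# Toy data (made up for illustration): three categories, three observations each
y = [1, 2, 3, 2, 3, 4, 6, 7, 8]
groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
f_stat, df1, df2 = pooled_f(y, groups)
print(f_stat, df1, df2)
```

The statistic is referred to an F distribution with `df1` and `df2` degrees of freedom. The same comparison of a full and a reduced model applies when the pooled terms are the several nonlinear terms of one continuous predictor.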

collinearity is with variance inflation factors or VIF, which in ordinary least squares are diagonals of the inverse of the $X'X$ matrix scaled to have unit variance (except that a column of 1s is retained corresponding to the intercept). Note that some authors compute VIF from the correlation matrix form of the design matrix, omitting the intercept. $\mathrm{VIF}_i$ is $1/(1 - R_i^2)$, where $R_i^2$ is the squared multiple correlation coefficient between column $i$ and the remaining columns of the design matrix. For
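As a rough illustration of the formula above, the following sketch computes each VIF by regressing one column on the others (with an intercept) and applying 1/(1 - R²). The function name `vif` and the simulated predictors are assumptions made for the example. The result is compared with the diagonal of the inverse correlation matrix, which is the correlation-matrix form mentioned above.

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of X.

    X holds the predictors only (no intercept column); an intercept is
    added to each auxiliary regression. Hypothetical helper.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        # Squared multiple correlation between column i and the rest
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[i] = 1.0 / (1.0 - r2)
    return out

# Simulated predictors (made up): x2 is nearly collinear with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
# Equivalent form: diagonal of the inverse of the correlation matrix
vifs_corr = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
print(vifs)
print(vifs_corr)
```

The two nearly collinear columns receive large VIFs, while the independent column stays close to 1.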

confounders.

Chapter 4. Multivariable Modeling Strategies

4.11 Further Reading

Some good general references that address modeling strategies are [185, 325, 400]. Simulation studies are needed to determine the effects of modifying the model based on assessments of "predictor promise." Although it is unlikely that this strategy will result in regression coefficients that are biased high in absolute value, it may on some occasions result in somewhat optimistic standard errors and a

predictors, the settings of interacting factors are very important. For others, these settings are irrelevant for this graph. As an example, the effect of increasing population density from its first quartile (1.23) to its third quartile (1.987) is to add approximately an average of 2.3% voters voting Democratic. The 0.95 confidence interval for this mean effect is [1.37, 3.23]. This range of 1.987 - 1.23 or 0.756 on the $\log_{10}$ population density scale corresponds to a $10^{0.756} = 5.7$-fold
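The back-transformation in this example is simple arithmetic and can be checked with a short sketch. The variable names are illustrative. The difference of 0.756 is taken as reported in the text; the quartiles as printed differ by 0.757, presumably because they are shown rounded.

```python
# Difference between the third and first quartiles on the log10 scale,
# as reported in the text (assumed to come from unrounded quartiles)
delta_log10 = 0.756

# A difference d on the log10 scale is a 10**d-fold change on the original scale
fold = 10 ** delta_log10
print(round(fold, 1))  # 5.7
```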