The Centers for Medicare and Medicaid Services (CMS) calculates an overall rating for each skilled nursing facility with its Five-Star Quality Rating System. CMS performs health inspections in nursing facilities and examines their staffing levels and quality measures. We apply regression to a dataframe containing information on the location, staffing, and performance of Medicare- and Medicaid-certified skilled nursing facilities in the United States to determine whether a regression model reaches the same conclusions as CMS when calculating overall ratings.
Regression, Correlation, and Best Features
The overall rating is our ordinal response variable, but which independent variables would be good predictors of it? We visualize a correlation matrix to highlight any multicollinearity and use a random forest regressor in Python to calculate permutation importance. Random Forest Regression is a supervised machine learning algorithm classified as an ensemble method. We train a model on the skilled nursing facility data and assess how accurately it predicts overall ratings.
As the name suggests, supervised learning uses data labelled with their true response values to predict outcomes for unseen data. Random Forest uses bootstrap aggregation: it randomly samples the data with replacement to train many decision tree regressors and then averages their predictions. Individual decision trees are notorious for overfitting and high variance, so an ensemble method helps mitigate these weaknesses.
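The bootstrap aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the model used in this analysis: the data are synthetic, the function name bagged_trees_predict is hypothetical, and the sketch omits the random feature subsampling at each split that a full Random Forest also performs.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def bagged_trees_predict(X_train, y_train, X_new, n_trees=50, seed=0):
    """Average the predictions of trees fit on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_new))
    # Aggregate: the ensemble prediction is the mean over all trees.
    return np.mean(predictions, axis=0)


# Synthetic data for illustration only (not the CMS dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(300, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=300)

# Train on the first 200 rows, predict the remaining 100.
y_hat = bagged_trees_predict(X[:200], y[:200], X[200:])
```

Because each tree sees a different resample, their individual errors partly cancel when averaged, which is why the ensemble has lower variance than any single tree.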
We considered the following independent variables for our model:
● Number of Certified Beds
● Health Inspection Ratings and Total Weighted Health Survey Score
● QM Ratings, Long-Stay and Short-Stay QM Ratings
● Staffing and RN Staffing Ratings; Reported Nurse Aide, LPN, RN, Licensed, Total Nurse, and Physical Therapist Staffing Hours per Resident per Day; Case-Mix Nurse Aide, LPN, RN, and Total Nurse Staffing Hours per Resident per Day; Adjusted Nurse Aide, LPN, RN, and Total Nurse Staffing Hours per Resident per Day
● Rating Cycle 1, 2, and 3 Total Number of Health Deficiencies, Number of Standard Health Deficiencies, Number of Complaint Health Deficiencies, Health Deficiency Scores, Number of Health Revisits, Health Revisit Scores, and Total Health Scores.
● Number of Facility Reported Incidents, Substantiated Complaints, Citations from Infection Control Inspections, Fines, and Payment Denials; Total Amount of Fines in Dollars; Total Number of Penalties.
● Ownership Type
● Binary categorical variables: Provider Resides in Hospital, Continuing Care Retirement Community, Special Focus Status_SFF Candidate, Abuse Icon, Most Recent Health Inspection More Than 2 Years Ago, Provider Changed Ownership in Last 12 Months.
● With a Resident and Family Council
We can see immediately that many of these variables will likely be multicollinear, since they are closely related to each other. We will have to weed out the highly correlated covariates to obtain a more accurate interpretation of the model. Below, we visualize the pairwise correlation coefficients between a select few covariates:

Figure 1: Correlation Matrix of Select Covariates in the NH Provider Info July 2021
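A correlation matrix like the one in Figure 1 can be computed directly from a dataframe. The sketch below uses synthetic data; the column names and the 0.8 cutoff are illustrative assumptions, not values taken from the CMS file or from this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few covariates; column names are hypothetical.
rng = np.random.default_rng(0)
qm = rng.normal(size=500)
df = pd.DataFrame({
    "qm_rating": qm,
    "long_stay_qm_rating": qm + rng.normal(scale=0.3, size=500),
    "staffing_rating": rng.normal(size=500),
})

# Pairwise Pearson correlation coefficients between all columns.
corr = df.corr()

# List covariate pairs whose absolute correlation exceeds a chosen cutoff.
threshold = 0.8  # illustrative cutoff, not taken from the analysis
cols = corr.columns
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 3))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

Pairs flagged this way are candidates for dropping one member before fitting a model whose coefficients need to be interpreted.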
It makes sense that the QM ratings are highly correlated with each other and that the staffing and health survey covariates follow suit. Interestingly, the total weighted health survey score is moderately linearly correlated with the total number of penalties. We calculated the permutation importance of all the features to see which ones are the most important predictors of overall ratings. We record a baseline coefficient of determination by passing out-of-bag samples through the Random Forest, then permute a single predictor column and pass the samples through the model again. A feature's importance is the resulting drop in score: the baseline score minus the score obtained after permuting that column.

Figure 2: Permutation Importance via Drop in Model Accuracy
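The permutation procedure can be sketched with scikit-learn's permutation_importance function. This is an illustration under stated assumptions: the data are synthetic, the response is a made-up weighted sum in which the fourth column plays no role, and a held-out test split stands in for the out-of-bag samples described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Four synthetic 1-5 "ratings"; not the CMS data.
X = rng.integers(1, 6, size=(600, 4)).astype(float)
# Hypothetical response: depends on columns 0-2 only; column 3 is irrelevant.
y = 0.6 * X[:, 0] + 0.25 * X[:, 1] + 0.15 * X[:, 2]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Baseline coefficient of determination (R^2) on held-out rows.
baseline = model.score(X_test, y_test)

# For each column: shuffle it, rescore, and record baseline minus new score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

# Column indices sorted from most to least important.
ranking = np.argsort(result.importances_mean)[::-1]
print(baseline, ranking, result.importances_mean)
```

A column the model does not rely on shows an importance near zero, because shuffling it leaves the score essentially unchanged.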
Although we had many covariates, Figure 2 reports feature importance for only three of them, which implies that permuting any of the other covariates produced no drop in model accuracy. We can therefore say that the health inspection rating, staffing rating, and QM rating are the important features in predicting a skilled nursing facility’s overall rating. This finding makes sense, since we already know that CMS uses these areas to calculate an overall rating. We computed Variance Inflation Factors for these three predictors to confirm that there is no multicollinearity between them.
Training an ordinary multiple linear regression model with these three covariates gave a coefficient of determination of 0.897, but we know that the interpretation of a linear model with an ordinal dependent variable will likely be inaccurate. We trained a separate Random Forest Regression model with just the three predictors and achieved a coefficient of determination of 0.9999995, indicating very high accuracy in predicting overall ratings. We then took the rounded average of each of the three covariates and fed them into the model to predict an overall rating. The rounded averages, in the order shown in Figure 2, are 3, 3, and 4, and the predicted overall rating is 4. So even without knowing exactly how CMS calculates overall ratings, we can use Random Forest Regression and skilled nursing facility information to calculate accurate overall ratings.
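The Variance Inflation Factor check mentioned above can be sketched as follows. For predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the remaining predictors; values near 1 indicate little collinearity. The helper name variance_inflation_factors and the synthetic data are assumptions for illustration, not the code or data used in this analysis.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(0)

# Three independent synthetic 1-5 ratings: VIFs should sit close to 1.
X_indep = rng.integers(1, 6, size=(1000, 3)).astype(float)
vifs = variance_inflation_factors(X_indep)

# Add a near-copy of column 0: its VIF should become very large.
near_copy = X_indep[:, 0] + rng.normal(scale=0.1, size=1000)
vifs_collinear = variance_inflation_factors(np.column_stack([X_indep, near_copy]))
print(vifs, vifs_collinear)
```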
By Beatrice Ling