The goal of this analysis is to predict the price \(y_i\) for each Airbnb listing \(i\) based on some independent variables \(X\).

Based on the literature dealing with Airbnb listing prices (e.g. Gibbs et al. (2018), Teubner et al. (2017)), I assume that the following variables have an effect on the price:

df.reg <- df %>%
  select(city, price, overall_satisfaction, pic_count, language,
         room_type, bed_type, reviews, accommodates) %>%   # keep only the variables used below
  mutate(city = as.factor(city),                           # encode categorical variables as factors
         language = as.factor(language),
         room_type = as.factor(room_type),
         log_price = log(price)) %>%                       # log-transform the dependent variable
  filter(price != 0)                                       # drop free listings (log(0) is undefined)

Check correlations to see if we have a problem of multicollinearity:

df.reg %>%
  select(price, overall_satisfaction, pic_count,
         reviews, accommodates) %>%
  cor() %>%
  knitr::kable()
|                      |      price | overall_satisfaction | pic_count |    reviews | accommodates |
|----------------------|-----------:|---------------------:|----------:|-----------:|-------------:|
| price                |  1.0000000 |            0.0375982 | 0.2828296 | -0.0224458 |    0.4936400 |
| overall_satisfaction |  0.0375982 |            1.0000000 | 0.0505758 | -0.0518818 |   -0.1057124 |
| pic_count            |  0.2828296 |            0.0505758 | 1.0000000 |  0.2398776 |    0.3288515 |
| reviews              | -0.0224458 |           -0.0518818 | 0.2398776 |  1.0000000 |    0.0664417 |
| accommodates         |  0.4936400 |           -0.1057124 | 0.3288515 |  0.0664417 |    1.0000000 |
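
The correlations among the predictors are moderate at most. As a complementary check, variance inflation factors can be computed on a preliminary model; this is a sketch that assumes the car package is installed and mirrors the model formula estimated below:

# Sketch: variance inflation factors as an additional multicollinearity check
# (generalised VIFs are reported for factor variables; values well above 5 would be a warning sign)
library(car)
vif(lm(log_price ~ overall_satisfaction + reviews + room_type +
         accommodates + language + city + pic_count, data = df.reg))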

To make predictions, the following steps are applied:

  1. Split the sample into a training and a test dataset.

  2. Estimate a linear model using the training data.

  3. Use the test dataset to evaluate the model: predict the response for the test observations and compare the predicted values with the actual values. The RMSE summarizes how far, on average, the predictions deviate from the actual values.

(1) Training / Test Split

bound <- floor((nrow(df.reg)/4)*3)             # use 75% of the observations for training

df.reg <- df.reg[sample(nrow(df.reg)), ]       # shuffle the rows randomly
df.train <- df.reg[1:bound, ]                  # training set (first 75%)
df.test <- df.reg[(bound+1):nrow(df.reg), ]    # test set (remaining 25%)
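
Note that the split relies on a random shuffle of the rows. Setting a seed before the sample() call above would make the split, and therefore all results below, reproducible; a minimal sketch (the seed value 42 is arbitrary):

set.seed(42)   # fix the RNG state so sample() returns the same row permutation on every run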

In part 1 we saw that our dependent variable, the price, is strongly right-skewed. This is problematic, as linear regression assumes (approximately) normally distributed errors. A common strategy for dealing with such skewed, strictly positive data is to take the logarithm (log-level model). Let's have a look at the distribution of log(price) in both subsets.

p1 <- ggplot(df.train, aes(log(price))) +
  geom_density(fill = col[3], color = "white") +
  labs(title = "Train Data")

p2 <- ggplot(df.test, aes(log(price))) +
  geom_density(fill = col[3], color = "white") +
  labs(y="", title = "Test Data")

gridExtra::grid.arrange(p1, p2, nrow = 1)

(2) Estimate Training Data

lm.train <- lm(log_price ~ overall_satisfaction + reviews
              + room_type + accommodates + language
              + city + pic_count
             , data = df.train,
             na.action=na.pass)

summary(lm.train) 
## 
## Call:
## lm(formula = log_price ~ overall_satisfaction + reviews + room_type + 
##     accommodates + language + city + pic_count, data = df.train, 
##     na.action = na.pass)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7883 -0.2584 -0.0098  0.2417  2.6023 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.1969417  0.0522917  61.137  < 2e-16 ***
## overall_satisfaction   0.0080867  0.0005400  14.976  < 2e-16 ***
## reviews               -0.0004763  0.0001021  -4.665 3.11e-06 ***
## room_typePrivate room -0.4360823  0.0072406 -60.227  < 2e-16 ***
## room_typeShared room  -0.6870419  0.0298054 -23.051  < 2e-16 ***
## accommodates           0.1202199  0.0023962  50.170  < 2e-16 ***
## languagegerman        -0.0335872  0.0071295  -4.711 2.49e-06 ***
## cityDresden           -0.3081885  0.0175867 -17.524  < 2e-16 ***
## cityFrankfurt          0.1720158  0.0153393  11.214  < 2e-16 ***
## cityHamburg            0.0188804  0.0091882   2.055  0.03991 *  
## cityKöln               0.0303130  0.0109738   2.762  0.00575 ** 
## cityMünchen            0.3409656  0.0095901  35.554  < 2e-16 ***
## cityStuttgart         -0.0069149  0.0197534  -0.350  0.72630    
## pic_count              0.0089100  0.0004225  21.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4009 on 15453 degrees of freedom
## Multiple R-squared:  0.4751, Adjusted R-squared:  0.4746 
## F-statistic:  1076 on 13 and 15453 DF,  p-value: < 2.2e-16

F-statistic

The F-statistic tests the collective effect of all predictor variables on the response variable. The null hypothesis is that a model with no independent variables (intercept only) fits the data as well as our model. Given the very small p-value, this null hypothesis can be rejected.
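
The same test can be reproduced explicitly by comparing the full model with an intercept-only model; a short sketch (both models are fitted on df.train):

lm.null <- lm(log_price ~ 1, data = df.train)   # model with no predictors (intercept only)
anova(lm.null, lm.train)                        # F-statistic and p-value as reported in summary(lm.train)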

Multiple R-squared

The \(R^2\) (multiple R-squared) value indicates what proportion of the variation of the endogenous variable is explained by the model. The adjusted R-squared additionally penalizes the number of exogenous variables \(k\) (it uses \(n-k-1\) instead of \(n-1\) degrees of freedom). In this case, the multiple R-squared is 0.4751, meaning that roughly \(48\%\) of the variance of log(price) can be explained by our model.
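
Both quantities can also be extracted directly from the fitted model object; a minimal sketch:

summary(lm.train)$r.squared       # proportion of the variance of log_price explained
summary(lm.train)$adj.r.squared   # the same, penalised for the number of predictors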

Coefficients

As I am estimating a log-level regression model (\(ln(y)=X\beta+\epsilon\)), the coefficients are interpreted as follows: \(\% \Delta y=100\beta \Delta X\)*. E.g. the coefficient of the rating variable (overall_satisfaction) can be interpreted as “if the rating rises by 1 star, we’d expect the price to change by about \(0.8 \%\), holding all other variables constant”. Interestingly, the number of reviews has a negative effect on the price. Although this result may not seem intuitive at first glance, it confirms the results of earlier studies on Airbnb listing prices (Teubner et al. (2017)). Overall, the exogenous variables are highly significant, with the exception of the city dummies for Hamburg and Köln (significant only at the 5% and 1% level, respectively) and for Stuttgart (not significant). The variable with the strongest effect is the dummy variable for shared room: if a listing offers a shared room instead of the entire home, the expected price is about \(50\%\) lower (\(100(e^{-0.687}-1)\approx -49.7\%\)), all other variables held constant; for a coefficient of this size the simple \(100\beta\) approximation overstates the effect (see the note and the sketch below).

(*Technically, the interpretation is \(\% \Delta y=100(e^{\beta}-1)\) but the previously mentioned interpretation is approximately true for values \(-0.1 < \beta < 0.1\))
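
The exact percentage effects can be computed directly from the fitted coefficients; a short sketch:

# Sketch: exact percentage effect of a one-unit change for each coefficient
round(100 * (exp(coef(lm.train)) - 1), 1)
# e.g. room_typeShared room: 100 * (exp(-0.687) - 1) is roughly -49.7 %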

Visualising Residuals

par(mfrow=c(2,2))
plot(lm.train)

Fitted vs Residual graph

In this plot each point is one listing: the model’s prediction (the fitted value) is on the x-axis and the residual is on the y-axis. The distance from the horizontal line at 0 shows how far off the prediction was for that listing.

Since…

Residual = Observed – Predicted

…positive values for the residual (on the y-axis) mean the prediction was too low, and negative values mean the prediction was too high; 0 means the guess was exactly correct.
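
This identity can be checked directly on the training model; a minimal sketch:

head(residuals(lm.train))                      # residuals stored in the model object
head(df.train$log_price - fitted(lm.train))    # observed minus predicted: should give the same values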

Ideally, the residuals in this plot should meet the following requirements:

  1. they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot.

  2. they’re clustered around the lower single digits of the y-axis.

  3. in general there aren’t clear patterns.

Normal Q-Q Plot

The Q-Q plot shows whether the residuals are normally distributed. Ideally, the points should lie on the dotted reference line; substantial deviations suggest the model should be reworked (e.g. with other transformations or additional predictors) to make the residuals closer to normal. In the plot above, most points lie on the line, with deviations only at the extremes (both tails).
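
For a closer look, the Q-Q plot can also be drawn on its own; a sketch in base R:

qqnorm(residuals(lm.train), main = "Normal Q-Q Plot of Residuals")   # sample vs. theoretical quantiles
qqline(residuals(lm.train), lty = 2)                                 # reference line through the quartiles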

Scale-Location

This plot shows how the residuals are spread along the range of fitted values and whether their variance is constant (homoscedasticity). A roughly horizontal line with evenly scattered points indicates equal variance.
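
A formal complement to this visual check is the Breusch-Pagan test for heteroskedasticity; a sketch, assuming the lmtest package is installed:

library(lmtest)
bptest(lm.train)   # a small p-value indicates non-constant residual variance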

Residuals vs Leverage

This plot helps to find influential observations. Here we need to check for points outside the dashed lines (Cook’s distance). A point outside the dashed lines is an influential observation; removing it would noticeably change the regression coefficients.
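
Influential observations can also be flagged numerically via Cook’s distance, e.g. with the common 4/n rule of thumb; a minimal sketch:

cooksd <- cooks.distance(lm.train)
head(sort(cooksd, decreasing = TRUE))    # the most influential listings
sum(cooksd > 4 / nrow(df.train))         # number of listings above the 4/n cut-off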

(3) Make Predictions & Evaluate the model

Next, I use the trained model to make predictions on my test data.

pred <- predict(lm.train, newdata = df.test)
# Combine predictions with test dataframe
pred <- as.data.frame(pred) 
pred$listing <- as.numeric(rownames(pred))
df.test$listing <- as.numeric(rownames(df.test))

pred.df <- left_join(pred, df.test %>%
                       select(log_price, listing),
                     by = "listing") %>%
  mutate(error = log_price-pred)

The RMSE (root mean squared error), also called RMSD (root mean squared deviation), is used to evaluate models by summarizing the differences between the actual (observed) and predicted values. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit.

\[ \text{RMSE} =\sqrt{\frac{1}{n}\sum^n_{i=1}(y_i-\hat{y_i})^2} \]

rmse <- function(error) {
  sqrt(mean(error^2))
  }

rmse.lm <- rmse(pred.df$error)
print(paste0("The RMSE is: ", rmse.lm))
## [1] "The RMSE is: 0.397974252398005"

The variance of log(price) in the test dataset is helpful to put the RMSE value into perspective:

print(paste0("The variance of log(price) is: ",var(df.test$log_price)))
## [1] "The variance of log(price) is: 0.310258926710368"

A plot of the predicted values against the actual values of log(price) shows the explanatory power of the prediction model.

p <- ggplot(pred.df, aes(pred, log_price)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = lm) +
  labs(x="Predicted y", y="Actual y",
       title = "Predicted vs. True Values",
       subtitle = "LM with structural Data",
       caption = paste0("RMSE: ", round(rmse.lm,3)))

ggsave("../figs/residplot1.png", p)
Residual Plot

In part 3 I will check whether the description texts of the listings provide better predictive power than the structural features estimated here.