Structured vs. Text Data to predict Airbnb prices

Can we improve the model from part 2 using the description text of the listing as exogenous variables?

As before, the goal is to predict the price \(y_i\) of each Airbnb listing \(i\) from a set of independent variables \(X\). In this model, the predictors are the counts \(c_i\) of the unique terms in each listing description. This is a regression problem like any other, except that the high dimensionality of \(c_i\) makes OLS and other standard techniques prone to overfitting.
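The overfitting problem can be illustrated with a small simulation. The sketch below is not based on the Airbnb data: the sample size, the number of predictors and the data-generating process are assumptions chosen purely for illustration. Only the first of 80 predictors actually affects the outcome, yet OLS fits the training sample far better than a fresh test sample.

```r
# Illustrative simulation (not the Airbnb data): OLS with many noise predictors
set.seed(1)
n <- 100   # observations per sample
p <- 80    # predictors, only the first one is informative

simulate <- function(n, p) {
  x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("x", 1:p)))
  y <- x[, 1] + rnorm(n)
  data.frame(y = y, x)
}

train <- simulate(n, p)
test  <- simulate(n, p)

fit <- lm(y ~ ., data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$y, fitted(fit))
rmse_test  <- rmse(test$y, predict(fit, newdata = test))

# The in-sample error is lower than the out-of-sample error: the model overfits
c(train = rmse_train, test = rmse_test)
```

The gap between the two errors widens as the number of predictors approaches the number of observations, which is the situation a document-term matrix creates.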

Here, we have \(d = 20{,}637\) documents (Airbnb listings), each of which is \(w\) words long. Each word is drawn from a vocabulary of \(p = 33{,}469\) possible words, so the unique representation of each document has dimension \(p^w\). A common strategy for dealing with the high dimensionality of text data is to estimate penalized linear models (Gentzkow, 2017).

I will estimate three different models on the same training data: (1) linear regression (as in part 2), (2) penalized linear regression using Ridge (L2 norm), and (3) penalized linear regression using Lasso (L1 norm). I will then use these models to make predictions on the test data to see which performs best.
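For reference, the three estimators differ only in the penalty added to the least-squares objective. The notation below is an assumption introduced for illustration and does not appear elsewhere in this post: \(\mathcal{T}\) is the set of training listings, \(c_{ij}\) is the count of term \(j\) in listing \(i\), \(\beta\) is the coefficient vector, and \(\lambda \ge 0\) is a tuning parameter that is typically chosen by cross-validation.

```latex
\[
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\sum_{i \in \mathcal{T}} \Big( y_i - \beta_0 - \sum_{j=1}^{p} c_{ij}\,\beta_j \Big)^2
+ \lambda\, P(\beta),
\qquad
P(\beta) =
\begin{cases}
0 & \text{(1) linear regression} \\
\sum_{j=1}^{p} \beta_j^2 & \text{(2) Ridge} \\
\sum_{j=1}^{p} |\beta_j| & \text{(3) Lasso}
\end{cases}
\]
```

Ridge shrinks all coefficients towards zero, while the Lasso penalty can set some of them exactly to zero and therefore also acts as a variable selection step.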

Document Term Matrix

A first step in using text data in a prediction model is to convert it to a Document Term Matrix, where each row is an observation (document) and each column is a unique term.

corp <- Corpus(VectorSource(df$text_cleaned))
dtm <- DocumentTermMatrix(corp)

dtm

## <<DocumentTermMatrix (documents: 20637, terms: 33469)>>
## Non-/sparse entries: 650794/690048959
## Sparsity           : 100%
## Maximal term length: 132
## Weighting          : term frequency (tf)

For the first five documents and terms 100 to 107, the Document Term Matrix looks like this:

inspect(dtm[1:5, 100:107])

## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 9/31
## Sparsity           : 78%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs bäcker bus ebenfalls ebenso eingerichtet emili englischen fuß
##    1      0   0         0      0            0     0          0   0
##    2      0   0         0      0            0     0          0   0
##    3      1   1         1      1            1     1          1   2
##    4      0   0         0      0            0     0          0   0
##    5      0   0         0      0            0     0          0   1
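To make the structure of such a matrix concrete, here is a minimal base R sketch that builds a document-term matrix by hand. The three toy documents are made up, and the whitespace tokenization is a simplifying assumption. Unlike DocumentTermMatrix(), which stores the counts in a sparse format, this sketch produces an ordinary dense matrix.

```r
# Toy documents (made up for illustration)
docs <- c("bright flat near the park",
          "cozy room near the station",
          "bright room bright kitchen")

# Split each document into lowercase terms
tokens <- strsplit(tolower(docs), "\\s+")

# The vocabulary is the sorted set of unique terms
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
toy_dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(toy_dtm) <- seq_along(docs)

toy_dtm
```

Each row is one document and each column one term of the vocabulary. Even in this tiny example most cells are zero, because every document uses only a few of the available terms.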

Matrices in text analysis problems tend to be very sparse. That is, most of their elements are zero, which implies that many columns carry little information and that many of the corresponding parameters are uninformative.

Reducing sparsity tends to both reduce overfitting and improve the predictive ability of the model. Here, we remove all terms whose sparsity (the share of documents in which the term does not occur) is 99% or higher, so that only terms appearing in more than 1% of the listings remain.

dtm <- removeSparseTerms(dtm, 0.99)
dtm

## <<DocumentTermMatrix (documents: 20637, terms: 586)>>
## Non-/sparse entries: 403573/11689709
## Sparsity           : 97%
## Maximal term length: 21
## Weighting          : term frequency (tf)

inspect(dtm[1:5, 1:7])

## <<DocumentTermMatrix (documents: 5, terms: 7)>>
## Non-/sparse entries: 7/28
## Sparsity           : 80%
## Maximal term length: 12
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs ausgestattet bad badewanne befindet gelegen gemütlichen gibt
##    1            1   1         1        1       1           1    1
##    2            0   0         0        0       0           0    0
##    3            0   0         0        0       0           0    0
##    4            0   0         0        0       0           0    0
##    5            0   0         0        0       0           0    0
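The filtering rule behind removeSparseTerms() can be sketched in base R on a toy count matrix. The matrix and the helper drop_sparse_terms() are hypothetical and only mimic the idea of the tm function: a term is kept if its sparsity is below the chosen threshold.

```r
# Toy count matrix: 5 documents x 4 terms (made up for illustration)
counts <- matrix(c(1, 0, 0, 2,
                   0, 0, 0, 1,
                   3, 0, 0, 0,
                   0, 1, 0, 1,
                   1, 0, 0, 1),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(1:5, c("flat", "garden", "sauna", "room")))

# Per-term sparsity: share of documents in which the term does not occur
term_sparsity <- colMeans(counts == 0)

# Hypothetical helper: keep only terms whose sparsity is below the threshold
drop_sparse_terms <- function(m, max_sparsity) {
  m[, colMeans(m == 0) < max_sparsity, drop = FALSE]
}

reduced <- drop_sparse_terms(counts, 0.5)
colnames(reduced)
```

With a threshold of 0.99 and 20,637 listings, a term has to occur in at least 207 listings to survive, which is why the vocabulary shrinks from 33,469 to 586 terms.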

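# Convert the document-term matrix to a data frame (one column per term)
dtm.df <- as.matrix(dtm) %>%
  as.data.frame()

# Add a document id to both data frames so that they can be merged
dtm.df$document <- as.integer(rownames(dtm.df))
df$document <- as.integer(rownames(df))

# Attach the price, drop listings with a price of zero (log(0) is -Inf),
# and use the log price as the dependent variable
df.reg <- dtm.df %>%
  left_join(df %>%
              select(document, price),
            by = "document") %>%
  filter(price != 0) %>%
  mutate(log_price = log(price)) %>%
  select(-document, -price)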

Training / Test Split

# Number of training rows: 75% of the data, the remaining 25% form the test set
bound <- floor((nrow(df.reg)/4)*3)
# Shuffle the rows (no seed is set, so the split differs between runs)
df.reg <- df.reg[sample(nrow(df.reg)), ]

# Training data
df.train <- df.reg[1:bound, ]
x.train <- as.matrix(df.train %>% select(-log_price))
y.train <- as.matrix(df.train %>% select(log_price))

# Test data
df.test <- df.reg[(bound+1):nrow(df.reg), ]
x.test <- as.matrix(df.test %>% select(-log_price))
y.test <- as.matrix(df.test %>% select(log_price))
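Because sample() draws a new permutation on every run, the exact split and all downstream estimates change between runs. A minimal sketch of a reproducible 75/25 split is shown below. The toy data frame and the seed value are assumptions for illustration and are not the values used to produce the results in this post.

```r
# Toy data standing in for df.reg (values are made up)
toy <- data.frame(log_price = log(seq(20, 210, by = 10)),
                  wifi = rep(c(0, 1), 10))

set.seed(123)                            # arbitrary seed, fixes the shuffle
n_train   <- floor(0.75 * nrow(toy))     # 75% of the rows for training
train_idx <- sample(nrow(toy), n_train)  # row indices drawn without replacement

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

c(train = nrow(toy_train), test = nrow(toy_test))
```

Indexing with train_idx and -train_idx guarantees that no row ends up in both sets, and rerunning the code with the same seed reproduces the same split.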

Estimation

(1) Linear Regression

fit.lm <- lm(log_price~., data = df.train)

summary(fit.lm)

## 
## Call:
## lm(formula = log_price ~ ., data = df.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17999 -0.28385 -0.01883  0.26048  2.67883 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.2272067  0.0099419 425.191  < 2e-16 ***
## ausgestattet           0.0015780  0.0175820   0.090 0.928488    
## bad                   -0.0105836  0.0141975  -0.745 0.456007    
## badewanne             -0.0051111  0.0229772  -0.222 0.823971    
## befindet              -0.0157357  0.0115442  -1.363 0.172877    
## gelegen               -0.0182134  0.0155811  -1.169 0.242445    
## gemütlichen           -0.0492232  0.0250931  -1.962 0.049825 *  
## gibt                  -0.0063967  0.0123702  -0.517 0.605091    
## große                  0.0852306  0.0147256   5.788 7.27e-09 ***
## großem                -0.0341203  0.0272284  -1.253 0.210184    
## gut                   -0.0184871  0.0157399  -1.175 0.240197    
## küche                 -0.0209072  0.0124514  -1.679 0.093152 .  
## lädt                   0.0080828  0.0371274   0.218 0.827663    
## moderne                0.1096874  0.0224613   4.883 1.05e-06 ***
## schlafzimmer           0.1326854  0.0149071   8.901  < 2e-16 ***
## schönes               -0.0545600  0.0172711  -3.159 0.001586 ** 
## terrasse               0.0870699  0.0225338   3.864 0.000112 ***
## vorhanden             -0.0066288  0.0176891  -0.375 0.707860    
## wlan                   0.0272998  0.0187246   1.458 0.144872    
## wohnung                0.0421703  0.0056075   7.520 5.78e-14 ***
## wohnzimmer             0.0155356  0.0151299   1.027 0.304523    
## apartment              0.0746373  0.0056505  13.209  < 2e-16 ***
## around                -0.0171555  0.0217623  -0.788 0.430526    
## balcony                0.0064247  0.0118593   0.542 0.588003    
## bed                    0.0147667  0.0129647   1.139 0.254727    
## bright                -0.0260227  0.0130274  -1.998 0.045785 *  
## central                0.0253044  0.0099153   2.552 0.010718 *  
## city                   0.0265573  0.0091792   2.893 0.003819 ** 
## corner                 0.0314928  0.0287824   1.094 0.273899    
## cozy                  -0.0605017  0.0109373  -5.532 3.23e-08 ***
## distance               0.0221905  0.0265621   0.835 0.403495    
## district              -0.0419405  0.0256535  -1.635 0.102095    
## equipped              -0.0099017  0.0242858  -0.408 0.683488    
## famous                 0.0190003  0.0236676   0.803 0.422103    
## fully                  0.0060799  0.0240739   0.253 0.800619    
## furnished              0.0069534  0.0228430   0.304 0.760826    
## great                 -0.0182561  0.0153623  -1.188 0.234708    
## hamburg               -0.0101505  0.0110159  -0.921 0.356837    
## house                 -0.0119051  0.0219289  -0.543 0.587210    
## little                -0.0225041  0.0232835  -0.967 0.333798    
## middle                 0.0231360  0.0274956   0.841 0.400113    
## min                   -0.0272186  0.0050563  -5.383 7.43e-08 ***
## neustadt              -0.1168588  0.0202939  -5.758 8.66e-09 ***
## one                    0.0259935  0.0117029   2.221 0.026358 *  
## public                 0.0030808  0.0300647   0.102 0.918382    
## quiet                 -0.0333059  0.0117427  -2.836 0.004570 ** 
## reeperbahn             0.0350781  0.0254327   1.379 0.167838    
## shops                 -0.0062047  0.0190579  -0.326 0.744754    
## side                   0.0601929  0.0342783   1.756 0.079108 .  
## transport             -0.0214931  0.0312633  -0.687 0.491787    
## walking                0.0169308  0.0231493   0.731 0.464563    
## wifi                   0.0281592  0.0184567   1.526 0.127107    
## bahn                   0.0012143  0.0076904   0.158 0.874545    
## balkon                 0.0142836  0.0100466   1.422 0.155127    
## bars                   0.0001337  0.0125924   0.011 0.991532    
## befinden              -0.0195278  0.0217744  -0.897 0.369826    
## braucht               -0.0320069  0.0230030  -1.391 0.164119    
## bus                   -0.0065785  0.0116846  -0.563 0.573443    
## bäcker                -0.0166970  0.0345527  -0.483 0.628937    
## ebenfalls              0.0635662  0.0264886   2.400 0.016418 *  
## eingerichtet           0.0413982  0.0217465   1.904 0.056973 .  
## fuß                    0.0210797  0.0128362   1.642 0.100567    
## garten                 0.0209353  0.0156985   1.334 0.182360    
## gemütliche            -0.0343760  0.0126727  -2.713 0.006683 ** 
## helle                  0.0116063  0.0153373   0.757 0.449220    
## innenhof               0.0423675  0.0299923   1.413 0.157790    
## kleine                -0.0684600  0.0169583  -4.037 5.44e-05 ***
## lage                  -0.0189328  0.0120287  -1.574 0.115517    
## liebe                 -0.0727968  0.0308751  -2.358 0.018397 *  
## nähe                  -0.0364430  0.0111999  -3.254 0.001141 ** 
## restaurant             0.0927973  0.0314932   2.947 0.003218 ** 
## rewe                   0.0008176  0.0342977   0.024 0.980982    
## ruhig                 -0.0060956  0.0136155  -0.448 0.654378    
## schöne                 0.0400002  0.0133553   2.995 0.002748 ** 
## sowie                 -0.0207920  0.0132880  -1.565 0.117670    
## super                 -0.0118280  0.0142130  -0.832 0.405313    
## top                    0.0280171  0.0168849   1.659 0.097076 .  
## tram                   0.0017737  0.0148211   0.120 0.904744    
## unmittelbarer          0.0569164  0.0209369   2.718 0.006566 ** 
## verkehrsmittel         0.0286939  0.0368293   0.779 0.435931    
## viele                 -0.0123045  0.0192811  -0.638 0.523376    
## zentraler              0.0287066  0.0240912   1.192 0.233445    
## öffentlichen          -0.0420288  0.0336979  -1.247 0.212335    
## ausstattung            0.1406991  0.0306480   4.591 4.45e-06 ***
## bietet                 0.0038632  0.0165232   0.234 0.815140    
## blick                  0.0256030  0.0192029   1.333 0.182458    
## cafes                  0.0160609  0.0181911   0.883 0.377305    
## gemütlichkeit         -0.0049912  0.0394485  -0.127 0.899319    
## mitten                 0.0575851  0.0138532   4.157 3.25e-05 ***
## perfekt                0.0194842  0.0217817   0.895 0.371058    
## restaurants            0.0549903  0.0115340   4.768 1.88e-06 ***
## ruhe                  -0.0392606  0.0342331  -1.147 0.251458    
## schönen               -0.0354990  0.0163021  -2.178 0.029453 *  
## schönsten              0.0205983  0.0347993   0.592 0.553916    
## stadt                  0.0049865  0.0168819   0.295 0.767713    
## stock                 -0.0243093  0.0229297  -1.060 0.289087    
## doppelbett             0.0057392  0.0180476   0.318 0.750488    
## matratze              -0.0504601  0.0363941  -1.386 0.165618    
## messe                  0.0872943  0.0118760   7.350 2.08e-13 ***
## oktoberfest            0.2674595  0.0119351  22.410  < 2e-16 ***
## personen               0.0360438  0.0131007   2.751 0.005943 ** 
## stadtzentrum          -0.0116049  0.0254652  -0.456 0.648602    
## zimmer                -0.1084624  0.0059340 -18.278  < 2e-16 ***
## beautiful              0.0068823  0.0117726   0.585 0.558820    
## berlin                -0.0221973  0.0063600  -3.490 0.000484 ***
## big                    0.0211750  0.0113727   1.862 0.062635 .  
## cool                   0.0201674  0.0350777   0.575 0.565343    
## diverse               -0.0713262  0.0351468  -2.029 0.042438 *  
## everything             0.0125282  0.0218337   0.574 0.566110    
## friedrichshain        -0.0232669  0.0135908  -1.712 0.086925 .  
## just                  -0.0063141  0.0131233  -0.481 0.630429    
## kreuzberg              0.0348044  0.0117793   2.955 0.003135 ** 
## meters                 0.0268615  0.0288529   0.931 0.351880    
## minutes               -0.0353404  0.0095027  -3.719 0.000201 ***
## mitte                  0.1026752  0.0124148   8.270  < 2e-16 ***
## nearby                -0.0327500  0.0263133  -1.245 0.213293    
## need                  -0.0211064  0.0222383  -0.949 0.342584    
## neighbourhood          0.0044518  0.0293598   0.152 0.879483    
## situated              -0.0134587  0.0284715  -0.473 0.636430    
## stations               0.0198568  0.0266727   0.744 0.456610    
## stay                  -0.0004650  0.0174833  -0.027 0.978782    
## straße                -0.0409802  0.0183789  -2.230 0.025779 *  
## subway                 0.0492660  0.0194151   2.538 0.011175 *  
## tor                    0.0832458  0.0241670   3.445 0.000573 ***
## außerdem              -0.0060825  0.0288474  -0.211 0.833007    
## gemütliches           -0.1132463  0.0154768  -7.317 2.66e-13 ***
## gerne                 -0.0391902  0.0184674  -2.122 0.033843 *  
## parks                  0.0029712  0.0199765   0.149 0.881763    
## spree                 -0.0241459  0.0305209  -0.791 0.428882    
## vermiete              -0.0325655  0.0236561  -1.377 0.168650    
## viertel               -0.0004609  0.0204736  -0.023 0.982039    
## zuhause                0.0468754  0.0318377   1.472 0.140954    
## zwei                   0.0239291  0.0131607   1.818 0.069049 .  
## bedroom                0.0101992  0.0146708   0.695 0.486941    
## direkter              -0.0037489  0.0281091  -0.133 0.893903    
## garden                 0.0986613  0.0210609   4.685 2.83e-06 ***
## hinterhof             -0.0326399  0.0303210  -1.076 0.281731    
## ruhigen               -0.0212747  0.0188476  -1.129 0.259010    
## verfügt                0.0085270  0.0204054   0.418 0.676039    
## waschmaschine          0.0143476  0.0229663   0.625 0.532161    
## wunderschönen          0.0560929  0.0319584   1.755 0.079248 .  
## zentral                0.0290875  0.0120755   2.409 0.016017 *  
## zugang                 0.0117976  0.0290911   0.406 0.685088    
## appartement            0.0346240  0.0171681   2.017 0.043738 *  
## away                  -0.0088577  0.0126724  -0.699 0.484579    
## door                  -0.0065333  0.0299309  -0.218 0.827216    
## front                 -0.0149020  0.0313047  -0.476 0.634060    
## good                  -0.0367215  0.0197328  -1.861 0.062772 .  
## kitchen                0.0133101  0.0160066   0.832 0.405684    
## nice                  -0.0314263  0.0111012  -2.831 0.004648 ** 
## really                -0.0128688  0.0291311  -0.442 0.658673    
## relax                  0.0492222  0.0386989   1.272 0.203419    
## room                  -0.1246793  0.0064275 -19.398  < 2e-16 ***
## see                   -0.0652959  0.0294538  -2.217 0.026646 *  
## small                 -0.0659632  0.0174887  -3.772 0.000163 ***
## sqm                    0.0605256  0.0218520   2.770 0.005616 ** 
## takes                 -0.0726685  0.0337726  -2.152 0.031437 *  
## underground            0.0486573  0.0284827   1.708 0.087599 .  
## walk                   0.0182623  0.0124433   1.468 0.142223    
## appartment             0.0650321  0.0140900   4.615 3.95e-06 ***
## entfernt              -0.0148852  0.0116717  -1.275 0.202215    
## fernseher             -0.0171077  0.0326043  -0.525 0.599796    
## nah                   -0.0763087  0.0267219  -2.856 0.004301 ** 
## people                 0.0256275  0.0162798   1.574 0.115465    
## ruhige                -0.0254296  0.0167809  -1.515 0.129693    
## cafés                  0.0264870  0.0162035   1.635 0.102145    
## ecke                   0.0197859  0.0215652   0.917 0.358899    
## markt                  0.0878370  0.0286004   3.071 0.002136 ** 
## prenzlauer             0.0408291  0.0354271   1.152 0.249142    
## strasse               -0.0116010  0.0309751  -0.375 0.708019    
## unsere                 0.0174291  0.0154623   1.127 0.259676    
## zentrale               0.0826315  0.0204897   4.033 5.54e-05 ***
## internet              -0.0118211  0.0231573  -0.510 0.609729    
## modernes               0.0148115  0.0294731   0.503 0.615294    
## dass                   0.0120850  0.0309205   0.391 0.695919    
## deutz                  0.0210487  0.0286016   0.736 0.461786    
## herzen                 0.0893597  0.0132752   6.731 1.74e-11 ***
## köln                  -0.0277469  0.0131336  -2.113 0.034646 *  
## nahe                  -0.0396290  0.0211971  -1.870 0.061566 .  
## access                 0.0290595  0.0243583   1.193 0.232888    
## airport               -0.0568397  0.0200457  -2.836 0.004582 ** 
## center                 0.0200194  0.0121686   1.645 0.099955 .  
## easy                  -0.0394133  0.0275989  -1.428 0.153291    
## floor                  0.0089184  0.0174592   0.511 0.609489    
## located               -0.0051653  0.0118827  -0.435 0.663793    
## location               0.0127368  0.0155421   0.820 0.412513    
## lovely                 0.0064576  0.0173651   0.372 0.709993    
## minute                -0.0114820  0.0206661  -0.556 0.578496    
## park                   0.0040679  0.0127668   0.319 0.750011    
## supermarkets          -0.0811938  0.0263650  -3.080 0.002077 ** 
## view                   0.0122063  0.0191641   0.637 0.524177    
## alster                 0.0524945  0.0196509   2.671 0.007563 ** 
## altbau                 0.0206829  0.0145884   1.418 0.156281    
## altbauwohnung          0.0761499  0.0137577   5.535 3.16e-08 ***
## charming               0.0821146  0.0207380   3.960 7.54e-05 ***
## comfortable           -0.0396128  0.0175003  -2.264 0.023616 *  
## eimsbüttel             0.0171372  0.0196223   0.873 0.382486    
## flat                   0.0359521  0.0076212   4.717 2.41e-06 ***
## floors                 0.0486565  0.0424595   1.146 0.251834    
## heart                  0.0355267  0.0129317   2.747 0.006017 ** 
## many                   0.0172921  0.0212225   0.815 0.415199    
## offers                 0.0151267  0.0328505   0.460 0.645186    
## part                  -0.0159139  0.0303145  -0.525 0.599618    
## popular               -0.0036247  0.0363784  -0.100 0.920632    
## wooden                -0.0791158  0.0421691  -1.876 0.060654 .  
## anbindung             -0.0163663  0.0191839  -0.853 0.393603    
## auto                   0.0621589  0.0329836   1.885 0.059512 .  
## berlins                0.0112818  0.0229913   0.491 0.623647    
## bettwäsche             0.0184373  0.0385309   0.479 0.632296    
## direkt                 0.0041100  0.0102488   0.401 0.688407    
## erkunden              -0.0190442  0.0322385  -0.591 0.554713    
## erreichen              0.0137518  0.0143984   0.955 0.339547    
## fenster               -0.0266810  0.0268862  -0.992 0.321034    
## freuen                -0.0504052  0.0307414  -1.640 0.101099    
## fuss                   0.0500963  0.0245417   2.041 0.041242 *  
## handtücher            -0.0326666  0.0375168  -0.871 0.383921    
## haustür                0.0185165  0.0254601   0.727 0.467068    
## kleiderschrank        -0.0222209  0.0385094  -0.577 0.563931    
## komplett               0.0251911  0.0224358   1.123 0.261538    
## kurfürstendamm         0.0501545  0.0301395   1.664 0.096118 .  
## kühlschrank           -0.0590572  0.0305437  -1.934 0.053190 .  
## liegt                 -0.0175621  0.0109397  -1.605 0.108436    
## meter                 -0.0038408  0.0206467  -0.186 0.852428    
## minuten               -0.0210673  0.0079275  -2.657 0.007881 ** 
## nachbarschaft          0.0430704  0.0287822   1.496 0.134565    
## neu                    0.0544815  0.0244485   2.228 0.025868 *  
## platz                  0.0181049  0.0130044   1.392 0.163880    
## potsdamer              0.0484570  0.0279157   1.736 0.082614 .  
## private                0.0085439  0.0151098   0.565 0.571774    
## schreibtisch          -0.0937358  0.0297856  -3.147 0.001653 ** 
## sehenswürdigkeiten     0.0893435  0.0296974   3.008 0.002630 ** 
## station               -0.0340410  0.0106558  -3.195 0.001403 ** 
## verkehrsanbindung     -0.0603553  0.0290956  -2.074 0.038062 *  
## wenige                -0.0111087  0.0273491  -0.406 0.684615    
## wenigen               -0.0379286  0.0277297  -1.368 0.171395    
## wohnen                 0.0373002  0.0207518   1.797 0.072285 .  
## einkaufsmöglichkeiten -0.0373260  0.0174262  -2.142 0.032214 *  
## erreichbar            -0.0154662  0.0156404  -0.989 0.322745    
## fußläufig             -0.0080428  0.0204400  -0.393 0.693967    
## großes                -0.0315261  0.0158179  -1.993 0.046273 *  
## gute                  -0.0125170  0.0205344  -0.610 0.542162    
## guter                 -0.0378440  0.0324059  -1.168 0.242902    
## sofa                  -0.0328983  0.0191785  -1.715 0.086297 .  
## stadtpark             -0.0802948  0.0249240  -3.222 0.001278 ** 
## weitere               -0.0405505  0.0326688  -1.241 0.214529    
## hell                   0.0020565  0.0216266   0.095 0.924242    
## umgebung               0.0043583  0.0201797   0.216 0.829010    
## bahnhof               -0.0295209  0.0187778  -1.572 0.115944    
## etage                 -0.0095915  0.0308690  -0.311 0.756021    
## `m²`                   0.0231938  0.0149752   1.549 0.121449    
## paar                  -0.0344290  0.0320544  -1.074 0.282804    
## stehen                 0.0655138  0.0365249   1.794 0.072885 .  
## verfügung             -0.0615230  0.0265053  -2.321 0.020292 *  
## calm                  -0.0273191  0.0263244  -1.038 0.299388    
## centre                -0.0199528  0.0181190  -1.101 0.270824    
## get                   -0.0081989  0.0226305  -0.362 0.717137    
## single                -0.0892672  0.0277784  -3.214 0.001314 ** 
## altstadt               0.0641370  0.0244432   2.624 0.008701 ** 
## aufzug                 0.0980977  0.0378626   2.591 0.009582 ** 
## bett                  -0.0274573  0.0150312  -1.827 0.067768 .  
## cologne               -0.0089123  0.0163580  -0.545 0.585881    
## dom                    0.0032825  0.0255171   0.129 0.897644    
## geeignet               0.0127017  0.0219490   0.579 0.562804    
## hauptbahnhof           0.0023639  0.0156212   0.151 0.879722    
## hohe                   0.0639593  0.0490541   1.304 0.192304    
## kölner                -0.0084563  0.0221481  -0.382 0.702611    
## lan                   -0.0296429  0.0315850  -0.939 0.347997    
## max                    0.0599493  0.0270813   2.214 0.026866 *  
## neben                 -0.0006954  0.0319421  -0.022 0.982632    
## studio                -0.0073350  0.0129771  -0.565 0.571931    
## hamburgs               0.0065965  0.0263963   0.250 0.802665    
## schnell               -0.0208147  0.0224965  -0.925 0.354855    
## zentrum                0.0305566  0.0153294   1.993 0.046242 *  
## willkommen             0.0339360  0.0259116   1.310 0.190323    
## cosy                  -0.0383129  0.0125535  -3.052 0.002277 ** 
## `next`                 0.0203381  0.0183290   1.110 0.267182    
## schöneberg            -0.0713584  0.0233465  -3.056 0.002243 ** 
## take                   0.0097825  0.0282978   0.346 0.729574    
## gelegene              -0.0238740  0.0327496  -0.729 0.466023    
## mehr                  -0.0403367  0.0345839  -1.166 0.243494    
## modern                 0.0382764  0.0136429   2.806 0.005029 ** 
## unterkunft            -0.0068480  0.0134993  -0.507 0.611965    
## abenteurer            -0.0889668  0.0368628  -2.413 0.015814 *  
## adventurers            0.0168575  0.0694009   0.243 0.808086    
## alleinreisende        -0.1721441  0.0353601  -4.868 1.14e-06 ***
## amp                   -0.0041262  0.0079916  -0.516 0.605637    
## business               0.0378505  0.0319553   1.184 0.236241    
## couples               -0.0285087  0.0333772  -0.854 0.393042    
## geschäftsreisende      0.0769661  0.0276406   2.785 0.005367 ** 
## paare                  0.0655560  0.0269033   2.437 0.014833 *  
## solo                  -0.1037931  0.0654986  -1.585 0.113065    
## night                 -0.0030257  0.0299931  -0.101 0.919647    
## perfect                0.0002585  0.0168008   0.015 0.987726    
## gemütlich             -0.0467695  0.0201436  -2.322 0.020257 *  
## hbf                   -0.0367380  0.0199481  -1.842 0.065542 .  
## bahnstation           -0.0510878  0.0316898  -1.612 0.106957    
## bushaltestelle         0.0069445  0.0354083   0.196 0.844515    
## stadtteil             -0.0141114  0.0223330  -0.632 0.527487    
## west                   0.0060729  0.0221397   0.274 0.783859    
## öffentliche           -0.0093875  0.0380115  -0.247 0.804938    
## bathroom              -0.0361073  0.0173824  -2.077 0.037797 *  
## clubs                 -0.0479126  0.0216011  -2.218 0.026566 *  
## right                  0.0223583  0.0187445   1.193 0.232970    
## spacious               0.0429690  0.0149517   2.874 0.004061 ** 
## area                   0.0043708  0.0135300   0.323 0.746665    
## find                   0.0080154  0.0221917   0.361 0.717962    
## hip                    0.0060453  0.0330285   0.183 0.854775    
## quite                 -0.0529064  0.0267448  -1.978 0.047926 *  
## sunny                  0.0123350  0.0164643   0.749 0.453751    
## transportation        -0.0658683  0.0363833  -1.810 0.070254 .  
## etc                   -0.0367728  0.0190565  -1.930 0.053666 .  
## gehminuten             0.0462673  0.0168431   2.747 0.006022 ** 
## geschäfte             -0.0328735  0.0307713  -1.068 0.285395    
## kleiner               -0.0283936  0.0325822  -0.871 0.383525    
## kneipen               -0.0186916  0.0306086  -0.611 0.541430    
## ausgestattete          0.0521978  0.0294744   1.771 0.076589 .  
## voll                  -0.0226327  0.0232909  -0.972 0.331195    
## berg                  -0.0174849  0.0337473  -0.518 0.604387    
## close                 -0.0190916  0.0111876  -1.706 0.087936 .  
## large                  0.0550549  0.0172666   3.189 0.001433 ** 
## rooms                  0.0889164  0.0176288   5.044 4.62e-07 ***
## two                    0.0089165  0.0137339   0.649 0.516201    
## dennoch                0.0406930  0.0322213   1.263 0.206638    
## essen                  0.0167992  0.0357171   0.470 0.638117    
## findet                 0.0062441  0.0305108   0.205 0.837847    
## tolle                  0.0824318  0.0259009   3.183 0.001463 ** 
## besteht               -0.0257565  0.0373742  -0.689 0.490739    
## sonnige                0.0744645  0.0330986   2.250 0.024477 *  
## wedding               -0.1515361  0.0255795  -5.924 3.21e-09 ***
## centrally              0.0186651  0.0329933   0.566 0.571589    
## design                 0.1096467  0.0221811   4.943 7.77e-07 ***
## foot                   0.0580122  0.0299922   1.934 0.053103 .  
## metro                 -0.0124265  0.0187982  -0.661 0.508592    
## near                  -0.0719854  0.0140390  -5.128 2.97e-07 ***
## please                 0.0178387  0.0228277   0.781 0.434551    
## renovated              0.0413242  0.0245117   1.686 0.091836 .  
## towels                 0.0230894  0.0331628   0.696 0.486286    
## families               0.2178464  0.0385186   5.656 1.58e-08 ***
## drei                   0.0635669  0.0295109   2.154 0.031255 *  
## erreicht              -0.0171078  0.0247659  -0.691 0.489714    
## familien               0.1919201  0.0295049   6.505 8.03e-11 ***
## family                 0.1775234  0.0314141   5.651 1.62e-08 ***
## ottensen               0.0160634  0.0220072   0.730 0.465453    
## schön                 -0.0024359  0.0307289  -0.079 0.936818    
## innenstadt             0.0293021  0.0156434   1.873 0.061070 .  
## kindern                0.0837906  0.0367161   2.282 0.022496 *  
## spülmaschine           0.0748320  0.0391964   1.909 0.056261 .  
## place                  0.0009569  0.0129594   0.074 0.941142    
## plenty                -0.0222648  0.0344811  -0.646 0.518478    
## eigenem                0.0273029  0.0314128   0.869 0.384771    
## haus                   0.0236429  0.0185287   1.276 0.201969    
## green                 -0.0639329  0.0244850  -2.611 0.009034 ** 
## neukölln              -0.0498698  0.0122227  -4.080 4.53e-05 ***
## open                   0.0056514  0.0283148   0.200 0.841802    
## shower                 0.0146515  0.0277023   0.529 0.596889    
## space                 -0.0340906  0.0202161  -1.686 0.091757 .  
## well                  -0.0473388  0.0157506  -3.006 0.002656 ** 
## badezimmer            -0.0078532  0.0216824  -0.362 0.717213    
## groß                  -0.0448007  0.0240030  -1.866 0.061997 .  
## haltestelle           -0.0191489  0.0264057  -0.725 0.468353    
## kommt                  0.0437180  0.0306922   1.424 0.154352    
## rhein                  0.0195469  0.0297851   0.656 0.511665    
## zimmerwohnung          0.0314864  0.0245628   1.282 0.199907    
## building               0.0330484  0.0215751   1.532 0.125597    
## ferienwohnung          0.0935400  0.0214440   4.362 1.30e-05 ***
## grünen                -0.0905092  0.0241958  -3.741 0.000184 ***
## inklusive              0.0340858  0.0330113   1.033 0.301831    
## munich                 0.0835892  0.0129136   6.473 9.91e-11 ***
## münchen                0.0825240  0.0172882   4.773 1.83e-06 ***
## terrace                0.1067716  0.0233735   4.568 4.96e-06 ***
## verkehrsmitteln        0.0334585  0.0427257   0.783 0.433581    
## zahlreiche             0.0334719  0.0321144   1.042 0.297304    
## genießen               0.0333937  0.0348082   0.959 0.337392    
## immer                  0.0080604  0.0322288   0.250 0.802515    
## inkl                   0.0082213  0.0284506   0.289 0.772610    
## leben                 -0.0073690  0.0289317  -0.255 0.798956    
## lieben                 0.0176288  0.0351218   0.502 0.615723    
## connection            -0.0385095  0.0310545  -1.240 0.214972    
## loft                   0.1834140  0.0183816   9.978  < 2e-16 ***
## places                -0.0154866  0.0313826  -0.493 0.621683    
## welcome               -0.0715157  0.0191831  -3.728 0.000194 ***
## altona                -0.0060721  0.0196414  -0.309 0.757215    
## hafen                  0.0150357  0.0303043   0.496 0.619791    
## kleines               -0.1022711  0.0235241  -4.348 1.39e-05 ***
## pauli                  0.0415730  0.0173740   2.393 0.016732 *  
## can                   -0.0050004  0.0116228  -0.430 0.667040    
## charlottenburg        -0.0795109  0.0206600  -3.849 0.000119 ***
## coffee                 0.0646154  0.0306576   2.108 0.035079 *  
## double                -0.0039472  0.0196940  -0.200 0.841149    
## home                   0.0013945  0.0161026   0.087 0.930992    
## light                 -0.0219643  0.0231560  -0.949 0.342874    
## living                 0.1081499  0.0154779   6.987 2.92e-12 ***
## lots                   0.0197867  0.0257619   0.768 0.442464    
## love                   0.0040388  0.0244562   0.165 0.868833    
## machine                0.0494412  0.0442931   1.116 0.264343    
## market                 0.0551051  0.0342465   1.609 0.107621    
## new                    0.0561093  0.0197141   2.846 0.004431 ** 
## person                -0.0697206  0.0211931  -3.290 0.001005 ** 
## reach                 -0.0454023  0.0243221  -1.867 0.061962 .  
## shopping               0.0001910  0.0203827   0.009 0.992525    
## sleeping               0.0796198  0.0272577   2.921 0.003494 ** 
## street                 0.0014187  0.0187344   0.076 0.939638    
## washing               -0.1013254  0.0497157  -2.038 0.041558 *  
## comfy                 -0.1255172  0.0305184  -4.113 3.93e-05 ***
## entspannen             0.0147078  0.0322403   0.456 0.648257    
## kiez                  -0.0311071  0.0217257  -1.432 0.152220    
## seid                   0.0235401  0.0276767   0.851 0.395039    
## einfach               -0.0530529  0.0287035  -1.848 0.064578 .  
## geräumige              0.0349930  0.0349642   1.001 0.316929    
## supermarkt            -0.0221015  0.0277757  -0.796 0.426211    
## schwabing              0.1033279  0.0211311   4.890 1.02e-06 ***
## perfekte               0.0321097  0.0317770   1.010 0.312287    
## ausblick               0.0551619  0.0296221   1.862 0.062596 .  
## couch                 -0.0712591  0.0183138  -3.891 0.000100 ***
## hallo                 -0.0194405  0.0290184  -0.670 0.502908    
## neighborhood           0.0148674  0.0219497   0.677 0.498201    
## tür                    0.0353191  0.0189795   1.861 0.062777 .  
## wunderschöne           0.0826886  0.0251620   3.286 0.001018 ** 
## style                  0.0320481  0.0318602   1.006 0.314480    
## couple                 0.0036270  0.0340901   0.106 0.915271    
## feel                  -0.0021733  0.0268665  -0.081 0.935529    
## stylish                0.1152361  0.0289179   3.985 6.78e-05 ***
## supermärkte           -0.0331059  0.0249055  -1.329 0.183782    
## best                   0.0786466  0.0198424   3.964 7.42e-05 ***
## hamburger              0.0074455  0.0236326   0.315 0.752728    
## bitte                  0.0390429  0.0238507   1.637 0.101657    
## eingerichtete          0.0512278  0.0281995   1.817 0.069295 .  
## elbe                  -0.0041152  0.0234557  -0.175 0.860733    
## flur                  -0.0310269  0.0330494  -0.939 0.347846    
## genutzt               -0.0409369  0.0300114  -1.364 0.172574    
## großer                 0.0224288  0.0251077   0.893 0.371710    
## herzlich              -0.0487032  0.0382508  -1.273 0.202946    
## leute                 -0.0178801  0.0315529  -0.567 0.570947    
## liebevoll             -0.0135968  0.0342997  -0.396 0.691807    
## schlafen               0.0244289  0.0311726   0.784 0.433249    
## tollen                -0.0020099  0.0357569  -0.056 0.955174    
## vielen                 0.0360216  0.0235108   1.532 0.125513    
## wohnküche              0.0110719  0.0274855   0.403 0.687081    
## aufenthalt             0.0058531  0.0271748   0.215 0.829468    
## like                  -0.0188466  0.0219959  -0.857 0.391558    
## live                  -0.0359883  0.0261743  -1.375 0.169169    
## separate               0.0073936  0.0280202   0.264 0.791886    
## clean                 -0.0223431  0.0235213  -0.950 0.342174    
## dusche                 0.0040944  0.0217994   0.188 0.851018    
## helles                -0.0527355  0.0188620  -2.796 0.005183 ** 
## maisonette             0.1363753  0.0258500   5.276 1.34e-07 ***
## schlafsofa             0.0341525  0.0288134   1.185 0.235918    
## zentralen              0.0220100  0.0341652   0.644 0.519440    
## ideal                 -0.0011295  0.0192521  -0.059 0.953215    
## raum                  -0.0609058  0.0262473  -2.320 0.020329 *  
## amazing                0.0864382  0.0322674   2.679 0.007397 ** 
## marienplatz            0.0858374  0.0211980   4.049 5.16e-05 ***
## mins                  -0.0465658  0.0171212  -2.720 0.006540 ** 
## stops                  0.0097886  0.0312474   0.313 0.754087    
## travelers             -0.0320675  0.0403124  -0.795 0.426350    
## zeit                  -0.0190542  0.0276181  -0.690 0.490258    
## fair                   0.0959029  0.0221523   4.329 1.51e-05 ***
## parkplatz              0.0475967  0.0349451   1.362 0.173206    
## fährt                 -0.0848064  0.0327125  -2.592 0.009538 ** 
## unserer               -0.0019063  0.0249059  -0.077 0.938991    
## wohn                   0.0148033  0.0280941   0.527 0.598258    
## ubahn                 -0.0054081  0.0237566  -0.228 0.819925    
## wegen                 -0.0294263  0.0373734  -0.787 0.431084    
## fußweg                 0.0249783  0.0206861   1.207 0.227262    
## straßenbahn           -0.0075588  0.0312250  -0.242 0.808726    
## charmante              0.0519098  0.0303902   1.708 0.087637 .  
## sternschanze           0.0174594  0.0268229   0.651 0.515112    
## kaffee                 0.0118242  0.0378447   0.312 0.754711    
## apt                    0.1550890  0.0247717   6.261 3.94e-10 ***
## old                   -0.0316271  0.0217917  -1.451 0.146707    
## schanze                0.0203551  0.0227707   0.894 0.371382    
## szeneviertel           0.0053815  0.0344135   0.156 0.875738    
## trotzdem               0.0149152  0.0292087   0.511 0.609609    
## biete                 -0.0554153  0.0281747  -1.967 0.049220 *  
## großen                 0.0169452  0.0182168   0.930 0.352283    
## beliebten              0.0070505  0.0297871   0.237 0.812895    
## schlafcouch           -0.0393363  0.0248342  -1.584 0.113224    
## whg                    0.0858197  0.0250940   3.420 0.000628 ***
## dining                 0.1040366  0.0377715   2.754 0.005888 ** 
## persons                0.0121056  0.0290650   0.417 0.677049    
## supermarket           -0.0903169  0.0312759  -2.888 0.003886 ** 
## table                 -0.0388291  0.0350301  -1.108 0.267686    
## within                -0.0052036  0.0205301  -0.253 0.799916    
## art                    0.0485094  0.0277438   1.748 0.080402 .  
## läden                 -0.0693200  0.0372139  -1.863 0.062518 .  
## freunde                0.0018007  0.0373162   0.048 0.961513    
## gegenüber              0.0884683  0.0316014   2.800 0.005125 ** 
## könnt                 -0.0073215  0.0251126  -0.292 0.770637    
## decken                -0.0096048  0.0478037  -0.201 0.840763    
## hohen                  0.0082189  0.0493542   0.167 0.867742    
## east                  -0.0589399  0.0365483  -1.613 0.106841    
## alexanderplatz        -0.0013642  0.0173007  -0.079 0.937151    
## sleep                 -0.0822790  0.0305517  -2.693 0.007087 ** 
## gegend                -0.0830992  0.0351326  -2.365 0.018028 *  
## nette                 -0.0502778  0.0297856  -1.688 0.091434 .  
## str                    0.0480034  0.0303332   1.583 0.113548    
## lot                    0.0094123  0.0249370   0.377 0.705849    
## vermieten              0.0508696  0.0262474   1.938 0.052632 .  
## linie                 -0.0360552  0.0277221  -1.301 0.193419    
## time                  -0.0724765  0.0237121  -3.057 0.002243 ** 
## flughafen             -0.0189658  0.0206114  -0.920 0.357502    
## tegel                 -0.0740299  0.0355558  -2.082 0.037353 *  
## parkplätze            -0.0497401  0.0359865  -1.382 0.166934    
## ruhiger               -0.0434152  0.0280546  -1.548 0.121758    
## kleinen               -0.0531419  0.0223141  -2.382 0.017253 *  
## bieten                -0.0011059  0.0262972  -0.042 0.966457    
## möglich                0.0097431  0.0298723   0.326 0.744308    
## windows               -0.0252808  0.0334982  -0.755 0.450446    
## gäste                  0.0386968  0.0204940   1.888 0.059018 .  
## weit                  -0.0869000  0.0351455  -2.473 0.013425 *  
## offer                 -0.0200518  0.0267544  -0.749 0.453582    
## share                 -0.1108477  0.0292826  -3.785 0.000154 ***
## bedrooms               0.3374616  0.0315470  10.697  < 2e-16 ***
## ruhiges               -0.0618790  0.0235110  -2.632 0.008499 ** 
## tag                    0.0209486  0.0324791   0.645 0.518945    
## free                   0.0102511  0.0212293   0.483 0.629191    
## full                  -0.0259025  0.0264483  -0.979 0.327416    
## berliner               0.0158723  0.0211752   0.750 0.453527    
## guest                  0.0389975  0.0307485   1.268 0.204720    
## huge                   0.0501205  0.0251305   1.994 0.046126 *  
## nachtleben            -0.0265521  0.0361239  -0.735 0.462334    
## kultur                -0.0036039  0.0381981  -0.094 0.924834    
## wohlfühlen            -0.0158425  0.0321484  -0.493 0.622165    
## stationen             -0.0283611  0.0252611  -1.123 0.261575    
## available              0.0176864  0.0242992   0.728 0.466711    
## lively                 0.0519310  0.0348526   1.490 0.136241    
## renting                0.0223174  0.0348973   0.640 0.522497    
## use                    0.0007842  0.0225515   0.035 0.972262    
## isar                   0.0291499  0.0234130   1.245 0.213141    
## parking                0.0127584  0.0311613   0.409 0.682230    
## main                   0.0095301  0.0196461   0.485 0.627622    
## enjoy                 -0.0164929  0.0209865  -0.786 0.431950    
## want                  -0.0007066  0.0279661  -0.025 0.979843    
## desk                  -0.1243197  0.0367558  -3.382 0.000721 ***
## guests                 0.0078682  0.0256251   0.307 0.758809    
## shared                -0.1426690  0.0203152  -7.023 2.27e-12 ***
## frisch                -0.0333642  0.0346415  -0.963 0.335500    
## renoviert             -0.0094209  0.0381375  -0.247 0.804893    
## mauerpark             -0.0294266  0.0259880  -1.132 0.257520    
## ceilings               0.0061717  0.0404690   0.153 0.878791    
## high                   0.0718792  0.0272675   2.636 0.008396 ** 
## finden                 0.0021926  0.0313633   0.070 0.944267    
## steht                  0.0302083  0.0300388   1.006 0.314604    
## explore               -0.0657959  0.0347022  -1.896 0.057977 .  
## downtown               0.0469254  0.0330444   1.420 0.155608    
## happy                  0.0655928  0.0353183   1.857 0.063304 .  
## river                  0.0017274  0.0273660   0.063 0.949671    
## extra                  0.0233837  0.0288801   0.810 0.418136    
## bathtub                0.0358605  0.0348730   1.028 0.303817    
## trendy                 0.0524130  0.0312031   1.680 0.093029 .  
## day                   -0.0288107  0.0324654  -0.887 0.374864    
## direct                -0.0231888  0.0308943  -0.751 0.452913    
## english               -0.0772021  0.0315277  -2.449 0.014348 *  
## looking               -0.0490750  0.0356830  -1.375 0.169058    
## bequem                 0.0238798  0.0364884   0.654 0.512833    
## ehrenfeld             -0.0326015  0.0189503  -1.720 0.085386 .  
## short                 -0.0781991  0.0359403  -2.176 0.029585 *  
## grüne                 -0.0233250  0.0344461  -0.677 0.498324    
## fast                  -0.0285599  0.0293243  -0.974 0.330106    
## per                   -0.0590277  0.0296820  -1.989 0.046756 *  
## directly              -0.0150191  0.0288145  -0.521 0.602210    
## train                 -0.0271745  0.0194260  -1.399 0.161872    
## bath                   0.0429965  0.0298548   1.440 0.149836    
## esstisch               0.0734508  0.0358507   2.049 0.040499 *  
## gleich                 0.0073776  0.0335368   0.220 0.825885    
## size                  -0.0096066  0.0290500  -0.331 0.740882    
## habt                   0.0119305  0.0337933   0.353 0.724060    
## süd                    0.0226630  0.0323521   0.701 0.483619    
## rent                   0.0174866  0.0267008   0.655 0.512536    
## fragen                -0.0261660  0.0340773  -0.768 0.442593    
## easily                 0.0084806  0.0348679   0.243 0.807837    
## friendly              -0.0757337  0.0280642  -2.699 0.006971 ** 
## allee                 -0.0630236  0.0301682  -2.089 0.036718 *  
## ganz                   0.0021061  0.0237195   0.089 0.929248    
## innerhalb             -0.0481883  0.0288435  -1.671 0.094805 .  
## kannst                -0.0732117  0.0278779  -2.626 0.008644 ** 
## check                 -0.0180616  0.0219445  -0.823 0.410486    
## stop                  -0.0002318  0.0325200  -0.007 0.994313    
## gästezimmer           -0.1661195  0.0232591  -7.142 9.62e-13 ***
## museum                 0.0386519  0.0280490   1.378 0.168220    
## bar                    0.0641080  0.0321427   1.994 0.046117 *  
## frankfurter            0.0249871  0.0335047   0.746 0.455813    
## natürlich             -0.0343196  0.0314490  -1.091 0.275169    
## eigenes               -0.0146029  0.0330543  -0.442 0.658652    
## stuttgart             -0.0633765  0.0204877  -3.093 0.001982 ** 
## frankfurt              0.0048621  0.0142180   0.342 0.732381    
## dresden               -0.1346446  0.0196418  -6.855 7.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4526 on 14880 degrees of freedom
## Multiple R-squared:  0.3557, Adjusted R-squared:  0.3304 
## F-statistic: 14.02 on 586 and 14880 DF,  p-value: < 2.2e-16

F-statistic: I can reject the null hypothesis that all of the regression coefficients are equal to zero (p-value < 0.01).

Adjusted R-squared: About \(33\%\) of the variance of the log price can be explained by our model.
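The adjusted value can be reproduced from the numbers printed in the summary. This is a small sketch, assuming the usual adjustment formula for a model with an intercept; the variable names are only illustrative, and the number of observations is derived from the degrees of freedom rather than read from the data.

```r
# Values copied from the summary output above
r2     <- 0.3557   # Multiple R-squared (rounded as printed)
df_res <- 14880    # residual degrees of freedom
p      <- 586      # number of predictors (numerator df of the F-statistic)

# Observations implied by the degrees of freedom (assumes an intercept)
n <- df_res + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 2)
```

The result is about 0.33, which matches the printed Adjusted R-squared up to the rounding of the inputs.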

Coefficients

The inevitable multicollinearity makes individual parameters difficult to interpret. However, it is still a good exercise to look at the most important coefficients to see if they make intuitive sense in the context of a particular application. “Most important” can be defined in a number of ways. Here, I will rank the estimated coefficients by their absolute value.
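As an illustration of ranking by absolute value, the sketch below uses a handful of rows copied from the coefficient table above (not the full model). Because the response is the log price, exp(estimate) - 1 gives the implied relative price change per additional occurrence of a term. That reading assumes the regressors are raw term counts, which is not verified here.

```r
# A few rows copied from the summary output above (illustrative subset only)
coefs <- data.frame(
  term     = c("shared", "loft", "dresden", "bedrooms", "apt", "maisonette"),
  estimate = c(-0.1426690, 0.1834140, -0.1346446, 0.3374616, 0.1550890, 0.1363753)
)

# Rank by absolute size, largest first
coefs <- coefs[order(-abs(coefs$estimate)), ]

# Implied relative change of the price in percent (response is log price)
coefs$pct_change <- round(100 * (exp(coefs$estimate) - 1), 1)
coefs
```

In this subset, "bedrooms" ranks first, while "shared" and "dresden" show that large negative coefficients rank alongside positive ones.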

The plot below shows all terms with a p-value < 0.01.

library(broom)

p <- tidy(fit.lm) %>%
  filter(p.value < 0.01) %>%
  filter(term != "(Intercept)") %>%
  mutate(pos = factor(ifelse(estimate >= 0, 1, 0))) %>%
  #top_n(20,estimate) %>%
  # term is mapped to x and estimate to y; coord_flip() only swaps the display
  ggplot(aes(reorder(term, estimate), estimate,
             fill = pos)) +
  geom_col(show.legend = FALSE, alpha = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c(col[1], col[2])) +
  labs(x = "", y = "Estimate", title = "Coefficients with p<0.01")

ggsave("../figs/coefficientslm.png", p, height = 14)

Coefficients

par(mfrow=c(2,2))
plot(fit.lm)

(2) Ridge Regression

library(glmnet)

Ridge regression is a regularization method that tries to avoid overfitting. Like OLS, ridge minimizes the residual sum of squares. However, it adds a “shrinkage” penalty on the size of the coefficients, which pulls coefficients with a minor contribution to the response toward zero. Ridge regression minimizes the following expression:

\[ \text{RSS}(\beta) + \lambda \sum^p_{j=1}\beta^2_j \]

From this we can see that:

\(\lambda = 0\): the coefficients equal the OLS coefficients
\(\lambda \to \infty\): the coefficients approach zero
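Both limits can be checked numerically with the closed-form ridge solution \((X^\top X + \lambda I)^{-1} X^\top y\). The sketch below uses simulated data and a hypothetical helper ridge_coef(); it is not the estimator glmnet runs, since glmnet standardizes the predictors and scales the residual sum of squares by \(1/(2n)\), so its \(\lambda\) values are not directly comparable.

```r
set.seed(42)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))   # centred and scaled predictors
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)
y <- y - mean(y)                          # centre y so no intercept is needed

# Closed-form minimizer of RSS(beta) + lambda * sum(beta^2)
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

ols <- unname(coef(lm(y ~ X - 1)))

rbind(ols        = ols,
      lambda_0   = ridge_coef(X, y, 0),     # identical to OLS
      lambda_10  = ridge_coef(X, y, 10),    # shrunk towards zero
      lambda_1e6 = ridge_coef(X, y, 1e6))   # practically zero
```

The coefficients shrink smoothly as \(\lambda\) grows, but none of them becomes exactly zero for a finite \(\lambda\).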

#fit the model
fit.ridge <- cv.glmnet(x.train, y.train, family='gaussian', alpha=0)

The code above performs 10-fold cross-validation (the cv.glmnet default) to choose the best \(\lambda\) and estimates a linear regression (family = “gaussian”).

#Results
plot(fit.ridge)

\(\lambda\) with the minimum cross-validated error:

lam_min <- fit.ridge$lambda.min
lam_min

## [1] 0.09945955

(3) Least Absolute Shrinkage and Selection Operator (LASSO)

Like Ridge regression, Lasso tries to avoid overfitting. The difference is that it uses the L1 norm to penalize large coefficients (Lasso is also known as L1 regularization):

\[ \text{RSS}(\beta) + \lambda \sum^p_{j=1}|\beta_j| \]

Lasso can be used to perform variable selection, as it can shrink some of the coefficients to exactly zero.
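Why the L1 penalty produces exact zeros is easiest to see in the special case of orthonormal predictors, where both penalized problems have a closed form per coefficient. The sketch below rests on that assumption, which does not hold for the term matrix used here; the helper names and the starting values are hypothetical.

```r
# Per-coefficient solutions for orthonormal predictors:
#   lasso: minimizes (b - beta)^2 + lambda * |beta|  -> soft thresholding
#   ridge: minimizes (b - beta)^2 + lambda * beta^2  -> proportional shrinkage
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
ridge_shrink   <- function(b, lambda) b / (1 + lambda)

b_ols  <- c(0.34, -0.14, 0.05, -0.01)   # made-up OLS coefficients
lambda <- 0.2

lasso <- soft_threshold(b_ols, lambda)
ridge <- ridge_shrink(b_ols, lambda)

data.frame(b_ols, lasso, ridge)
```

Coefficients smaller than \(\lambda/2\) in absolute value are set to exactly zero by the soft-threshold rule, while the ridge rule only scales every coefficient down by the same factor.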

# Fitting the model (Lasso: Alpha = 1)
fit.lasso <- cv.glmnet(x.train, y.train, family='gaussian', 
                       alpha=1)

As with Ridge, the code above performs 10-fold cross-validation to choose the best \(\lambda\) and estimates a linear regression (family = “gaussian”).

#Results
plot(fit.lasso)

\(\lambda\) with the minimum cross-validated error:

lam_min <- fit.lasso$lambda.min
lam_min

## [1] 0.002832644

Make Predictions

rmse <- function(error) {
  sqrt(mean(error^2))
  }
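To put an RMSE into context, it helps to compare it with a naive baseline that always predicts the mean. A toy sketch with made-up log prices (the vector below is not taken from the data):

```r
rmse <- function(error) {
  sqrt(mean(error^2))
}

# Made-up log prices, for illustration only
actual <- c(4.0, 4.5, 5.0, 3.5)

# Baseline: predict the mean for every observation
baseline_error <- actual - mean(actual)
rmse(baseline_error)
```

A model is only useful if its test RMSE is clearly below that of such a baseline.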

How well do the models perform when predicting the test data?

Linear Regression

pred.lm <- as.data.frame(predict(fit.lm, 
                newdata = df.test %>% select(-log_price)))

# Combine predictions with test dataframe
df.test$pred.lm <- pred.lm[,1]
df.test$error.lm <- df.test$log_price - df.test$pred.lm

rmse.lm <- rmse(df.test$error.lm)
print(paste0("The RMSE of the Linear Regression is: ", 
             rmse.lm))

## [1] "The RMSE of the Linear Regression is: 0.474996653299447"

Ridge Regression

pred.ridge <- as.data.frame(predict(fit.ridge, x.test))

# Combine predictions with test dataframe
df.test$pred.ridge <- pred.ridge[,1]
df.test$error.ridge <- df.test$log_price - df.test$pred.ridge

rmse.ridge <- rmse(df.test$error.ridge)
print(paste0("The RMSE of the Ridge Regression is: ", 
             rmse.ridge))

## [1] "The RMSE of the Ridge Regression is: 0.475314709569063"
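One detail worth checking when reading these numbers: predict() on a cv.glmnet object uses s = "lambda.1se" by default, not lambda.min. The prediction call above therefore evaluates the model at the more strongly regularized \(\lambda\) unless s is set explicitly. A minimal sketch on simulated data (the data and object names are made up for illustration and are not part of the analysis):

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(1, -0.5, 0.25)) + rnorm(n)

# Ridge with 10-fold cross-validation, as in the text
cv_fit <- cv.glmnet(x, y, family = "gaussian", alpha = 0)

pred_default <- predict(cv_fit, newx = x)                   # default s
pred_1se     <- predict(cv_fit, newx = x, s = "lambda.1se")
pred_min     <- predict(cv_fit, newx = x, s = "lambda.min")

# The default predictions coincide with those at lambda.1se
max(abs(pred_default - pred_1se))

c(lambda.min = cv_fit$lambda.min, lambda.1se = cv_fit$lambda.1se)
```

Passing s = "lambda.min" to predict() would evaluate the fit at the \(\lambda\) reported in the text; whether that changes the test RMSE noticeably would have to be checked on the actual data.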

LASSO

pred.lasso <- as.data.frame(predict(fit.lasso, x.test))

# Combine predictions with test dataframe
df.test$pred.lasso <- pred.lasso[,1]
df.test$error.lasso <- df.test$log_price - df.test$pred.lasso

rmse.lasso <- rmse(df.test$error.lasso)
print(paste0("The RMSE of the LASSO is: ", 
             rmse.lasso))

## [1] "The RMSE of the LASSO is: 0.475182995049206"
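The three test RMSEs printed above can be lined up for a direct comparison. The values in this sketch are copied from the outputs; the vector name is only illustrative.

```r
# Test RMSEs copied from the printed outputs above
rmse_test <- c(lm    = 0.474996653299447,
               ridge = 0.475314709569063,
               lasso = 0.475182995049206)

# Models ordered from lowest to highest test error
sort(rmse_test)

# Relative gap between the worst and the best model, in percent
100 * (max(rmse_test) / min(rmse_test) - 1)
```

The gap between the best and the worst model is well below 0.1 percent, so the ranking should not be over-interpreted.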

A plot of the predicted values against the actual values shows the explanatory power of the prediction models.

df.test %>% 
  select(pred.lm, pred.ridge, pred.lasso,
         log_price) %>%
  tidyr::gather(model, "predictions", pred.lm:pred.lasso) -> plot

p <- df.test %>%
  ggplot(aes(pred.lm, log_price)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = lm) +
  labs(x="Predicted y", y="Actual y",
       title = "Predicted vs. True Values",
       subtitle = "OLS with Text Data",
       caption = paste0("RMSE: ", round(rmse.lm,3)))

ggsave("../figs/residplot2.png", p)

Residual Plot

p <- df.test %>%
  ggplot(aes(pred.ridge, log_price)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = lm) +
  labs(x="Predicted y", y="Actual y",
       title = "Predicted vs. True Values",
       subtitle = "Ridge Reg with Text Data",
       caption = paste0("RMSE: ", round(rmse.ridge,3)))

ggsave("../figs/residplot3.png", p)

Residual Plot

p <- df.test %>%
  ggplot(aes(pred.lasso, log_price)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = lm) +
  labs(x="Predicted y", y="Actual y",
       title = "Predicted vs. True Values",
       subtitle = "LASSO with Text Data",
       caption = paste0("RMSE: ", round(rmse.lasso,3)))

ggsave("../figs/residplot4.png", p)

Residual Plot

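Based on the RMSE, the three models perform almost identically on the test data when I use the description text as exogenous variables: OLS (0.4750) is marginally ahead of LASSO (0.4752) and Ridge (0.4753), so the penalties yield no measurable gain here. However, compared to the regression with structured variables from part two, the text-based LASSO performs worse.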
