Can we improve the model from part 2 using the description text of the listing as exogenous variables?
As before, the goal is to predict the price \(y_i\) of each Airbnb listing \(i\) from a set of independent variables \(X\). In this model, the predictors are the counts \(c_i\) of the unique terms in each listing description. This is a regression problem like any other, except that the high dimensionality of \(c_i\) makes OLS and other standard techniques prone to overfitting.
Here, we have \(d = 20{,}637\) documents (Airbnb listings), each of which is \(w\) words long. Each word is drawn from a vocabulary of \(p = 33{,}469\) possible words, so the unique representation of each document has dimension \(p^w\). Counting terms and ignoring word order reduces this to a vector of length \(p\) per document, which is still very large. A common strategy for dealing with the high dimensionality of text data is to estimate penalized linear models (Gentzkow, 2017).
I will estimate three different models on the same training data: (1) linear regression (as in part 2), (2) penalized linear regression using ridge (L2 norm) and (3) penalized linear regression using lasso (L1 norm). I will then use these models to make predictions on the test data to see which performs best.
A first step in using text data in a prediction model is to convert it to a Document Term Matrix, where each row is an observation (document) and each column is a unique term.
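To compare the three models on the test data, some error measure is needed. The sketch below assumes root mean squared error (RMSE) on the log price; the helper name `rmse` and the toy numbers are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: root mean squared error between observed and
# predicted values (here thought of as log prices).
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy numbers, purely for illustration
actual    <- c(4.0, 4.5, 5.0)
predicted <- c(4.1, 4.4, 5.3)
rmse(actual, predicted)
```

With real models, `predicted` would come from something like `predict(fit.lm, newdata = df.test)`, and the model with the lowest test RMSE would be preferred.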
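To make the idea concrete, here is a minimal base R sketch that builds a tiny Document Term Matrix by hand. The three toy documents are made up; the real analysis uses the `tm` package as shown in the code that follows.

```r
# Three made-up listing descriptions
docs <- c("cozy room near park", "bright room", "park view park")

# Split each document into tokens and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document
toy_dtm <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
dimnames(toy_dtm) <- list(Docs = as.character(seq_along(docs)), Terms = vocab)
toy_dtm
```

Each row is one document and each column one term; the third document contains "park" twice, so that cell holds a count of 2.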
corp <- Corpus(VectorSource(df$text_cleaned))
dtm <- DocumentTermMatrix(corp)
dtm
## <<DocumentTermMatrix (documents: 20637, terms: 33469)>>
## Non-/sparse entries: 650794/690048959
## Sparsity : 100%
## Maximal term length: 132
## Weighting : term frequency (tf)
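The reported sparsity of 100% is a rounded value. As a quick check, the numbers printed above can be plugged in by hand (the variable names here are only illustrative):

```r
n_docs   <- 20637
n_terms  <- 33469
non_zero <- 650794          # non-sparse entries reported by tm

total_cells <- n_docs * n_terms        # all cells of the matrix
zero_cells  <- total_cells - non_zero  # should match the sparse count above
sparsity    <- zero_cells / total_cells

zero_cells
round(100 * sparsity, 2)
```

Fewer than one cell in a thousand is non-zero, which is why the summary rounds the sparsity up to 100%.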
For the first five documents and terms 100 to 107, the Document Term Matrix looks like this:
inspect(dtm[1:5, 100:107])
## <<DocumentTermMatrix (documents: 5, terms: 8)>>
## Non-/sparse entries: 9/31
## Sparsity : 78%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs bäcker bus ebenfalls ebenso eingerichtet emili englischen fuß
## 1 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 1 1 1 1 1 1 1 2
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 1
Matrices in text analysis problems tend to be very sparse. That is, most of the elements are zero, which means that many columns (and therefore many model parameters) carry almost no information.
Reducing sparsity tends both to reduce overfitting and to improve the predictive ability of the model. Here we remove sparse terms with a threshold of 0.99, that is, we drop every term whose sparsity (the share of documents in which the term does not occur) exceeds 99%. This reduces the vocabulary from 33,469 to 586 terms and the overall sparsity of the matrix to 97%.
dtm<-removeSparseTerms(dtm,0.99)
dtm
## <<DocumentTermMatrix (documents: 20637, terms: 586)>>
## Non-/sparse entries: 403573/11689709
## Sparsity : 97%
## Maximal term length: 21
## Weighting : term frequency (tf)
inspect(dtm[1:5, 1:7])
## <<DocumentTermMatrix (documents: 5, terms: 7)>>
## Non-/sparse entries: 7/28
## Sparsity : 80%
## Maximal term length: 12
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs ausgestattet bad badewanne befindet gelegen gemütlichen gibt
## 1 1 1 1 1 1 1 1
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
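The sparsity threshold used above can be illustrated in base R. This is a simplified sketch of the idea, not the `tm` implementation itself; the toy matrix and the threshold of 0.7 are made up.

```r
# Toy document-term matrix: 4 documents, 4 terms
toy <- matrix(c(1, 0, 0, 0,
                1, 1, 0, 0,
                1, 1, 1, 0,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("room", "park", "loft", "sauna")))

# Sparsity of a term = share of documents in which it does not occur
doc_freq      <- colSums(toy > 0)
term_sparsity <- 1 - doc_freq / nrow(toy)

# Keep only terms whose sparsity is below the (hypothetical) threshold
keep    <- term_sparsity < 0.7
reduced <- toy[, keep, drop = FALSE]
colnames(reduced)

# For the real data: with 20637 documents and a threshold of 0.99,
# a term has to occur in more than 1% of the documents to survive
min_docs <- ceiling(20637 * 0.01)
min_docs
```

In the toy example only "room" and "park" survive. For the Airbnb data, a term needs to appear in at least 207 listings to be kept.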
# Convert to data frame
dtm.df <- as.matrix(dtm) %>%
  as.data.frame()
# Merge with original data frame (row names serve as document ids)
dtm.df$document <- as.integer(rownames(dtm.df))
df$document <- as.integer(rownames(df))
df.reg <- dtm.df %>%
  left_join(df %>%
              select(document, price),
            by = "document") %>%
  # drop listings with a price of 0, since log(0) is undefined
  filter(price != 0) %>%
  mutate(log_price = log(price)) %>%
  select(-document, -price)
#define % of training and test set
bound <- floor((nrow(df.reg)/4)*3)
#sample rows
df.reg <- df.reg[sample(nrow(df.reg)), ]
# train data
df.train <- df.reg[1:bound, ]
x.train <- as.matrix(df.train %>% select(-log_price))
y.train <- as.matrix(df.train %>% select(log_price))
# test data
df.test <- df.reg[(bound+1):nrow(df.reg), ]
x.test <- as.matrix(df.test %>% select(-log_price))
y.test <- as.matrix(df.test %>% select(log_price))
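Because the rows are shuffled with `sample()`, the split changes on every run unless a seed is set. The following sketch shows the same 75/25 split on a made-up data frame with a fixed seed; the seed value and the toy data are assumptions for illustration only.

```r
set.seed(42)  # hypothetical seed, only to make the shuffle reproducible

# Made-up data standing in for df.reg
toy <- data.frame(x = 1:20, log_price = log(seq(50, 240, by = 10)))

# 75% of the rows go to the training set
bound <- floor(nrow(toy) * 0.75)

# Shuffle the rows, then cut at the bound
shuffled  <- toy[sample(nrow(toy)), ]
toy_train <- shuffled[1:bound, ]
toy_test  <- shuffled[(bound + 1):nrow(shuffled), ]

c(train = nrow(toy_train), test = nrow(toy_test))
```

Training and test rows never overlap, because both are slices of the same shuffled data frame.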
fit.lm <- lm(log_price~., data = df.train)
summary(fit.lm)
##
## Call:
## lm(formula = log_price ~ ., data = df.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.17999 -0.28385 -0.01883 0.26048 2.67883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.2272067 0.0099419 425.191 < 2e-16 ***
## ausgestattet 0.0015780 0.0175820 0.090 0.928488
## bad -0.0105836 0.0141975 -0.745 0.456007
## badewanne -0.0051111 0.0229772 -0.222 0.823971
## befindet -0.0157357 0.0115442 -1.363 0.172877
## gelegen -0.0182134 0.0155811 -1.169 0.242445
## gemütlichen -0.0492232 0.0250931 -1.962 0.049825 *
## gibt -0.0063967 0.0123702 -0.517 0.605091
## große 0.0852306 0.0147256 5.788 7.27e-09 ***
## großem -0.0341203 0.0272284 -1.253 0.210184
## gut -0.0184871 0.0157399 -1.175 0.240197
## küche -0.0209072 0.0124514 -1.679 0.093152 .
## lädt 0.0080828 0.0371274 0.218 0.827663
## moderne 0.1096874 0.0224613 4.883 1.05e-06 ***
## schlafzimmer 0.1326854 0.0149071 8.901 < 2e-16 ***
## schönes -0.0545600 0.0172711 -3.159 0.001586 **
## terrasse 0.0870699 0.0225338 3.864 0.000112 ***
## vorhanden -0.0066288 0.0176891 -0.375 0.707860
## wlan 0.0272998 0.0187246 1.458 0.144872
## wohnung 0.0421703 0.0056075 7.520 5.78e-14 ***
## wohnzimmer 0.0155356 0.0151299 1.027 0.304523
## apartment 0.0746373 0.0056505 13.209 < 2e-16 ***
## around -0.0171555 0.0217623 -0.788 0.430526
## balcony 0.0064247 0.0118593 0.542 0.588003
## bed 0.0147667 0.0129647 1.139 0.254727
## bright -0.0260227 0.0130274 -1.998 0.045785 *
## central 0.0253044 0.0099153 2.552 0.010718 *
## city 0.0265573 0.0091792 2.893 0.003819 **
## corner 0.0314928 0.0287824 1.094 0.273899
## cozy -0.0605017 0.0109373 -5.532 3.23e-08 ***
## distance 0.0221905 0.0265621 0.835 0.403495
## district -0.0419405 0.0256535 -1.635 0.102095
## equipped -0.0099017 0.0242858 -0.408 0.683488
## famous 0.0190003 0.0236676 0.803 0.422103
## fully 0.0060799 0.0240739 0.253 0.800619
## furnished 0.0069534 0.0228430 0.304 0.760826
## great -0.0182561 0.0153623 -1.188 0.234708
## hamburg -0.0101505 0.0110159 -0.921 0.356837
## house -0.0119051 0.0219289 -0.543 0.587210
## little -0.0225041 0.0232835 -0.967 0.333798
## middle 0.0231360 0.0274956 0.841 0.400113
## min -0.0272186 0.0050563 -5.383 7.43e-08 ***
## neustadt -0.1168588 0.0202939 -5.758 8.66e-09 ***
## one 0.0259935 0.0117029 2.221 0.026358 *
## public 0.0030808 0.0300647 0.102 0.918382
## quiet -0.0333059 0.0117427 -2.836 0.004570 **
## reeperbahn 0.0350781 0.0254327 1.379 0.167838
## shops -0.0062047 0.0190579 -0.326 0.744754
## side 0.0601929 0.0342783 1.756 0.079108 .
## transport -0.0214931 0.0312633 -0.687 0.491787
## walking 0.0169308 0.0231493 0.731 0.464563
## wifi 0.0281592 0.0184567 1.526 0.127107
## bahn 0.0012143 0.0076904 0.158 0.874545
## balkon 0.0142836 0.0100466 1.422 0.155127
## bars 0.0001337 0.0125924 0.011 0.991532
## befinden -0.0195278 0.0217744 -0.897 0.369826
## braucht -0.0320069 0.0230030 -1.391 0.164119
## bus -0.0065785 0.0116846 -0.563 0.573443
## bäcker -0.0166970 0.0345527 -0.483 0.628937
## ebenfalls 0.0635662 0.0264886 2.400 0.016418 *
## eingerichtet 0.0413982 0.0217465 1.904 0.056973 .
## fuß 0.0210797 0.0128362 1.642 0.100567
## garten 0.0209353 0.0156985 1.334 0.182360
## gemütliche -0.0343760 0.0126727 -2.713 0.006683 **
## helle 0.0116063 0.0153373 0.757 0.449220
## innenhof 0.0423675 0.0299923 1.413 0.157790
## kleine -0.0684600 0.0169583 -4.037 5.44e-05 ***
## lage -0.0189328 0.0120287 -1.574 0.115517
## liebe -0.0727968 0.0308751 -2.358 0.018397 *
## nähe -0.0364430 0.0111999 -3.254 0.001141 **
## restaurant 0.0927973 0.0314932 2.947 0.003218 **
## rewe 0.0008176 0.0342977 0.024 0.980982
## ruhig -0.0060956 0.0136155 -0.448 0.654378
## schöne 0.0400002 0.0133553 2.995 0.002748 **
## sowie -0.0207920 0.0132880 -1.565 0.117670
## super -0.0118280 0.0142130 -0.832 0.405313
## top 0.0280171 0.0168849 1.659 0.097076 .
## tram 0.0017737 0.0148211 0.120 0.904744
## unmittelbarer 0.0569164 0.0209369 2.718 0.006566 **
## verkehrsmittel 0.0286939 0.0368293 0.779 0.435931
## viele -0.0123045 0.0192811 -0.638 0.523376
## zentraler 0.0287066 0.0240912 1.192 0.233445
## öffentlichen -0.0420288 0.0336979 -1.247 0.212335
## ausstattung 0.1406991 0.0306480 4.591 4.45e-06 ***
## bietet 0.0038632 0.0165232 0.234 0.815140
## blick 0.0256030 0.0192029 1.333 0.182458
## cafes 0.0160609 0.0181911 0.883 0.377305
## gemütlichkeit -0.0049912 0.0394485 -0.127 0.899319
## mitten 0.0575851 0.0138532 4.157 3.25e-05 ***
## perfekt 0.0194842 0.0217817 0.895 0.371058
## restaurants 0.0549903 0.0115340 4.768 1.88e-06 ***
## ruhe -0.0392606 0.0342331 -1.147 0.251458
## schönen -0.0354990 0.0163021 -2.178 0.029453 *
## schönsten 0.0205983 0.0347993 0.592 0.553916
## stadt 0.0049865 0.0168819 0.295 0.767713
## stock -0.0243093 0.0229297 -1.060 0.289087
## doppelbett 0.0057392 0.0180476 0.318 0.750488
## matratze -0.0504601 0.0363941 -1.386 0.165618
## messe 0.0872943 0.0118760 7.350 2.08e-13 ***
## oktoberfest 0.2674595 0.0119351 22.410 < 2e-16 ***
## personen 0.0360438 0.0131007 2.751 0.005943 **
## stadtzentrum -0.0116049 0.0254652 -0.456 0.648602
## zimmer -0.1084624 0.0059340 -18.278 < 2e-16 ***
## beautiful 0.0068823 0.0117726 0.585 0.558820
## berlin -0.0221973 0.0063600 -3.490 0.000484 ***
## big 0.0211750 0.0113727 1.862 0.062635 .
## cool 0.0201674 0.0350777 0.575 0.565343
## diverse -0.0713262 0.0351468 -2.029 0.042438 *
## everything 0.0125282 0.0218337 0.574 0.566110
## friedrichshain -0.0232669 0.0135908 -1.712 0.086925 .
## just -0.0063141 0.0131233 -0.481 0.630429
## kreuzberg 0.0348044 0.0117793 2.955 0.003135 **
## meters 0.0268615 0.0288529 0.931 0.351880
## minutes -0.0353404 0.0095027 -3.719 0.000201 ***
## mitte 0.1026752 0.0124148 8.270 < 2e-16 ***
## nearby -0.0327500 0.0263133 -1.245 0.213293
## need -0.0211064 0.0222383 -0.949 0.342584
## neighbourhood 0.0044518 0.0293598 0.152 0.879483
## situated -0.0134587 0.0284715 -0.473 0.636430
## stations 0.0198568 0.0266727 0.744 0.456610
## stay -0.0004650 0.0174833 -0.027 0.978782
## straße -0.0409802 0.0183789 -2.230 0.025779 *
## subway 0.0492660 0.0194151 2.538 0.011175 *
## tor 0.0832458 0.0241670 3.445 0.000573 ***
## außerdem -0.0060825 0.0288474 -0.211 0.833007
## gemütliches -0.1132463 0.0154768 -7.317 2.66e-13 ***
## gerne -0.0391902 0.0184674 -2.122 0.033843 *
## parks 0.0029712 0.0199765 0.149 0.881763
## spree -0.0241459 0.0305209 -0.791 0.428882
## vermiete -0.0325655 0.0236561 -1.377 0.168650
## viertel -0.0004609 0.0204736 -0.023 0.982039
## zuhause 0.0468754 0.0318377 1.472 0.140954
## zwei 0.0239291 0.0131607 1.818 0.069049 .
## bedroom 0.0101992 0.0146708 0.695 0.486941
## direkter -0.0037489 0.0281091 -0.133 0.893903
## garden 0.0986613 0.0210609 4.685 2.83e-06 ***
## hinterhof -0.0326399 0.0303210 -1.076 0.281731
## ruhigen -0.0212747 0.0188476 -1.129 0.259010
## verfügt 0.0085270 0.0204054 0.418 0.676039
## waschmaschine 0.0143476 0.0229663 0.625 0.532161
## wunderschönen 0.0560929 0.0319584 1.755 0.079248 .
## zentral 0.0290875 0.0120755 2.409 0.016017 *
## zugang 0.0117976 0.0290911 0.406 0.685088
## appartement 0.0346240 0.0171681 2.017 0.043738 *
## away -0.0088577 0.0126724 -0.699 0.484579
## door -0.0065333 0.0299309 -0.218 0.827216
## front -0.0149020 0.0313047 -0.476 0.634060
## good -0.0367215 0.0197328 -1.861 0.062772 .
## kitchen 0.0133101 0.0160066 0.832 0.405684
## nice -0.0314263 0.0111012 -2.831 0.004648 **
## really -0.0128688 0.0291311 -0.442 0.658673
## relax 0.0492222 0.0386989 1.272 0.203419
## room -0.1246793 0.0064275 -19.398 < 2e-16 ***
## see -0.0652959 0.0294538 -2.217 0.026646 *
## small -0.0659632 0.0174887 -3.772 0.000163 ***
## sqm 0.0605256 0.0218520 2.770 0.005616 **
## takes -0.0726685 0.0337726 -2.152 0.031437 *
## underground 0.0486573 0.0284827 1.708 0.087599 .
## walk 0.0182623 0.0124433 1.468 0.142223
## appartment 0.0650321 0.0140900 4.615 3.95e-06 ***
## entfernt -0.0148852 0.0116717 -1.275 0.202215
## fernseher -0.0171077 0.0326043 -0.525 0.599796
## nah -0.0763087 0.0267219 -2.856 0.004301 **
## people 0.0256275 0.0162798 1.574 0.115465
## ruhige -0.0254296 0.0167809 -1.515 0.129693
## cafés 0.0264870 0.0162035 1.635 0.102145
## ecke 0.0197859 0.0215652 0.917 0.358899
## markt 0.0878370 0.0286004 3.071 0.002136 **
## prenzlauer 0.0408291 0.0354271 1.152 0.249142
## strasse -0.0116010 0.0309751 -0.375 0.708019
## unsere 0.0174291 0.0154623 1.127 0.259676
## zentrale 0.0826315 0.0204897 4.033 5.54e-05 ***
## internet -0.0118211 0.0231573 -0.510 0.609729
## modernes 0.0148115 0.0294731 0.503 0.615294
## dass 0.0120850 0.0309205 0.391 0.695919
## deutz 0.0210487 0.0286016 0.736 0.461786
## herzen 0.0893597 0.0132752 6.731 1.74e-11 ***
## köln -0.0277469 0.0131336 -2.113 0.034646 *
## nahe -0.0396290 0.0211971 -1.870 0.061566 .
## access 0.0290595 0.0243583 1.193 0.232888
## airport -0.0568397 0.0200457 -2.836 0.004582 **
## center 0.0200194 0.0121686 1.645 0.099955 .
## easy -0.0394133 0.0275989 -1.428 0.153291
## floor 0.0089184 0.0174592 0.511 0.609489
## located -0.0051653 0.0118827 -0.435 0.663793
## location 0.0127368 0.0155421 0.820 0.412513
## lovely 0.0064576 0.0173651 0.372 0.709993
## minute -0.0114820 0.0206661 -0.556 0.578496
## park 0.0040679 0.0127668 0.319 0.750011
## supermarkets -0.0811938 0.0263650 -3.080 0.002077 **
## view 0.0122063 0.0191641 0.637 0.524177
## alster 0.0524945 0.0196509 2.671 0.007563 **
## altbau 0.0206829 0.0145884 1.418 0.156281
## altbauwohnung 0.0761499 0.0137577 5.535 3.16e-08 ***
## charming 0.0821146 0.0207380 3.960 7.54e-05 ***
## comfortable -0.0396128 0.0175003 -2.264 0.023616 *
## eimsbüttel 0.0171372 0.0196223 0.873 0.382486
## flat 0.0359521 0.0076212 4.717 2.41e-06 ***
## floors 0.0486565 0.0424595 1.146 0.251834
## heart 0.0355267 0.0129317 2.747 0.006017 **
## many 0.0172921 0.0212225 0.815 0.415199
## offers 0.0151267 0.0328505 0.460 0.645186
## part -0.0159139 0.0303145 -0.525 0.599618
## popular -0.0036247 0.0363784 -0.100 0.920632
## wooden -0.0791158 0.0421691 -1.876 0.060654 .
## anbindung -0.0163663 0.0191839 -0.853 0.393603
## auto 0.0621589 0.0329836 1.885 0.059512 .
## berlins 0.0112818 0.0229913 0.491 0.623647
## bettwäsche 0.0184373 0.0385309 0.479 0.632296
## direkt 0.0041100 0.0102488 0.401 0.688407
## erkunden -0.0190442 0.0322385 -0.591 0.554713
## erreichen 0.0137518 0.0143984 0.955 0.339547
## fenster -0.0266810 0.0268862 -0.992 0.321034
## freuen -0.0504052 0.0307414 -1.640 0.101099
## fuss 0.0500963 0.0245417 2.041 0.041242 *
## handtücher -0.0326666 0.0375168 -0.871 0.383921
## haustür 0.0185165 0.0254601 0.727 0.467068
## kleiderschrank -0.0222209 0.0385094 -0.577 0.563931
## komplett 0.0251911 0.0224358 1.123 0.261538
## kurfürstendamm 0.0501545 0.0301395 1.664 0.096118 .
## kühlschrank -0.0590572 0.0305437 -1.934 0.053190 .
## liegt -0.0175621 0.0109397 -1.605 0.108436
## meter -0.0038408 0.0206467 -0.186 0.852428
## minuten -0.0210673 0.0079275 -2.657 0.007881 **
## nachbarschaft 0.0430704 0.0287822 1.496 0.134565
## neu 0.0544815 0.0244485 2.228 0.025868 *
## platz 0.0181049 0.0130044 1.392 0.163880
## potsdamer 0.0484570 0.0279157 1.736 0.082614 .
## private 0.0085439 0.0151098 0.565 0.571774
## schreibtisch -0.0937358 0.0297856 -3.147 0.001653 **
## sehenswürdigkeiten 0.0893435 0.0296974 3.008 0.002630 **
## station -0.0340410 0.0106558 -3.195 0.001403 **
## verkehrsanbindung -0.0603553 0.0290956 -2.074 0.038062 *
## wenige -0.0111087 0.0273491 -0.406 0.684615
## wenigen -0.0379286 0.0277297 -1.368 0.171395
## wohnen 0.0373002 0.0207518 1.797 0.072285 .
## einkaufsmöglichkeiten -0.0373260 0.0174262 -2.142 0.032214 *
## erreichbar -0.0154662 0.0156404 -0.989 0.322745
## fußläufig -0.0080428 0.0204400 -0.393 0.693967
## großes -0.0315261 0.0158179 -1.993 0.046273 *
## gute -0.0125170 0.0205344 -0.610 0.542162
## guter -0.0378440 0.0324059 -1.168 0.242902
## sofa -0.0328983 0.0191785 -1.715 0.086297 .
## stadtpark -0.0802948 0.0249240 -3.222 0.001278 **
## weitere -0.0405505 0.0326688 -1.241 0.214529
## hell 0.0020565 0.0216266 0.095 0.924242
## umgebung 0.0043583 0.0201797 0.216 0.829010
## bahnhof -0.0295209 0.0187778 -1.572 0.115944
## etage -0.0095915 0.0308690 -0.311 0.756021
## `m²` 0.0231938 0.0149752 1.549 0.121449
## paar -0.0344290 0.0320544 -1.074 0.282804
## stehen 0.0655138 0.0365249 1.794 0.072885 .
## verfügung -0.0615230 0.0265053 -2.321 0.020292 *
## calm -0.0273191 0.0263244 -1.038 0.299388
## centre -0.0199528 0.0181190 -1.101 0.270824
## get -0.0081989 0.0226305 -0.362 0.717137
## single -0.0892672 0.0277784 -3.214 0.001314 **
## altstadt 0.0641370 0.0244432 2.624 0.008701 **
## aufzug 0.0980977 0.0378626 2.591 0.009582 **
## bett -0.0274573 0.0150312 -1.827 0.067768 .
## cologne -0.0089123 0.0163580 -0.545 0.585881
## dom 0.0032825 0.0255171 0.129 0.897644
## geeignet 0.0127017 0.0219490 0.579 0.562804
## hauptbahnhof 0.0023639 0.0156212 0.151 0.879722
## hohe 0.0639593 0.0490541 1.304 0.192304
## kölner -0.0084563 0.0221481 -0.382 0.702611
## lan -0.0296429 0.0315850 -0.939 0.347997
## max 0.0599493 0.0270813 2.214 0.026866 *
## neben -0.0006954 0.0319421 -0.022 0.982632
## studio -0.0073350 0.0129771 -0.565 0.571931
## hamburgs 0.0065965 0.0263963 0.250 0.802665
## schnell -0.0208147 0.0224965 -0.925 0.354855
## zentrum 0.0305566 0.0153294 1.993 0.046242 *
## willkommen 0.0339360 0.0259116 1.310 0.190323
## cosy -0.0383129 0.0125535 -3.052 0.002277 **
## `next` 0.0203381 0.0183290 1.110 0.267182
## schöneberg -0.0713584 0.0233465 -3.056 0.002243 **
## take 0.0097825 0.0282978 0.346 0.729574
## gelegene -0.0238740 0.0327496 -0.729 0.466023
## mehr -0.0403367 0.0345839 -1.166 0.243494
## modern 0.0382764 0.0136429 2.806 0.005029 **
## unterkunft -0.0068480 0.0134993 -0.507 0.611965
## abenteurer -0.0889668 0.0368628 -2.413 0.015814 *
## adventurers 0.0168575 0.0694009 0.243 0.808086
## alleinreisende -0.1721441 0.0353601 -4.868 1.14e-06 ***
## amp -0.0041262 0.0079916 -0.516 0.605637
## business 0.0378505 0.0319553 1.184 0.236241
## couples -0.0285087 0.0333772 -0.854 0.393042
## geschäftsreisende 0.0769661 0.0276406 2.785 0.005367 **
## paare 0.0655560 0.0269033 2.437 0.014833 *
## solo -0.1037931 0.0654986 -1.585 0.113065
## night -0.0030257 0.0299931 -0.101 0.919647
## perfect 0.0002585 0.0168008 0.015 0.987726
## gemütlich -0.0467695 0.0201436 -2.322 0.020257 *
## hbf -0.0367380 0.0199481 -1.842 0.065542 .
## bahnstation -0.0510878 0.0316898 -1.612 0.106957
## bushaltestelle 0.0069445 0.0354083 0.196 0.844515
## stadtteil -0.0141114 0.0223330 -0.632 0.527487
## west 0.0060729 0.0221397 0.274 0.783859
## öffentliche -0.0093875 0.0380115 -0.247 0.804938
## bathroom -0.0361073 0.0173824 -2.077 0.037797 *
## clubs -0.0479126 0.0216011 -2.218 0.026566 *
## right 0.0223583 0.0187445 1.193 0.232970
## spacious 0.0429690 0.0149517 2.874 0.004061 **
## area 0.0043708 0.0135300 0.323 0.746665
## find 0.0080154 0.0221917 0.361 0.717962
## hip 0.0060453 0.0330285 0.183 0.854775
## quite -0.0529064 0.0267448 -1.978 0.047926 *
## sunny 0.0123350 0.0164643 0.749 0.453751
## transportation -0.0658683 0.0363833 -1.810 0.070254 .
## etc -0.0367728 0.0190565 -1.930 0.053666 .
## gehminuten 0.0462673 0.0168431 2.747 0.006022 **
## geschäfte -0.0328735 0.0307713 -1.068 0.285395
## kleiner -0.0283936 0.0325822 -0.871 0.383525
## kneipen -0.0186916 0.0306086 -0.611 0.541430
## ausgestattete 0.0521978 0.0294744 1.771 0.076589 .
## voll -0.0226327 0.0232909 -0.972 0.331195
## berg -0.0174849 0.0337473 -0.518 0.604387
## close -0.0190916 0.0111876 -1.706 0.087936 .
## large 0.0550549 0.0172666 3.189 0.001433 **
## rooms 0.0889164 0.0176288 5.044 4.62e-07 ***
## two 0.0089165 0.0137339 0.649 0.516201
## dennoch 0.0406930 0.0322213 1.263 0.206638
## essen 0.0167992 0.0357171 0.470 0.638117
## findet 0.0062441 0.0305108 0.205 0.837847
## tolle 0.0824318 0.0259009 3.183 0.001463 **
## besteht -0.0257565 0.0373742 -0.689 0.490739
## sonnige 0.0744645 0.0330986 2.250 0.024477 *
## wedding -0.1515361 0.0255795 -5.924 3.21e-09 ***
## centrally 0.0186651 0.0329933 0.566 0.571589
## design 0.1096467 0.0221811 4.943 7.77e-07 ***
## foot 0.0580122 0.0299922 1.934 0.053103 .
## metro -0.0124265 0.0187982 -0.661 0.508592
## near -0.0719854 0.0140390 -5.128 2.97e-07 ***
## please 0.0178387 0.0228277 0.781 0.434551
## renovated 0.0413242 0.0245117 1.686 0.091836 .
## towels 0.0230894 0.0331628 0.696 0.486286
## families 0.2178464 0.0385186 5.656 1.58e-08 ***
## drei 0.0635669 0.0295109 2.154 0.031255 *
## erreicht -0.0171078 0.0247659 -0.691 0.489714
## familien 0.1919201 0.0295049 6.505 8.03e-11 ***
## family 0.1775234 0.0314141 5.651 1.62e-08 ***
## ottensen 0.0160634 0.0220072 0.730 0.465453
## schön -0.0024359 0.0307289 -0.079 0.936818
## innenstadt 0.0293021 0.0156434 1.873 0.061070 .
## kindern 0.0837906 0.0367161 2.282 0.022496 *
## spülmaschine 0.0748320 0.0391964 1.909 0.056261 .
## place 0.0009569 0.0129594 0.074 0.941142
## plenty -0.0222648 0.0344811 -0.646 0.518478
## eigenem 0.0273029 0.0314128 0.869 0.384771
## haus 0.0236429 0.0185287 1.276 0.201969
## green -0.0639329 0.0244850 -2.611 0.009034 **
## neukölln -0.0498698 0.0122227 -4.080 4.53e-05 ***
## open 0.0056514 0.0283148 0.200 0.841802
## shower 0.0146515 0.0277023 0.529 0.596889
## space -0.0340906 0.0202161 -1.686 0.091757 .
## well -0.0473388 0.0157506 -3.006 0.002656 **
## badezimmer -0.0078532 0.0216824 -0.362 0.717213
## groß -0.0448007 0.0240030 -1.866 0.061997 .
## haltestelle -0.0191489 0.0264057 -0.725 0.468353
## kommt 0.0437180 0.0306922 1.424 0.154352
## rhein 0.0195469 0.0297851 0.656 0.511665
## zimmerwohnung 0.0314864 0.0245628 1.282 0.199907
## building 0.0330484 0.0215751 1.532 0.125597
## ferienwohnung 0.0935400 0.0214440 4.362 1.30e-05 ***
## grünen -0.0905092 0.0241958 -3.741 0.000184 ***
## inklusive 0.0340858 0.0330113 1.033 0.301831
## munich 0.0835892 0.0129136 6.473 9.91e-11 ***
## münchen 0.0825240 0.0172882 4.773 1.83e-06 ***
## terrace 0.1067716 0.0233735 4.568 4.96e-06 ***
## verkehrsmitteln 0.0334585 0.0427257 0.783 0.433581
## zahlreiche 0.0334719 0.0321144 1.042 0.297304
## genießen 0.0333937 0.0348082 0.959 0.337392
## immer 0.0080604 0.0322288 0.250 0.802515
## inkl 0.0082213 0.0284506 0.289 0.772610
## leben -0.0073690 0.0289317 -0.255 0.798956
## lieben 0.0176288 0.0351218 0.502 0.615723
## connection -0.0385095 0.0310545 -1.240 0.214972
## loft 0.1834140 0.0183816 9.978 < 2e-16 ***
## places -0.0154866 0.0313826 -0.493 0.621683
## welcome -0.0715157 0.0191831 -3.728 0.000194 ***
## altona -0.0060721 0.0196414 -0.309 0.757215
## hafen 0.0150357 0.0303043 0.496 0.619791
## kleines -0.1022711 0.0235241 -4.348 1.39e-05 ***
## pauli 0.0415730 0.0173740 2.393 0.016732 *
## can -0.0050004 0.0116228 -0.430 0.667040
## charlottenburg -0.0795109 0.0206600 -3.849 0.000119 ***
## coffee 0.0646154 0.0306576 2.108 0.035079 *
## double -0.0039472 0.0196940 -0.200 0.841149
## home 0.0013945 0.0161026 0.087 0.930992
## light -0.0219643 0.0231560 -0.949 0.342874
## living 0.1081499 0.0154779 6.987 2.92e-12 ***
## lots 0.0197867 0.0257619 0.768 0.442464
## love 0.0040388 0.0244562 0.165 0.868833
## machine 0.0494412 0.0442931 1.116 0.264343
## market 0.0551051 0.0342465 1.609 0.107621
## new 0.0561093 0.0197141 2.846 0.004431 **
## person -0.0697206 0.0211931 -3.290 0.001005 **
## reach -0.0454023 0.0243221 -1.867 0.061962 .
## shopping 0.0001910 0.0203827 0.009 0.992525
## sleeping 0.0796198 0.0272577 2.921 0.003494 **
## street 0.0014187 0.0187344 0.076 0.939638
## washing -0.1013254 0.0497157 -2.038 0.041558 *
## comfy -0.1255172 0.0305184 -4.113 3.93e-05 ***
## entspannen 0.0147078 0.0322403 0.456 0.648257
## kiez -0.0311071 0.0217257 -1.432 0.152220
## seid 0.0235401 0.0276767 0.851 0.395039
## einfach -0.0530529 0.0287035 -1.848 0.064578 .
## geräumige 0.0349930 0.0349642 1.001 0.316929
## supermarkt -0.0221015 0.0277757 -0.796 0.426211
## schwabing 0.1033279 0.0211311 4.890 1.02e-06 ***
## perfekte 0.0321097 0.0317770 1.010 0.312287
## ausblick 0.0551619 0.0296221 1.862 0.062596 .
## couch -0.0712591 0.0183138 -3.891 0.000100 ***
## hallo -0.0194405 0.0290184 -0.670 0.502908
## neighborhood 0.0148674 0.0219497 0.677 0.498201
## tür 0.0353191 0.0189795 1.861 0.062777 .
## wunderschöne 0.0826886 0.0251620 3.286 0.001018 **
## style 0.0320481 0.0318602 1.006 0.314480
## couple 0.0036270 0.0340901 0.106 0.915271
## feel -0.0021733 0.0268665 -0.081 0.935529
## stylish 0.1152361 0.0289179 3.985 6.78e-05 ***
## supermärkte -0.0331059 0.0249055 -1.329 0.183782
## best 0.0786466 0.0198424 3.964 7.42e-05 ***
## hamburger 0.0074455 0.0236326 0.315 0.752728
## bitte 0.0390429 0.0238507 1.637 0.101657
## eingerichtete 0.0512278 0.0281995 1.817 0.069295 .
## elbe -0.0041152 0.0234557 -0.175 0.860733
## flur -0.0310269 0.0330494 -0.939 0.347846
## genutzt -0.0409369 0.0300114 -1.364 0.172574
## großer 0.0224288 0.0251077 0.893 0.371710
## herzlich -0.0487032 0.0382508 -1.273 0.202946
## leute -0.0178801 0.0315529 -0.567 0.570947
## liebevoll -0.0135968 0.0342997 -0.396 0.691807
## schlafen 0.0244289 0.0311726 0.784 0.433249
## tollen -0.0020099 0.0357569 -0.056 0.955174
## vielen 0.0360216 0.0235108 1.532 0.125513
## wohnküche 0.0110719 0.0274855 0.403 0.687081
## aufenthalt 0.0058531 0.0271748 0.215 0.829468
## like -0.0188466 0.0219959 -0.857 0.391558
## live -0.0359883 0.0261743 -1.375 0.169169
## separate 0.0073936 0.0280202 0.264 0.791886
## clean -0.0223431 0.0235213 -0.950 0.342174
## dusche 0.0040944 0.0217994 0.188 0.851018
## helles -0.0527355 0.0188620 -2.796 0.005183 **
## maisonette 0.1363753 0.0258500 5.276 1.34e-07 ***
## schlafsofa 0.0341525 0.0288134 1.185 0.235918
## zentralen 0.0220100 0.0341652 0.644 0.519440
## ideal -0.0011295 0.0192521 -0.059 0.953215
## raum -0.0609058 0.0262473 -2.320 0.020329 *
## amazing 0.0864382 0.0322674 2.679 0.007397 **
## marienplatz 0.0858374 0.0211980 4.049 5.16e-05 ***
## mins -0.0465658 0.0171212 -2.720 0.006540 **
## stops 0.0097886 0.0312474 0.313 0.754087
## travelers -0.0320675 0.0403124 -0.795 0.426350
## zeit -0.0190542 0.0276181 -0.690 0.490258
## fair 0.0959029 0.0221523 4.329 1.51e-05 ***
## parkplatz 0.0475967 0.0349451 1.362 0.173206
## fährt -0.0848064 0.0327125 -2.592 0.009538 **
## unserer -0.0019063 0.0249059 -0.077 0.938991
## wohn 0.0148033 0.0280941 0.527 0.598258
## ubahn -0.0054081 0.0237566 -0.228 0.819925
## wegen -0.0294263 0.0373734 -0.787 0.431084
## fußweg 0.0249783 0.0206861 1.207 0.227262
## straßenbahn -0.0075588 0.0312250 -0.242 0.808726
## charmante 0.0519098 0.0303902 1.708 0.087637 .
## sternschanze 0.0174594 0.0268229 0.651 0.515112
## kaffee 0.0118242 0.0378447 0.312 0.754711
## apt 0.1550890 0.0247717 6.261 3.94e-10 ***
## old -0.0316271 0.0217917 -1.451 0.146707
## schanze 0.0203551 0.0227707 0.894 0.371382
## szeneviertel 0.0053815 0.0344135 0.156 0.875738
## trotzdem 0.0149152 0.0292087 0.511 0.609609
## biete -0.0554153 0.0281747 -1.967 0.049220 *
## großen 0.0169452 0.0182168 0.930 0.352283
## beliebten 0.0070505 0.0297871 0.237 0.812895
## schlafcouch -0.0393363 0.0248342 -1.584 0.113224
## whg 0.0858197 0.0250940 3.420 0.000628 ***
## dining 0.1040366 0.0377715 2.754 0.005888 **
## persons 0.0121056 0.0290650 0.417 0.677049
## supermarket -0.0903169 0.0312759 -2.888 0.003886 **
## table -0.0388291 0.0350301 -1.108 0.267686
## within -0.0052036 0.0205301 -0.253 0.799916
## art 0.0485094 0.0277438 1.748 0.080402 .
## läden -0.0693200 0.0372139 -1.863 0.062518 .
## freunde 0.0018007 0.0373162 0.048 0.961513
## gegenüber 0.0884683 0.0316014 2.800 0.005125 **
## könnt -0.0073215 0.0251126 -0.292 0.770637
## decken -0.0096048 0.0478037 -0.201 0.840763
## hohen 0.0082189 0.0493542 0.167 0.867742
## east -0.0589399 0.0365483 -1.613 0.106841
## alexanderplatz -0.0013642 0.0173007 -0.079 0.937151
## sleep -0.0822790 0.0305517 -2.693 0.007087 **
## gegend -0.0830992 0.0351326 -2.365 0.018028 *
## nette -0.0502778 0.0297856 -1.688 0.091434 .
## str 0.0480034 0.0303332 1.583 0.113548
## lot 0.0094123 0.0249370 0.377 0.705849
## vermieten 0.0508696 0.0262474 1.938 0.052632 .
## linie -0.0360552 0.0277221 -1.301 0.193419
## time -0.0724765 0.0237121 -3.057 0.002243 **
## flughafen -0.0189658 0.0206114 -0.920 0.357502
## tegel -0.0740299 0.0355558 -2.082 0.037353 *
## parkplätze -0.0497401 0.0359865 -1.382 0.166934
## ruhiger -0.0434152 0.0280546 -1.548 0.121758
## kleinen -0.0531419 0.0223141 -2.382 0.017253 *
## bieten -0.0011059 0.0262972 -0.042 0.966457
## möglich 0.0097431 0.0298723 0.326 0.744308
## windows -0.0252808 0.0334982 -0.755 0.450446
## gäste 0.0386968 0.0204940 1.888 0.059018 .
## weit -0.0869000 0.0351455 -2.473 0.013425 *
## offer -0.0200518 0.0267544 -0.749 0.453582
## share -0.1108477 0.0292826 -3.785 0.000154 ***
## bedrooms 0.3374616 0.0315470 10.697 < 2e-16 ***
## ruhiges -0.0618790 0.0235110 -2.632 0.008499 **
## tag 0.0209486 0.0324791 0.645 0.518945
## free 0.0102511 0.0212293 0.483 0.629191
## full -0.0259025 0.0264483 -0.979 0.327416
## berliner 0.0158723 0.0211752 0.750 0.453527
## guest 0.0389975 0.0307485 1.268 0.204720
## huge 0.0501205 0.0251305 1.994 0.046126 *
## nachtleben -0.0265521 0.0361239 -0.735 0.462334
## kultur -0.0036039 0.0381981 -0.094 0.924834
## wohlfühlen -0.0158425 0.0321484 -0.493 0.622165
## stationen -0.0283611 0.0252611 -1.123 0.261575
## available 0.0176864 0.0242992 0.728 0.466711
## lively 0.0519310 0.0348526 1.490 0.136241
## renting 0.0223174 0.0348973 0.640 0.522497
## use 0.0007842 0.0225515 0.035 0.972262
## isar 0.0291499 0.0234130 1.245 0.213141
## parking 0.0127584 0.0311613 0.409 0.682230
## main 0.0095301 0.0196461 0.485 0.627622
## enjoy -0.0164929 0.0209865 -0.786 0.431950
## want -0.0007066 0.0279661 -0.025 0.979843
## desk -0.1243197 0.0367558 -3.382 0.000721 ***
## guests 0.0078682 0.0256251 0.307 0.758809
## shared -0.1426690 0.0203152 -7.023 2.27e-12 ***
## frisch -0.0333642 0.0346415 -0.963 0.335500
## renoviert -0.0094209 0.0381375 -0.247 0.804893
## mauerpark -0.0294266 0.0259880 -1.132 0.257520
## ceilings 0.0061717 0.0404690 0.153 0.878791
## high 0.0718792 0.0272675 2.636 0.008396 **
## finden 0.0021926 0.0313633 0.070 0.944267
## steht 0.0302083 0.0300388 1.006 0.314604
## explore -0.0657959 0.0347022 -1.896 0.057977 .
## downtown 0.0469254 0.0330444 1.420 0.155608
## happy 0.0655928 0.0353183 1.857 0.063304 .
## river 0.0017274 0.0273660 0.063 0.949671
## extra 0.0233837 0.0288801 0.810 0.418136
## bathtub 0.0358605 0.0348730 1.028 0.303817
## trendy 0.0524130 0.0312031 1.680 0.093029 .
## day -0.0288107 0.0324654 -0.887 0.374864
## direct -0.0231888 0.0308943 -0.751 0.452913
## english -0.0772021 0.0315277 -2.449 0.014348 *
## looking -0.0490750 0.0356830 -1.375 0.169058
## bequem 0.0238798 0.0364884 0.654 0.512833
## ehrenfeld -0.0326015 0.0189503 -1.720 0.085386 .
## short -0.0781991 0.0359403 -2.176 0.029585 *
## grüne -0.0233250 0.0344461 -0.677 0.498324
## fast -0.0285599 0.0293243 -0.974 0.330106
## per -0.0590277 0.0296820 -1.989 0.046756 *
## directly -0.0150191 0.0288145 -0.521 0.602210
## train -0.0271745 0.0194260 -1.399 0.161872
## bath 0.0429965 0.0298548 1.440 0.149836
## esstisch 0.0734508 0.0358507 2.049 0.040499 *
## gleich 0.0073776 0.0335368 0.220 0.825885
## size -0.0096066 0.0290500 -0.331 0.740882
## habt 0.0119305 0.0337933 0.353 0.724060
## süd 0.0226630 0.0323521 0.701 0.483619
## rent 0.0174866 0.0267008 0.655 0.512536
## fragen -0.0261660 0.0340773 -0.768 0.442593
## easily 0.0084806 0.0348679 0.243 0.807837
## friendly -0.0757337 0.0280642 -2.699 0.006971 **
## allee -0.0630236 0.0301682 -2.089 0.036718 *
## ganz 0.0021061 0.0237195 0.089 0.929248
## innerhalb -0.0481883 0.0288435 -1.671 0.094805 .
## kannst -0.0732117 0.0278779 -2.626 0.008644 **
## check -0.0180616 0.0219445 -0.823 0.410486
## stop -0.0002318 0.0325200 -0.007 0.994313
## gästezimmer -0.1661195 0.0232591 -7.142 9.62e-13 ***
## museum 0.0386519 0.0280490 1.378 0.168220
## bar 0.0641080 0.0321427 1.994 0.046117 *
## frankfurter 0.0249871 0.0335047 0.746 0.455813
## natürlich -0.0343196 0.0314490 -1.091 0.275169
## eigenes -0.0146029 0.0330543 -0.442 0.658652
## stuttgart -0.0633765 0.0204877 -3.093 0.001982 **
## frankfurt 0.0048621 0.0142180 0.342 0.732381
## dresden -0.1346446 0.0196418 -6.855 7.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4526 on 14880 degrees of freedom
## Multiple R-squared: 0.3557, Adjusted R-squared: 0.3304
## F-statistic: 14.02 on 586 and 14880 DF, p-value: < 2.2e-16
F-statistic: I can reject the null hypothesis that all of the regression coefficients are equal to zero (p-value < 0.01).
Adjusted R-squared: About \(33\%\) of the variance of the log price can be explained by our model.
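As a quick consistency check, both statistics can be recomputed from the values reported in the summary output. The sketch below assumes the standard OLS formulas and a model with an intercept, so that the number of observations is the residual degrees of freedom plus the 586 regressors plus one; the variable names are illustrative only.

```r
# Values copied from the summary output above
r2     <- 0.3557   # Multiple R-squared
p      <- 586      # number of regressors (numerator df of the F-test)
df_res <- 14880    # residual degrees of freedom
n      <- df_res + p + 1   # observations, assuming a model with intercept

# Adjusted R-squared penalizes R-squared for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res

# F-statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / df_res)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 3)
```

Both values should agree with the summary output up to the rounding of the reported R-squared.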
Coefficients
The inevitable multicollinearity makes individual parameters difficult to interpret. However, it is still a good exercise to look at the most important coefficients to see if they make intuitive sense in the context of a particular application. “Most important” can be defined in a number of ways. Here, I will rank the estimated coefficients by their absolute value.
The plot below shows all terms with a p-value < 0.01.
library(broom)
# Keep significant terms (p < 0.01), drop the intercept, and flag the sign of each estimate
p <- tidy(fit.lm) %>%
  filter(p.value < 0.01) %>%
  filter(term != "(Intercept)") %>%
  mutate(pos = factor(ifelse(estimate >= 0, 1, 0))) %>%
  #top_n(20,estimate) %>%
  ggplot(aes(reorder(term, estimate), estimate,
             fill = pos)) +
  geom_col(show.legend = FALSE, alpha = 0.8) +
  coord_flip() +
  scale_fill_manual(values = c(col[1], col[2])) +
  # x is the term axis (drawn vertically after coord_flip), y holds the estimates
  labs(x = "", y = "Estimate", title = "Coefficients with p<0.01")
ggsave("../figs/coefficientslm.png", p, height = 14)
par(mfrow=c(2,2))
plot(fit.lm)
library(glmnet)
Ridge regression is a regularization method that tries to avoid overfitting. Like OLS, ridge minimizes the residual sum of squares, but it adds a “shrinkage” penalty that pushes coefficients with a minor contribution to the response toward zero. Ridge regression minimizes the following expression:
\[ \text{RSS}(\beta) + \lambda \sum^p_{j=1}\beta^2_j \]
From this we can easily see that:
\(\lambda = 0\): coefficients equal OLS coefficients
\(\lambda \to \infty\): coefficients approach zero
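The two limiting cases can be checked numerically with the closed-form ridge estimator \((X'X + \lambda I)^{-1}X'y\), which minimizes the expression above. This is a minimal sketch on simulated data; the data, the helper `ridge_coef` and all object names are hypothetical and not part of the analysis. Note also that glmnet scales its objective differently, so the \(\lambda\) values here are not directly comparable to those reported by `cv.glmnet`.

```r
set.seed(1)
n <- 100; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))        # standardized simulated predictors
y <- drop(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)                              # centered response, so no intercept is needed

# Closed-form ridge estimator: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- unname(coef(lm(y ~ X - 1)))     # OLS without intercept
beta_zero  <- ridge_coef(X, y, lambda = 0)    # lambda = 0 reproduces OLS
beta_large <- ridge_coef(X, y, lambda = 1e6)  # very large lambda: shrunk almost to zero

rbind(beta_ols, beta_zero, beta_large)
```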
#fit the model
fit.ridge <- cv.glmnet(x.train, y.train, family='gaussian', alpha=0)
The code above performs 10-fold cross-validation to choose the best \(\lambda\) and fits a linear regression (family = “gaussian”).
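To make the cross-validation step concrete, the following sketch selects \(\lambda\) by 10-fold cross-validation in base R. It is an illustration only: it uses simulated data, a hypothetical candidate grid and the closed-form ridge estimator, and it is not how `cv.glmnet` is implemented internally.

```r
set.seed(42)
n <- 200; p <- 10
X <- scale(matrix(rnorm(n * p), n, p))                 # simulated predictors
y <- drop(X %*% c(1.5, -1, rep(0, p - 2)) + rnorm(n))  # simulated response

# Ridge fit; the intercept is handled by centering on the training fold
ridge_fit <- function(X, y, lambda) {
  mu_x <- colMeans(X); mu_y <- mean(y)
  Xc   <- sweep(X, 2, mu_x)
  beta <- drop(solve(crossprod(Xc) + lambda * diag(ncol(X)),
                     crossprod(Xc, y - mu_y)))
  list(beta = beta, intercept = mu_y - sum(mu_x * beta))
}

k       <- 10
folds   <- sample(rep(seq_len(k), length.out = n))  # random fold assignment
lambdas <- 10^seq(-2, 3, length.out = 20)           # hypothetical candidate grid

# Mean squared prediction error on the held-out fold, averaged over the folds
cv_mse <- sapply(lambdas, function(lam) {
  mean(sapply(seq_len(k), function(f) {
    test <- folds == f
    fit  <- ridge_fit(X[!test, , drop = FALSE], y[!test], lam)
    pred <- fit$intercept + drop(X[test, , drop = FALSE] %*% fit$beta)
    mean((y[test] - pred)^2)
  }))
})

lambda_min <- lambdas[which.min(cv_mse)]  # plays the role of fit.ridge$lambda.min
lambda_min
```

Each candidate \(\lambda\) is scored only on observations that were not used for fitting, which is what guards the choice against overfitting.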
#Results
plot(fit.ridge)
Minimum \(\lambda\):
lam_min <- fit.ridge$lambda.min
lam_min
## [1] 0.09945955
Like ridge regression, Lasso tries to avoid overfitting. The difference is that it uses the L1 norm to penalize large coefficients (Lasso is also known as L1 regularization):
\[ \text{RSS}(\beta) + \lambda \sum^p_{j=1}|\beta_j| \]
Lasso can be used to perform variable selection, as it can shrink some of the coefficients to exactly zero.
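Why the L1 penalty produces exact zeros is easiest to see in the special case of an orthonormal design (\(X'X = I\)), where both penalized estimators have a closed form in terms of the OLS estimate. The sketch below assumes that special case; the coefficient values and \(\lambda\) are hypothetical.

```r
# Hypothetical OLS estimates under an orthonormal design (X'X = I)
b_ols  <- c(2.0, -0.8, 0.3, -0.1)
lambda <- 1

# Lasso: soft-thresholding, minimizes (b - beta)^2 + lambda * |beta| per coefficient
b_lasso <- sign(b_ols) * pmax(abs(b_ols) - lambda / 2, 0)

# Ridge: proportional shrinkage, minimizes (b - beta)^2 + lambda * beta^2 per coefficient
b_ridge <- b_ols / (1 + lambda)

data.frame(ols = b_ols, lasso = b_lasso, ridge = b_ridge)
```

Lasso sets every coefficient with \(|b| \le \lambda/2\) exactly to zero, whereas ridge only scales coefficients down and never removes them.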
# Fitting the model (Lasso: Alpha = 1)
fit.lasso <- cv.glmnet(x.train, y.train, family='gaussian',
alpha=1)
The code above performs 10-fold cross-validation to choose the best \(\lambda\) and fits a linear regression (family = “gaussian”).
#Results
plot(fit.lasso)
Minimum \(\lambda\):
lam_min <- fit.lasso$lambda.min
lam_min
## [1] 0.002832644
rmse <- function(error) {
sqrt(mean(error^2))
}
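A small usage example of the helper, with hypothetical numbers that are not taken from the data. Because the target is the log price, the resulting RMSE is expressed in log units.

```r
rmse <- function(error) {
  sqrt(mean(error^2))
}

actual    <- c(4.0, 4.5, 5.0)   # hypothetical log prices
predicted <- c(4.2, 4.4, 4.7)   # hypothetical predictions

rmse(actual - predicted)        # square root of the mean squared error
```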
How well do the models perform when predicting the test data?
Linear Regression
pred.lm <- as.data.frame(predict(fit.lm,
newdata = df.test %>% select(-log_price)))
# Combine predictions with test dataframe
df.test$pred.lm <- pred.lm[,1]
df.test$error.lm <- df.test$log_price - df.test$pred.lm
rmse.lm <- rmse(df.test$error.lm)
print(paste0("The RMSE of the Linear Regression is: ",
rmse.lm))
## [1] "The RMSE of the Linear Regression is: 0.474996653299447"
Ridge Regression
# Note: predict() on a cv.glmnet object defaults to s = "lambda.1se", not lambda.min
pred.ridge <- as.data.frame(predict(fit.ridge, x.test))
# Combine predictions with test dataframe
df.test$pred.ridge <- pred.ridge[,1]
df.test$error.ridge <- df.test$log_price - df.test$pred.ridge
rmse.ridge <- rmse(df.test$error.ridge)
print(paste0("The RMSE of the Ridge Regression is: ",
rmse.ridge))
## [1] "The RMSE of the Ridge Regression is: 0.475314709569063"
LASSO
# Note: predict() on a cv.glmnet object defaults to s = "lambda.1se", not lambda.min
pred.lasso <- as.data.frame(predict(fit.lasso, x.test))
# Combine predictions with test dataframe
df.test$pred.lasso <- pred.lasso[,1]
df.test$error.lasso <- df.test$log_price - df.test$pred.lasso
rmse.lasso <- rmse(df.test$error.lasso)
print(paste0("The RMSE of the LASSO is: ",
rmse.lasso))
## [1] "The RMSE of the LASSO is: 0.475182995049206"
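To compare the three models side by side, the reported values can be collected in a small table sorted by RMSE. The numbers are copied from the outputs above; the object and column names are illustrative.

```r
# RMSE values copied from the outputs above
rmse_tab <- data.frame(
  model = c("Linear Regression", "Ridge", "LASSO"),
  rmse  = c(0.474996653299447, 0.475314709569063, 0.475182995049206)
)

# Sort from lowest to highest RMSE and show the gap to the lowest value
rmse_tab <- rmse_tab[order(rmse_tab$rmse), ]
rmse_tab$diff_to_lowest <- rmse_tab$rmse - min(rmse_tab$rmse)
rmse_tab
```

The gaps between the models only show up in the fourth decimal place.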
A plot of the predicted values against the actual values shows the explanatory power of the prediction models.
df.test %>%
select(pred.lm, pred.ridge, pred.lasso,
log_price) %>%
tidyr::gather(model, "predictions", pred.lm:pred.lasso) -> plot
p <- df.test %>%
ggplot(aes(pred.lm, log_price)) +
geom_point(alpha = 0.8) +
geom_smooth(method = lm) +
labs(x="Predicted y", y="Actual y",
title = "Predicted vs. True Values",
subtitle = "OLS with Text Data",
caption = paste0("RMSE: ", round(rmse.lm,3)))
ggsave("../figs/residplot2.png", p)
p <- df.test %>%
ggplot(aes(pred.ridge, log_price)) +
geom_point(alpha = 0.8) +
geom_smooth(method = lm) +
labs(x="Predicted y", y="Actual y",
title = "Predicted vs. True Values",
subtitle = "Ridge Reg with Text Data",
caption = paste0("RMSE: ", round(rmse.ridge,3)))
ggsave("../figs/residplot3.png", p)
p <- df.test %>%
ggplot(aes(pred.lasso, log_price)) +
geom_point(alpha = 0.8) +
geom_smooth(method = lm) +
labs(x="Predicted y", y="Actual y",
title = "Predicted vs. True Values",
subtitle = "LASSO with Text Data",
caption = paste0("RMSE: ", round(rmse.lasso,3)))
ggsave("../figs/residplot4.png", p)
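Based on the RMSE, the three models perform almost identically on the test data (about 0.475 each) when I use the description text as exogenous variables. The unpenalized linear regression has a marginally lower RMSE (0.4750) than LASSO (0.4752), which in turn is slightly ahead of ridge (0.4753), so penalization yields no gain here. Compared to the regression with structural variables from part two, LASSO performs worse.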