HU Berlin Statistic Presentation
Dennis Köhn
Last Updated
7 years ago
Creative Commons CC BY 4.0
Example slides for a presentation at the Chair of Statistics at HU Berlin.
\frametitle{Formal Problem Setting}
\item \textit{training set}: inputs $X = (x_1,\dots,x_n) \in \mathbb{R}^{n \times d}$ and labels $Y = (y_1,\dots,y_n) \in \mathbb{R}^{n}$
\item \textit{test set}: inputs $X' = (x'_1,\dots,x'_t) \in \mathbb{R}^{t \times d}$ without labels
Find a function
\[
  f: X \rightarrow Y
\]
s.t. the \textit{test set} labels are predicted as accurately as possible, i.e.
\[
  f(X') \approx Y'
\]
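Assuming accuracy is measured by the root mean squared error (RMSE), as on the results slide, and writing $y'_i$ for the unobserved test labels (notation assumed here for illustration), the quantity to keep small can be sketched as
\[
  \mathrm{RMSE}(f) = \sqrt{\frac{1}{t} \sum_{i=1}^{t} \bigl( f(x'_i) - y'_i \bigr)^{2}}
\]
Since the $y'_i$ are not available during training, this quantity has to be estimated, e.g. by cross-validation on the \textit{training set}.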
\item Introduction \quad \checkmark
\item Pre-processing Steps
\item Model Selection
\item Variable Importance and Dimensionality Reduction
\item Results and Conclusion
Several transformations and cleaning steps are needed before the data can be passed to an algorithm, e.g.
\caption{Workflow of Pre-Processing Steps}
All transformations need to be performed on the test set as well!
basicstyle=\tiny, %or \small or \footnotesize etc.
# Pre-process the combined (train + test) inputs, then split them again.
basic_preprocessing = function(X_com, y, scaler = "gaussian") {
    X_ratings      = replace_ratings(X_com)
    X_imputed      = naive_imputation(X_ratings)
    X_no_outlier   = data.frame(lapply(X_imputed, iqr_outlier))
    X_time_encoded = include_quarter_dummies(X_no_outlier)
    X_scaled       = scale_data(X_time_encoded, scale_method = scaler)
    X_encoded      = data.frame(lapply(X_scaled, cat_to_dummy))
    X_com          = delect_nz_variable(X_encoded)
    # The first length(y) rows of X_com belong to the training set.
    idx_train = seq_along(y)
    train     = cbind(X_com[idx_train, ], y)
    test      = X_com[-idx_train, ]
    return(list(train = train, X_com = X_com, test = test))
}
\quantnet \href{}{dataProcessing}
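The requirement that every transformation also reaches the test set can be illustrated with a minimal Python sketch (the slides' own code is R). It follows the same stack, transform, split pattern as `basic_preprocessing`. The helper names `iqr_clip`, `standardize` and `preprocess`, the 1.5 IQR clipping rule, and the toy numbers are assumptions made for illustration; they are not the behaviour of the original `iqr_outlier` or `scale_data`.

```python
import statistics


def iqr_clip(column, factor=1.5):
    """Clip values lying more than factor * IQR outside the quartiles (assumed rule)."""
    q1, _, q3 = statistics.quantiles(column, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, low), high) for v in column]


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        return [0.0 for _ in column]
    return [(v - mean) / sd for v in column]


def preprocess(train_col, test_col):
    """Stack train and test, transform once, then split back by position."""
    combined = list(train_col) + list(test_col)
    combined = standardize(iqr_clip(combined))
    n_train = len(train_col)
    return combined[:n_train], combined[n_train:]


# Toy column with one extreme value in the training part (made-up numbers).
train_scaled, test_scaled = preprocess([1.0, 2.0, 3.0, 4.0, 100.0], [2.5, 3.5])
```

Because both parts pass through the same functions in one call, the test rows cannot end up on a different scale than the training rows. An alternative design would estimate the quartiles, mean and standard deviation on the training rows only and reuse them for the test rows.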
\section{Model Selection}
\item Introduction \quad \checkmark
\item Pre-processing Steps \quad \checkmark
\item Model Selection
\item Variable Importance and Dimensionality Reduction
\item Results and Conclusion
\frametitle{Optimizing Hyper-parameters}
\ForEach{$i$ in $1:t$}{
    Randomly split the data into $k$ folds of the same size \\
    \ForEach{$j$ in $1:k$}{
        Use the $j$th fold as test set and the union of the remaining folds as training set \\
        \ForEach{$p$ in $1:grid$}{
            Fit the model on the training set using parameter set $p$ \\
            Predict on the test set and calculate the RMSE
        }%end inner for
    }%end middle for
}%end outer for
\ForEach{$p$ in $1:grid$}{
    Calculate the average RMSE over the $t \times k$ runs
}
Choose the $p$ with the lowest average RMSE
\caption{$t$-times repeated $k$-fold cross-validation with grid search}
\quantnet \href{}{xgbTuning}
\quantnet \href{}{rfTuning}
\quantnet \href{}{svmTuning}
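The tuning loop above can be sketched in a few lines of Python (the linked tuning scripts themselves are not reproduced here). To keep the example self-contained, a one-dimensional ridge regression without intercept stands in for the actual models, and the penalty `lam` plays the role of the parameter set $p$; the function names, the synthetic data and the grid values are all assumptions for illustration.

```python
import math
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for y ~ w * x (no intercept); toy stand-in model."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def repeated_cv_grid_search(xs, ys, grid, t=3, k=5, seed=0):
    """t-times repeated k-fold CV; returns (best parameter, average RMSE per parameter)."""
    rng = random.Random(seed)
    scores = {p: [] for p in grid}
    indices = list(range(len(xs)))
    for _ in range(t):
        rng.shuffle(indices)                       # new random split per repetition
        folds = [indices[j::k] for j in range(k)]  # k folds of (almost) equal size
        for j in range(k):
            test_idx = folds[j]
            train_idx = [i for m, fold in enumerate(folds) if m != j for i in fold]
            x_tr = [xs[i] for i in train_idx]
            y_tr = [ys[i] for i in train_idx]
            for p in grid:
                w = fit_ridge_1d(x_tr, y_tr, p)
                preds = [w * xs[i] for i in test_idx]
                scores[p].append(rmse([ys[i] for i in test_idx], preds))
    # Average over all t * k runs and pick the parameter with the lowest value.
    avg = {p: sum(v) / len(v) for p, v in scores.items()}
    best = min(avg, key=avg.get)
    return best, avg


# Synthetic data: y = 2x plus small noise (made up for the example).
data_rng = random.Random(1)
xs = [i / 10 for i in range(1, 51)]
ys = [2 * x + data_rng.gauss(0, 0.1) for x in xs]
best_lam, avg_rmse = repeated_cv_grid_search(xs, ys, grid=[0.0, 1.0, 100.0])
```

Repeating the split $t$ times reduces the dependence of the chosen parameter on one particular random partition, at the cost of $t$ times more model fits.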
\frametitle{Taking on the Curse of Dimensionality}
\item many variables (99 after pre-processing)
\item small training set ($n = 1460$)
\item variables are correlated with each other
Our approaches:
\item Variable selection through variable importance ranking
\item Extract a smaller set of variables using PCA
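As a rough illustration of the PCA approach, the following Python sketch extracts the first principal component of two strongly correlated toy variables. Power iteration is swapped in for a full eigendecomposition so that the example needs only the standard library; the function names and the data are assumptions for illustration and not the implementation used for the slides.

```python
import math


def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(d)]
    return [[r[c] - means[c] for c in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)] for a in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector of cov via power iteration (unit length)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(rows, v):
    """Score of each centred row on the direction v."""
    return [sum(r[c] * v[c] for c in range(len(v))) for r in rows]


# Two strongly correlated toy variables: the second is about twice the first.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
Xc = center(X)
pc1 = first_component(covariance(Xc))
scores = project(Xc, pc1)  # one derived variable replacing the two originals
```

Here a single score per observation carries almost all of the variation of the two original columns, which is the sense in which PCA yields a smaller set of variables when the inputs are correlated.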
\section{Results and Conclusion}
\item Introduction \quad \checkmark
\item Pre-processing Steps \quad \checkmark
\item Model Selection \quad \checkmark
\item Variable Importance and Dimensionality Reduction \quad \checkmark
\item Results and Conclusion
\item Gaussian SVR with all variables is the single best model
\item PCA did not work well
\item Models perform best with the full set of variables, as Figure \ref{fig:RFE} suggested
Inputs & Gaussian SVR & Random Forest & GBM \\
All Variables & \textbf{0.1308} & 0.1484 & 0.1333 \\
Top 30 & 0.1323 & 0.1515 & 0.1436 \\
PCA & 0.1607 & 0.1657 & 0.1657 \\
\caption{RMSE of submitted predictions}
\hspace{7.2cm} \href{}{GitHub: finalModels}
\item Introduction \quad \checkmark
\item Pre-processing Steps \quad \checkmark
\item Model Selection \quad \checkmark
\item Variable Importance and Dimensionality Reduction \quad \checkmark
\item Results and Conclusion \quad \checkmark
