Exploratory Data Analysis & Booking Cancelation Prediction on Hotel Booking Demands Datasets

Online booking is one of the latest breakthroughs in the hospitality industry, but the booking cancellations that come with it have a negative impact on hotels. To anticipate and reduce booking cancellations, we developed a booking cancellation prediction model for hotels using interpretable machine learning algorithms. Random Forest and the Extra Tree Classifier share the highest precision, while Random Forest has the highest recall, identifying 79% of actual positive observations. These results show that it is possible to predict booking cancellations with high accuracy. They can also help hotel owners and managers make better forecasts, improve cancellation policies, and create new business tactics.


Introduction
Revenue management is defined as the "application of information systems and pricing techniques to assign the correct resources at the right time to the right customer at the right price" [1]. Originally developed in the aviation industry in 1966, it has increasingly been adopted in other service sectors, such as rental cars, golf courses, and, in particular, hotels [2] [1]. The hospitality industry updated this definition to: "make available the right space for the right guest and the right price via the right distribution channel at the right time" [3]. To achieve this, hotels accept reservations online. A reservation is a contract between the customer and the hotel that gives the customer the right to use the facility in the future at a mutually agreed price. Usually, until the service is provided, the customer has the option to cancel the reservation. However, this option to cancel before the service is provided places all the risk on the hotel, which must guarantee the availability of the room to the customer who made the reservation.
Because reservations normally allow guests to cancel a room, with or without penalty, until the products and services are provided, hotels bear the risk of guaranteeing rooms for customers who honor their reservations. At the same time, hotels bear the cost of vacant rooms when guests cancel or do not show up [4]. To minimize this risk, hotels introduce overbooking and restrictive cancellation policies [5]. Yet both overbooking and strict cancellation policies can damage hotel performance. Overbooking may compel the hotel to deny service to a client. This can be a really poor customer experience, which can lead to negative online reviews and damage the hotel's social reputation [6]. Relocation may also take the client to a hotel he or she may prefer, causing the original hotel to lose that client's future reservations [3]. [4,5,7].
Most studies focus primarily on the airline industry, which is quite different from the hospitality market. Lately, however, the number of studies related to the hospitality industry has increased. Most of this research used conventional statistical methods, and only a few studies took advantage of machine learning methods and techniques. Despite the many studies on the subject, currently only four are specific to the hospitality industry [8,9,10,11]. Most studies treat the prediction of booking cancellations as a regression problem, focusing on forecasting the global cancellation rate rather than the risk of each individual booking being cancelled; only a couple of the recently published studies address it as a classification problem. In fact, Morales and Wang [11] stated that it is difficult to believe that one can determine with high precision whether or not a booking will be canceled. António, however, showed that the probability of booking cancellations can be predicted with high precision [8] [12].
The percentage of all bookings expected to cancel can be deducted from demand to calculate the hotel's net demand, i.e. demand minus the bookings that are likely to cancel. With a concrete net demand value, a hotel's revenue manager can make sound decisions about demand control and develop overbooking strategies and cancellation policies.

Literature Review
Mehrotra [3] described good demand forecasting as a key aspect of revenue management. Talluri [13] also acknowledged the importance of forecasting, stating that revenue management systems require quantity forecasts and that their performance depends critically on the quality of these forecasts. Authors such as Ivanov [14] and Morales [11] described demand forecasting as one of the areas where forecasting is relevant. Booking cancellations drive this need to forecast demand: in the hospitality sector, as in other service sectors that deal with advance bookings, the bookings on record do not reflect actual demand because a large number of them are cancelled.
Booking cancellations are a common problem in revenue management, particularly in the service and hospitality industries. In recent years, the Internet has increasingly changed the way customers search for and buy travel services. Research on the controls used to mitigate the effect of cancellations on sales and inventory, such as cancellation policies and overbooking, has generally increased in this area. However, there seem to be few studies on predicting booking cancellations in the hospitality industry.

Exploratory Data Analysis (EDA) & Feature Engineering
Exploratory data analysis (EDA) is a set of techniques for fitting linear and higher-order functions to relationships, structuring and transforming variables with arithmetic functions, splitting relationships into partitions and clusters, and extracting features from statistical results. Typical EDA outputs include simple histograms that describe discrete and continuous variables, schematic plots that show general and partial characteristics of a relationship, simplified functions fitted to low-dimensional relationships, and two-way tables such as contingency tables.
In this part, we visualize some features and show their statistical relationship with the target variable. This analysis gives an overall view of the data and deeper familiarity with it, and helps to detect extreme values and identify obvious errors.
The first graph (Fig. 1) explores the hotel feature, which denotes the type of hotel. According to this graph, approximately 34% of the bookings were for resort hotels and the rest were for city hotels.
According to the arrival month graph (Fig. 4), August is the busiest month and January is the least occupied month, about half as busy as August.
Other important time-related features are stays_in_week_nights and stays_in_weekend_nights. The table below shows the relationship between these two features. It reveals some missing data: 715 bookings have zero entered for both weekend and week nights. However, this amount of missing data is small enough to neglect.
Another exploratory analysis (Fig. 8) dives deeper into the relationship between ADR, arrival month, and booking cancellation status, building on the earlier graph of arrival month (Fig. 4).
The last graph is about the relationship between special requests and booking cancellation status. Nearly half of the bookings without any special requests were canceled, and the other half were not.
Figure 9. Total special requests vs. booking cancellation status
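The checks described in this section can be sketched with pandas as follows. This is a minimal illustration on an invented toy table, not the study's actual code; the column names hotel and total_of_special_requests are assumptions, while stays_in_week_nights, stays_in_weekend_nights and is_canceled follow the features named in the text.

```python
import pandas as pd

# Toy stand-in for the bookings data; all values are invented for illustration.
bookings = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "City Hotel", "City Hotel"],
    "stays_in_week_nights": [2, 0, 3, 0, 1, 5],
    "stays_in_weekend_nights": [1, 0, 0, 2, 0, 2],
    "total_of_special_requests": [0, 0, 1, 2, 0, 1],  # assumed column name
    "is_canceled": [1, 0, 0, 0, 1, 0],
})

# Share of bookings per hotel type (the quantity behind a chart like Fig. 1).
hotel_share = bookings["hotel"].value_counts(normalize=True)

# Bookings with zero nights entered for both week nights and weekend nights.
zero_night_mask = (
    (bookings["stays_in_week_nights"] == 0)
    & (bookings["stays_in_weekend_nights"] == 0)
)
zero_night_count = int(zero_night_mask.sum())

# Cancellation rate grouped by number of special requests (cf. Figure 9).
cancel_rate_by_requests = (
    bookings.groupby("total_of_special_requests")["is_canceled"].mean()
)

print(hotel_share)
print(zero_night_count)
print(cancel_rate_by_requests)
```

On real data, the same three expressions would give the hotel-type split, the count of zero-night records, and the cancellation share per special-request level.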

Dealing with missing data
First, the data is checked for missing values. For example, 94% of the company feature is missing, so this feature is eliminated. By contrast, the children and all_children features have only 4 missing values, which are replaced with zero. Missing values also occur in the country and agent features. Since less than 1% of the country values are missing, they are replaced with the most frequent value. The agent feature has more missing values than country; these are imputed as 0.
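The imputation rules above can be sketched with pandas as follows. This is a minimal sketch on an invented four-row table, assuming the data is held in a DataFrame; the values and the variable names bookings and cleaned are illustrative and not taken from the study.

```python
import numpy as np
import pandas as pd

# Toy table with gaps in the features discussed above (invented values).
bookings = pd.DataFrame({
    "company": [np.nan, np.nan, 40.0, np.nan],
    "children": [0.0, np.nan, 2.0, 1.0],
    "country": ["PRT", None, "PRT", "GBR"],
    "agent": [9.0, np.nan, np.nan, 240.0],
})

# Fraction of missing values per feature, used to decide how to treat each one.
missing_share = bookings.isna().mean()

# Drop the mostly empty feature.
cleaned = bookings.drop(columns=["company"])

# Replace the few missing children values with zero.
cleaned["children"] = cleaned["children"].fillna(0)

# Replace missing countries with the most frequent value.
most_frequent_country = cleaned["country"].mode().iloc[0]
cleaned["country"] = cleaned["country"].fillna(most_frequent_country)

# Impute missing agent values as 0.
cleaned["agent"] = cleaned["agent"].fillna(0)

print(missing_share)
print(cleaned)
```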
The next step is analyzing the categorical features. Categorical labels are converted into numerical form, which makes them easier to understand and to use in machine learning algorithms. Some features, such as country, are not ordinal. In that case, One-Hot Encoding could be chosen, but because of the high number of categories this method could incur a higher computational cost. To reduce that cost, Label Encoding is used instead.
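The trade-off between the two encodings can be illustrated with a short pandas sketch. It assumes a country column of string codes; the five sample values are invented, and pandas category codes are used here as one possible way to label-encode, which may differ from the encoder used in the study.

```python
import pandas as pd

# Invented sample of a non-ordinal categorical feature.
countries = pd.Series(["PRT", "GBR", "FRA", "PRT", "ESP"], name="country")

# Label encoding: a single integer column; codes follow sorted category order.
label_encoded = countries.astype("category").cat.codes

# One-hot encoding: one indicator column per distinct category.
one_hot = pd.get_dummies(countries, prefix="country")

print(label_encoded.tolist())
print(one_hot.shape)
```

With a feature that has many distinct values, the one-hot table grows by one column per value, whereas label encoding always yields one column. The integer codes do impose an arbitrary order on the categories, which is the price paid for the lower dimensionality.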
After encoding the categorical data, two data frames are created: one containing only the categorical data and the other only the numerical data. These two data frames are used to create correlation matrices, using the Spearman method for the categorical data and the Pearson method for the numerical data. The correlation matrices show the positive and negative relationships between features. In the two heatmaps, the reservation_status feature stands out because of its strong negative correlation with the is_canceled feature. The table below shows this relationship in detail. Such a high correlation can cause wrong predictions or overfitting. To prevent this, the reservation_status feature is eliminated. There is also a strong relationship between the children and all_children features, since all_children is composed of the children and babies features. Therefore, the children feature is eliminated too. The last feature (Table 3) is reservation_status_date. Since this feature contains date-type data that could not be converted to another type, it is eliminated as well.
To make the training part more accurate, hyperparameter tuning is fixed before the training process. Another important step is building a Permutation Feature Importance graph with the Extreme Gradient Boosting algorithm. This technique calculates feature importance on the basis of a chosen performance metric, here the accuracy score. The graph helps to understand each feature's contribution to the prediction, provides insight into the dataset, and helps to find unimportant features, if any. The graph (Fig. 12) shows the importance of each feature. According to it, 1 of the 29 features, babies, is not important to the prediction and is eliminated.
• Accuracy: the ratio of correct predictions to total predictions. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, its formula is

(TP+TN)/(TP+TN+FP+FN)

According to this metric, Random Forest has the highest share of correct predictions, with 88%. The other performance metrics are explained below:
• Precision: the ratio of correctly predicted positive observations to the total predicted positive observations. Its formula is

TP/(TP+FP)
• Recall: the ratio of correctly predicted positive observations to the actual positive observations. Its formula is

TP/(TP+FN)

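As a small worked example, the three metrics can be computed directly from confusion-matrix counts: accuracy as (TP+TN)/(TP+TN+FP+FN), precision as TP/(TP+FP) and recall as TP/(TP+FN). The function name classification_metrics and the counts below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Guard against empty denominators so the function never divides by zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts chosen only to illustrate the arithmetic.
accuracy, precision, recall = classification_metrics(tp=80, tn=100, fp=10, fn=20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these counts, 80 of the 90 positive predictions are correct (precision) and 80 of the 100 actual positives are found (recall), which shows how the two metrics can differ for the same model.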
Results and Discussion
As the summary table shows, Random Forest and the Extra Tree Classifier share the highest precision: around 88% of the observations that either model labelled as positive were actually positive. Random Forest, on the other hand, has the highest recall: it correctly identified 79% of the actual positive observations.
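A comparison of this kind can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data generated by make_classification, not the hotel dataset, and the hyperparameters shown are placeholders rather than the tuned values used in the study, so the scores it prints are not comparable to those reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the encoded bookings.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Placeholder hyperparameters; the study tunes these before training.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

# Fit each model and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
    }

for name, metrics in scores.items():
    print(name, metrics)
```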
These results confirm that bookings with a high risk of being cancelled can be detected. This allows hotel management to take steps to deter possible cancellations, such as offering improved facilities, discounts, or other incentives. However, since some customers are insensitive to this sort of offer, it should not be extended to all customers. There is also much still to be learned from building and deploying this prediction model. More relevant results, such as the number of room nights expected to be cancelled in each of the following weeks, can be obtained by running the model against all bookings every day. Hotel operators estimating their net demand can deduct this value from their demand. With a more specific value of net demand, hotel managers can build stronger overbooking and cancellation strategies, resulting in lower costs and reduced risk.
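The net demand calculation can be illustrated with a short sketch. It assumes that the model outputs a cancellation probability per booking and that expected cancelled room nights are obtained by weighting each booking's room nights by that probability; this weighting, the function names and the three sample bookings are assumptions made for illustration only.

```python
def expected_cancelled_room_nights(bookings):
    """Room nights weighted by each booking's predicted cancellation probability."""
    return sum(b["room_nights"] * b["cancel_probability"] for b in bookings)


def net_demand(bookings):
    """Total booked room nights minus the room nights expected to be cancelled."""
    total_room_nights = sum(b["room_nights"] for b in bookings)
    return total_room_nights - expected_cancelled_room_nights(bookings)


# Hypothetical bookings for one week, with invented model outputs.
week_bookings = [
    {"room_nights": 3, "cancel_probability": 0.10},
    {"room_nights": 2, "cancel_probability": 0.80},
    {"room_nights": 5, "cancel_probability": 0.50},
]

print(round(expected_cancelled_room_nights(week_bookings), 2))
print(round(net_demand(week_bookings), 2))
```

Here 10 room nights are booked, 4.4 are expected to be cancelled, and the resulting net demand is 5.6 room nights.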
As with every other problem in predictive analytics, designing a model for forecasting booking cancellations requires data that meets all the characteristics of quality data: precise, consistent, neutral, relevant, acceptable, and timely [15][16]. As previously stated, some of the data sets had variables with incorrect values. This lack of consistency may influence model output. For this reason, hotels that wish to build prediction models need to ensure that a data quality policy is in effect.

Conclusion
This study aims to explore some of the uses of predictive analytics from a scientific point of view, including identifying which features contribute to predicting the likelihood of booking cancellation. The use of data visualization and data analytics techniques, along with the shared knowledge filter, made it possible to recognize the predictive significance of each feature. Features were found to vary in importance depending on the hotel, and certain features are not needed for some hotels.
By developing a model to identify bookings that are likely to be cancelled, and thereby building a stronger net demand forecast, we came to realize that one model cannot simply be extended to all hotels. Building the model showed that features have different weights and significance depending on the hotel, meaning that no single model fits all hotels and each hotel needs its own version. These forecasting models allow business owners to reduce the revenue lost to booking cancellations and to minimize the risks associated with overbooking (relocation costs, cash or service compensation, and, particularly important today, social reputation costs). Booking cancellation models also allow hotel managers to enforce less strict reservation policies without increasing uncertainty.