Influence of Social Overhead Capital Facilities on Housing Prices Using Machine Learning
Round 1
Reviewer 1 Report
The paper presents a machine learning-based method for forecasting housing prices, relying on the properties' vicinity to station proximity areas and Social Overhead Capital (SOC) facilities. The method and its reasoning are well described and motivated. There are, however, some minor issues:
1. The abbreviation SOC is used before it is introduced later in the Introduction (indeed, it is also used in the title) - this is confusing.
2. The terms `station areas' and `station proximity areas' are used interchangeably - please unify the two terms throughout the manuscript. Also consider using an abbreviation (e.g., SPA).
3. In lines 112--113, it is mentioned that the work in [46] has achieved a `commendable rank on Kaggle.com's public leaderboard' - why is this important?
4. In line 145, the term `Living Social Overhead Capital (SOC)' is used, while in line 168, the term `Traffic Social Overhead Capital (SOC)' is used. Are there different categories of SOC? These should be listed out for reference, and their abbreviations defined separately and accordingly.
5. It is stated that missing data for 2022 passenger volume is supplemented with those from the 2021 dataset. Given that there was a significant shift in governmental policies (and indeed, governments from the changing presidency in the middle of 2022), is this reasonable? Is this the only instance of data being supplemented from previous time periods?
6. Some reference should be given to the KNN imputer method and scikit-learn.
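For context, a minimal sketch of the kind of scikit-learn KNN imputation the manuscript appears to describe (the column names and the number of neighbors here are hypothetical, not taken from the paper):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical station-level features with missing 2022 passenger volumes.
features = pd.DataFrame({
    "passenger_volume_2022": [1200.0, None, 950.0, None],
    "passenger_volume_2021": [1150.0, 800.0, 900.0, 640.0],
    "near_subw_dist": [250.0, 480.0, 120.0, 700.0],
})

# KNNImputer replaces each missing value with the (uniformly weighted) mean
# of that feature over the k nearest rows, where nearness is computed on the
# features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
print(imputed)
```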
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
This paper is very interesting and well written; however, it focuses too little on the ML and on the description of the input data and results. I am also not sure whether the conclusions are accurate; in my opinion, either more data analysis is required or the conclusions should be more restrained.
I have a problem with how far the regression algorithms are machine learning ones – they are, but they are very basic ones. Therefore, in my opinion, you should enumerate the methods used in the abstract!
> 34: Predictive modeling for housing prices in South Korean station areas is a critical area of research that can provide numerous benefits.
In my opinion the authors underestimate the importance of price prediction for stations / roads (railway, motorway, traffic jams) / infrastructure (electricity, water, sewage supply) / schools. For example, this paper may be used to justify new railway and other infrastructure investments – which in my opinion is almost as important as housing / real-estate market price prediction. I do not know the tax policy in Korea, but in many countries an increase of the cadastral tax can justify government / local-authority investments in infrastructure. In some countries, residents must pay an extra tax for the real-estate price rise after new infrastructure has been built. Therefore this kind of simulation is crucial for policymakers.
> 212: Following the variable elimination, we then proceeded with the log-scaling of our target variable, ‘DLNG_AMOUNT’.
Could you please use the full name before the first use, e.g., dealing amount (DLNG_AMOUNT)?
Error evaluation:
> Consequently, we selected RMSLE (Root Mean Squared Log Error) over RMSE.
Could you comment on this? In my opinion you should use a relative RMSE (RRMSE):
RRMSE = sqrt( avg( (p_i / a_i)^2 ) )
It is very difficult to illustrate RMSLE, but to a first approximation it is a relative error. So your average relative error is 22% – it is a relatively bad prediction, and in my opinion it is the main drawback of this paper! Therefore you should try to somehow improve the results!
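For illustration, a small numeric sketch (made-up values) of how RMSLE behaves as an approximately relative error, using the scikit-learn implementation:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up actual and predicted prices in the same currency units.
actual = np.array([100_000.0, 250_000.0, 400_000.0])
predicted = np.array([120_000.0, 230_000.0, 500_000.0])

# RMSLE = sqrt( mean( (log(1 + p_i) - log(1 + a_i))^2 ) )
rmsle = np.sqrt(mean_squared_log_error(actual, predicted))

# For small deviations, log(p) - log(a) ~ (p - a) / a, so RMSLE reads roughly
# as a root-mean-square relative error; a value of 0.22 means predictions are
# off by roughly 22% on average.
print(f"RMSLE = {rmsle:.3f}")
```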
My suggestion is that you preprocess the input data better. For example:
The objective value should not be log(DLNG_AMOUNT) but the Square Meter Unit Price (SMUP), maybe log(SMUP). It is commonly known that a real estate's price is roughly proportional to its surface. The price-estimation problem might be with large-surface real estates, as they might be luxury ones – try to comment on this or show where the prediction errors are the greatest (e.g., for large-surface estates the error is the greatest, as these might be either a large (poor) family home or a luxury single-person home).
The log-scaling might be a good solution, as log(a*b) = log(a) + log(b) – and most inputs are multiplicative rather than additive. Could you comment on this?
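A minimal sketch of the suggested target transformation, assuming a pandas DataFrame; 'DLNG_AMOUNT' is the manuscript's dealing amount, while the area column name here is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical transactions: dealing amount and floor area in square meters.
df = pd.DataFrame({
    "DLNG_AMOUNT": [450_000_000.0, 820_000_000.0, 300_000_000.0],
    "AREA_M2": [59.0, 84.0, 42.0],  # hypothetical area column
})

# Square Meter Unit Price, then log-scaled: since
# log(price / area) = log(price) - log(area), multiplicative effects
# in the inputs become additive in the target.
df["SMUP"] = df["DLNG_AMOUNT"] / df["AREA_M2"]
df["LOG_SMUP"] = np.log(df["SMUP"])
```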
> In light of this, our dataset includes the count of elementary and middle schools within a 500 m radius from the property, represented as ‘ELET_SCH’ and ‘MDL_SCH’.
Have you tried different distances, e.g. 1 km (see the references at the end)? Have you tried to encode, e.g., the distances to the 2-5 closest schools? In my opinion that might be a better solution. For example, I live in an area where the closest school is 600 m away, but within 1.5 km there are 5 schools. In my opinion a similar solution should be used for railway stations – the distance to the closest 2-5 railway stations. For the polynomial regression you might use not the distance but the inverse distance, as the larger the distance, the lower the price.
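A hedged sketch of how the distances to the k closest schools (or stations) could be encoded, together with the inverse-distance variant; the coordinates, column layout, and k are purely illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical projected coordinates in meters: properties and schools.
property_xy = np.array([[1000.0, 2000.0], [1500.0, 2600.0]])
school_xy = np.array([[900.0, 2100.0], [1800.0, 2500.0],
                      [1200.0, 1900.0], [2500.0, 3000.0],
                      [1400.0, 2550.0]])

# k-d tree over the facility locations for fast nearest-neighbor queries.
tree = cKDTree(school_xy)

# Distances from each property to its k nearest schools (k = 3 here).
k = 3
dists, _ = tree.query(property_xy, k=k)  # shape: (n_properties, k)

# Inverse-distance features: the farther the facility, the smaller the value,
# which matches the expected direction of the price effect.
inv_dists = 1.0 / (dists + 1.0)  # +1 m guards against division by zero
```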
The floor number – do you have information about the lift (elevator)? For example, when an elevator is available, the higher the floor, the larger the price, and vice versa.
Fig. 6 shows that the most important factor is WTD_SUBW_RANK – which in my opinion also reflects the distance from the city center: the closer to the city center, the more people pass through the station. Therefore, in my opinion, to show how important the railway station is, you should predict the real estate price (SMUP) without providing information about WTD_SUBW_RANK and NEAR_SUBW_DIST, but including the city-center distance. Then use the additional inputs WTD_SUBW_RANK and NEAR_SUBW_DIST and compare these two predictions. You may also illustrate the prediction error as a function of NEAR_SUBW_DIST (or inverse NEAR_SUBW_DIST).
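The comparison suggested here amounts to a simple ablation: fit the same model with and without the station variables and compare validation errors. A hedged sketch on synthetic stand-in data (the model choice and CITY_CENTER_DIST are illustrative; the other names follow the manuscript):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in features; real values would come from the authors' dataset.
X = pd.DataFrame({
    "CITY_CENTER_DIST": rng.uniform(500, 15_000, n),  # hypothetical extra input
    "NEAR_SUBW_DIST": rng.uniform(50, 2_000, n),
    "WTD_SUBW_RANK": rng.uniform(0, 1, n),
    "ELET_SCH": rng.integers(0, 5, n),
})
y = (10_000_000 / (1 + X["CITY_CENTER_DIST"] / 1_000)
     + rng.normal(0, 50_000, n)).clip(lower=1.0)  # positive price-like target

def rmsle_for(columns):
    """Fit the same model on a feature subset and return the validation RMSLE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X[columns], y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
    return np.sqrt(mean_squared_log_error(y_te, model.predict(X_te)))

without_station = rmsle_for(["CITY_CENTER_DIST", "ELET_SCH"])
with_station = rmsle_for(["CITY_CENTER_DIST", "ELET_SCH", "WTD_SUBW_RANK", "NEAR_SUBW_DIST"])
print(f"RMSLE without station features: {without_station:.3f}")
print(f"RMSLE with station features:    {with_station:.3f}")
```

The gap between the two validation errors then quantifies what the station variables add beyond plain distance to the city center.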
This paper is very interesting, but the presentation of the results should be improved. For example, it would be very interesting to see a polynomial-regression approximation of price vs. surface (plus the price-prediction error vs. surface), SMUP as a function of the city-center distance, and the SMUP price adjustment as a function of the ELET_SCH distance (use two models, with and without ELET_SCH).
When using ML it is difficult to understand the model behind the predictions. Therefore, where you can use (polynomial) regression, you may show interesting results that are easier to interpret. In my opinion, in this paper the illustration of the price influences is more important than the model's prediction accuracy.
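As an illustration of that point, a hedged sketch of a degree-2 polynomial fit of price against surface area (synthetic data; any of the other relationships listed above could be shown the same way):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Synthetic stand-in: floor area in square meters and a roughly proportional price.
area = rng.uniform(30, 150, 200).reshape(-1, 1)
price = 5_000 * area.ravel() + rng.normal(0, 20_000, 200)

# A low-degree polynomial regression has coefficients that can be read and
# plotted directly, unlike the black-box ML models.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(area, price)
print("intercept:", model.named_steps["linearregression"].intercept_)
print("coefficients:", model.named_steps["linearregression"].coef_)
```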
There are a lot of comments here; you do not need to include all of them in the paper, but I would like to see your responses in a rebuttal file. The most important comment is this: while reading the paper I had a problem with the input data. How many samples do you have, and how many are used for training and for validation? The number of samples determines which ML algorithms can be used! There is no description in this paper of how the number of training / validation samples influences the inference accuracy. For ML it is very important to analyze some of the input data and results; e.g., I would like to see a map with the railway stations and the SMUP.
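One standard way to report how the number of training samples influences accuracy is a learning curve; a minimal sketch with scikit-learn on synthetic data (all parameters illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (1_000, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(0, 0.1, 1_000)

# Cross-validated score as a function of the number of training samples:
# this answers directly how the amount of training data affects accuracy.
sizes, train_scores, valid_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_root_mean_squared_error")

for n_train, score in zip(sizes, valid_scores.mean(axis=1)):
    print(f"{n_train:4d} training samples -> validation RMSE {-score:.3f}")
```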
There are many similar papers, so your work is not the only one, e.g.:
https://www.sauder.ubc.ca/news/insights/property-train-how-new-subway-lines-boost-home-prices - simi
https://www.jstor.org/stable/44983752
https://journals.sagepub.com/doi/10.1177/0042098010371395 - similar conclusions to yours
https://scholarworks.rit.edu/cgi/viewcontent.cgi?article=11834&context=theses
Author Response
Please see the attachment
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Most of my comments have been addressed well. There are however some remaining issues:
1. Related to my previous comment 3, some reference should be made to the leaderboard itself.
2. The details on Figure 1(b) are very hard to make out; either use a clearer image with fewer irrelevant details, or a larger image so the dots stand out better against the other details.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
> 245: It’s worth noting that when the log-transformed ‘DLNG_AMOUNT’ is divided by ‘XUAR’, the resultant ‘SMUT’ also inherently undergoes log-scaling. This ensures that the distribution of ‘SMUT’ is further normalized, setting the stage for potential improvements in our model’s performance.
Do you mean:
log(DLNG_AMOUNT) / XUAR = log(SMUT)
In my opinion it should be:
log(DLNG_AMOUNT / XUAR) = log(SMUT)
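That is, since log(a / b) = log(a) - log(b), the per-area target is log(SMUT) = log(DLNG_AMOUNT) - log(XUAR); dividing the already log-transformed amount by XUAR gives a different quantity.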
> RRMSE:
Have you considered the following error formula (I made a mistake in my previous comment)?
RMSRE = sqrt( avg( ((p_i - a_i) / p_i)^2 ) )
See: https://stats.stackexchange.com/questions/413209/is-there-something-like-a-root-mean-square-relative-error-rmsre-or-what-is-t
In my opinion it is a better type of error, as with RRMSE the error is dominated by very large prices (p_i), and in my opinion large XUAR are difficult to predict. On the other hand, RMSRE is easier to illustrate.
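For concreteness, a small numeric sketch (made-up values) of the proposed RMSRE next to the earlier RRMSE normalization:

```python
import numpy as np

# Made-up actual (a) and predicted (p) prices; note the very large last value.
a = np.array([100_000.0, 250_000.0, 2_000_000.0])
p = np.array([120_000.0, 230_000.0, 1_500_000.0])

# RMSRE as proposed above: normalize each residual by the prediction p_i.
rmsre = np.sqrt(np.mean(((p - a) / p) ** 2))

# The same residuals normalized by the actual value a_i, for comparison.
alt = np.sqrt(np.mean(((p - a) / a) ** 2))

print(f"RMSRE (divide by p_i): {rmsre:.3f}")
print(f"divide by a_i:         {alt:.3f}")
```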
In my opinion the main drawback of the paper is the lack of a comparison of the results (conclusions) with other similar papers. I know that it is difficult to compare when you have completely different data, but try somehow to comment on them. Section 2 (related research) is two pages long – so try to compare your results with the others.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf