
Thread: What model/method to use to predict/impute missing values?

  1. #1

    What model/method to use to predict/impute missing values?




    First of all, apologies for the long and somewhat general post. I need to analyse the following tourism-related dataset and I'd appreciate any ideas / comments / suggestions. The data consists of information on tourist establishments, and more specifically the following variables:

    ID: Tourism establishment ID (not used) - around 800-850 unique IDs
    Year: 2009-2016
    Month: 1-12
    Region: region in which the establishment is located (6 levels)
    Category: type of accommodation establishment (eg Star Hotel, Apartment, Camping Site, Villas etc - in total there are 9 different levels)
    Class: depending on Category, another classification variable (eg Star Hotels have 1*-5* levels, Apartments have A-C levels etc). Overall there are 16 different combinations of category/class
    DaysOpen: Number of days the establishment was open during the particular month ranging from 0-31.
    BedCapacity: number of officially registered beds in the establishment. These range from small units of 2 beds to large hotels of 800 or 900 beds.
    Arrivals: number of arrivals at the establishment in the particular month. They could range from 0 (if there were no arrivals in a particular month) to maybe 10000 or 20000.
    Overnights: number of overnight stays at the establishment in the particular month. Again ranging from 0 to 40000 or 50000.

    My dataset has approximately 69000 cases (around 19000 had DaysOpen = 0 i.e. they were closed, and of the remaining cases, only 77% have submitted data - see below).

    Data collection is an ongoing process (i.e the government department I work at continues to collect the 2017 data and will continue to do so in the future). The problem we are faced with is that some establishments either don't send their monthly data on arrivals and overnights or are late in doing so (eg may send data 6 months late). Note that we don't have any missing data regarding the rest of the variables which are collected via a different process. However, we need to produce summary tables of the number of arrivals/overnights as well as Bed Utilisation Rates (defined as Sum(Overnights) / Sum(DaysOpen * BedCapacity) over a particular time period) at regular points in time.
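
    To make the Bed Utilisation Rate definition concrete, here is a minimal sketch in Python. The record layout and the names `MonthlyRecord` and `bed_utilisation_rate` are assumptions made for illustration, not part of our actual system. The function deliberately refuses to aggregate while any open month is still unreported, which is exactly why the missing values have to be imputed first.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MonthlyRecord:
    """One establishment-month (hypothetical layout for illustration)."""
    days_open: int             # DaysOpen, 0-31
    bed_capacity: int          # BedCapacity
    overnights: Optional[int]  # None while the establishment has not reported


def bed_utilisation_rate(records: Iterable[MonthlyRecord]) -> float:
    """Sum(Overnights) / Sum(DaysOpen * BedCapacity) over the given records."""
    total_overnights = 0
    total_bed_nights = 0
    for r in records:
        if r.days_open == 0:
            continue  # closed months contribute nothing to either sum
        if r.overnights is None:
            raise ValueError("missing Overnights: impute before aggregating")
        total_overnights += r.overnights
        total_bed_nights += r.days_open * r.bed_capacity
    if total_bed_nights == 0:
        raise ValueError("no open bed-nights in the selected period")
    return total_overnights / total_bed_nights
```

    Note that the rate is a ratio of sums, not an average of per-establishment rates, so large hotels carry more weight than small units.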

    As a result, we need to impute the missing values on arrivals/overnights through some model or otherwise, to be able to produce the necessary tables. My question is what would the most appropriate procedure be, taking into consideration the following clarifications/notes:
    • I expect that there is significant correlation between consecutive years for the same establishment (i.e. last year's data should be a good predictor of this year's data), if we take into consideration the overall trend (i.e whether tourism has generally increased during that year).
    • At the same time, most establishments exhibit a high degree of seasonality (summer months are much busier than winter ones) and there are also large discrepancies between regions (seaside areas much more popular) and possibly between different categories/classes.
    • Another point to note is that arrivals/overnights are positively correlated with BedCapacity or rather with the product BedCapacity * DaysOpen as that metric gives the total number of potential overnights an establishment could accommodate if full for the whole month.
    • A further point is that some of the variables Category/Class and BedCapacity may change during the 2009-2016 period for a particular establishment (eg if an establishment has been upgraded or expanded)
    • And finally, not all establishments operate during the whole 2009-2016 period. Some might have started operation after 2009 and others may have stopped operating before 2016.

    I'm not sure what model would make best use of my available data. First of all, would it be better to model the actual number of arrivals/overnights, or would the large range of values create its own problems and require transformations? Maybe it's better to use the Bed Utilisation Rate as the dependent variable, which has a fixed scale (0% to 100%) for all establishments?
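
    If the rate were used as the dependent variable, every imputed rate would have to be converted back into an overnight count before building the tables. A minimal sketch of that arithmetic follows; the function names are hypothetical, and clipping to the 0%-100% range is an assumption that occupancy never exceeds registered capacity.

```python
def overnights_to_rate(overnights: float, days_open: int, bed_capacity: int) -> float:
    """Utilisation rate for one establishment-month."""
    bed_nights = days_open * bed_capacity
    if bed_nights == 0:
        raise ValueError("rate is undefined for a closed month")
    return overnights / bed_nights


def rate_to_overnights(predicted_rate: float, days_open: int, bed_capacity: int) -> float:
    """Convert a predicted utilisation rate back to an overnight count."""
    # Assumption: occupancy cannot exceed capacity, so clip to [0, 1].
    clipped = min(max(predicted_rate, 0.0), 1.0)
    return clipped * days_open * bed_capacity
```

    One attraction of this route is that the imputed overnights can never exceed DaysOpen * BedCapacity, which is not guaranteed when the raw counts are modelled directly.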

    Whichever dependent variable I choose, what sort of model should I aim for? Perhaps a (generalized) linear mixed effects model with a seasonal time series structure (is that even possible?), but I don't know whether the missing data, the short time series, or the different interactions would make such a model hard to build. On the other hand, I'm thinking that a random forest model might account for the different interactions, but how would I account for an establishment's previous performance or the general development of tourism in other similar establishments? And moreover, would it be better to have the bed utilisation rate estimated at the end of each tree, or some sort of regression tree of the number of arrivals / overnights against the BedCapacity * DaysOpen variable?
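
    To pin down what I mean by the mixed effects idea, one possible way to write it is sketched below. This is only an assumed formulation for discussion: it treats overnights for establishment i in month t as a binomial-type count out of the available bed-nights, with fixed effects for month, year, region and category/class, and a random intercept b_i per establishment. It ignores overdispersion and any serial correlation beyond the establishment effect.

```latex
\begin{aligned}
\text{Overnights}_{it} &\sim \text{Binomial}(n_{it},\, p_{it}),
  \qquad n_{it} = \text{DaysOpen}_{it} \times \text{BedCapacity}_{it}, \\
\operatorname{logit}(p_{it}) &= \beta_0 + \alpha_{\text{month}(t)} + \gamma_{\text{year}(t)}
  + \delta_{\text{region}(i)} + \kappa_{\text{class}(i,t)} + b_i,
  \qquad b_i \sim \mathcal{N}(0, \sigma_b^2).
\end{aligned}
```

    Here the month effects carry the seasonality, the year effects carry the overall tourism trend, and b_i carries the establishment's own history.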

    As you can see, I'm quite confused about how best to approach it and I'd be grateful for any pointers you may share with me. Feel free to ask for any clarifications or additional information you may need.

    Thank you SO much for reading this and for any comments you may have.

  2. #2
    noetsi

    Re: What model/method to use to predict/impute missing values?

    The first thing you have to decide is whether the data is missing completely at random, missing at random, or missing not at random. If it's the last of these, that is, something about overnight stays itself is causing the missing data, you are out of luck. There are no good solutions for that. For instance, if establishments with more overnight stays are more likely to have missing data, you have a serious problem (I am assuming overnight stays is the only variable you have missing data on). If the data is missing completely at random you can just delete the incomplete records, but it's usually best to assume it's missing at random unless you are sure it's not, and in any case you say you want the records regardless.

    You next need to consider whether overnight stays is an interval, ordinal, or categorical variable and how much data is missing. You also have to determine whether the pattern of missing data is monotone or arbitrary. For a continuous variable with monotone missing data you would use linear regression, predictive mean matching, or propensity scoring. For an interval variable with arbitrary missing data (which is more likely) you would use Markov Chain Monte Carlo.

    Or so my books say. I have spent a lot of time reading material on missing data, but am hardly an expert.
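
    For what it's worth, the monotone/arbitrary distinction can be checked mechanically. A pattern is monotone when the variables can be ordered so that once a value is missing in a row, everything after it in that row is missing too. The sketch below assumes one row per establishment with the monthly values already in chronological order and `None` marking a missing month; the function name is made up for illustration. Late reporters would give a monotone pattern, while gaps in the middle of a series make it arbitrary.

```python
from typing import Optional, Sequence


def is_monotone(rows: Sequence[Sequence[Optional[float]]]) -> bool:
    """True if, in every row, no observed value follows a missing one.

    Columns must already be in the candidate order (e.g. months in time order).
    """
    for row in rows:
        seen_missing = False
        for value in row:
            if value is None:
                seen_missing = True
            elif seen_missing:
                return False  # an observed value after a gap: arbitrary pattern
    return True
```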

    Quote: "Perhaps a (generalized) linear mixed effects model with a seasonal time series structure (is that even possible?) but I don't know if the missing data, the short time series, or the different interactions create a problem to build such a model."
    That scared me even hearing it.... What are you actually trying to do? With time series the simpler the better generally.
    "Very few theories have been abandoned because they were found to be invalid on the basis of empirical evidence...." Spanos, 1995

  3. #3

    Re: What model/method to use to predict/impute missing values?


    Thanks noetsi for replying (and anyone else who has spent some time on this post).

    Quote (originally posted by noetsi):
    The first thing you have to decide is whether the data is missing completely at random, missing at random, or missing not at random. If it's the last of these, that is, something about overnight stays itself is causing the missing data, you are out of luck. There are no good solutions for that. For instance, if establishments with more overnight stays are more likely to have missing data, you have a serious problem (I am assuming overnight stays is the only variable you have missing data on). If the data is missing completely at random you can just delete the incomplete records, but it's usually best to assume it's missing at random unless you are sure it's not, and in any case you say you want the records regardless.
    You are right in assuming that the only variable I have missing data on is overnights (and arrivals, but whatever solution is applied to one, I'll apply to both variables as they are both my "dependents"). I don't think that there is any structure in the missingness in relation to the size of the overnight stays. Both large hotels and small villas forget to submit their data, so the missing values could be either large or small.

    Quote (originally posted by noetsi):
    You next need to consider whether overnight stays is an interval, ordinal, or categorical variable and how much data is missing. You also have to determine whether the pattern of missing data is monotone or arbitrary. For a continuous variable with monotone missing data you would use linear regression, predictive mean matching, or propensity scoring. For an interval variable with arbitrary missing data (which is more likely) you would use Markov Chain Monte Carlo.
    Overnight stays is a positive continuous variable (well, actually, it can only take integer values, but in the range of 0 to a few tens of thousands, so for our purposes a continuous variable should be ok, even if the missing data are imputed by non-integers eg. 553.45 overnight stays). I understand your suggestions (at least, I think I do!) of linear regression etc but I fail to see how the time-series part is taken into consideration. I think the fact that I will probably have data on previous years/months for the particular establishment is an important part of the relationship in trying to estimate the missing values.

    If, for example, I'm missing data on July 2016 for a particular hotel and I know how that hotel performed in the previous 4 Julys (2012-2015), as well as how each of those Julys compared to, say, the January - June period of the corresponding year, isn't that information too important to ignore when estimating the missing data for July 2016?
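
    To illustrate the kind of hotel-specific information I mean, here is a rough sketch of a simple ratio estimate, not the mixed model itself. It assumes a hypothetical per-establishment layout `history[year][month] -> overnights` (with `None` for missing months) and the made-up function name `impute_by_seasonal_ratio`. The idea is to average, over earlier years, the ratio of the target month to a base period, and apply that ratio to the base period of the current year.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence

# history[year][month] -> overnights (None if missing); hypothetical layout
History = Mapping[int, Mapping[int, Optional[float]]]


def impute_by_seasonal_ratio(
    history: History, year: int, month: int, base_months: Sequence[int]
) -> Optional[float]:
    """Estimate a missing month from the establishment's own seasonal pattern."""
    ratios = []
    for past_year, months in history.items():
        if past_year >= year:
            continue  # only earlier years inform the ratio
        target = months.get(month)
        base = [months.get(m) for m in base_months]
        if target is None or any(v is None for v in base):
            continue  # skip years with gaps in the months we need
        base_total = sum(base)
        if base_total > 0:
            ratios.append(target / base_total)
    current = [history.get(year, {}).get(m) for m in base_months]
    if not ratios or any(v is None for v in current):
        return None  # not enough information for this establishment
    return mean(ratios) * sum(current)
```

    For July 2016 this would be called with `month=7` and `base_months=range(1, 7)`. Because the current year's January - June total enters directly, the estimate picks up the overall tourism trend for that year as well as the hotel's usual seasonality. It returns `None` for establishments without a usable history, which would then need some fallback based on region, category/class and capacity.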

    By my understanding, a linear regression approach would simply look at how each factor affects the overnight stays and estimate the missing data based on that, ignoring the hotel-specific information, whenever that is available. This is what I was trying to explain with my "... (generalized) linear mixed effects model with a seasonal time series structure ..." which scared you off!

    Quote (originally posted by noetsi):
    Or so my books say. I have spent a lot of time reading material on missing data, but am hardly an expert.
    I'm sure you are much more of an expert than I am!

    I hope this clarifies things a little. Once again thanks for spending some time on this and I'd appreciate any further comments you, or anyone else, may have! Thanks!!
