I need to estimate the age of labor market entry for a given year (2016), but I can only use four explanatory variables: age, sex, income value, and the worker's self-reported employment category (employee, self-employed, domestic worker, or employer). The data come from a household sample survey. Additionally, I thought about running a logistic regression to estimate the probability of contributing to Social Security in that same year and then using that probability as a fifth explanatory variable. For that auxiliary model, I could use other variables available in the survey, such as level of education, and so on. In the end, I would have five explanatory variables (age, sex, income value, category, and probability of contribution in 2016), and the response variable would be the age of labor market entry (age when the worker started their first job).
I know it is not much, but is that enough to model the age of entry? Which model could I use? Should I introduce any transformed variable (like the square of age) to capture changes between generations, for example?
I need this model in order to apply it to administrative records containing the history of contributions to the social security system. In this official dataset (the whole universe, not a sample), I would only use the same five variables: age, sex, the category in which the worker is officially classified (employee, self-employed, etc.), whether the insured person contributed in 2016, and, if so, the mean official income value registered for that same year (2016).
Unfortunately, I only have monthly data for the last 10 years (monthly records from jan/2007 to dec/2016). The older records (before jan/2007) are also available, but they are aggregated, so I can tell how many years a given person contributed before this 10-year period, but I do not know when these contributions were made, nor when this person made their first contribution or entered the labor market. In countries where informality is high, the age at first contribution and the age of labor market entry may differ greatly. Since I want to estimate the contribution density (number of months of contributions up to dec/2016 ÷ number of months since the worker entered the labor market up to dec/2016) for the whole population of insured workers, I need to impute, for each insured person in the dataset, an estimate of the age of labor market entry.
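To make the target quantity concrete, here is a minimal sketch of the density calculation I have in mind. The function and argument names are hypothetical, not fields from the actual records. It assumes ages are measured in years as of dec/2016, and that the aggregated pre-2007 years can simply be converted to months and added to the monthly count from the jan/2007-dec/2016 window.

```python
def contribution_density(years_before_2007: float,
                         months_2007_2016: int,
                         age_2016: float,
                         entry_age: float) -> float:
    """Months contributed up to dec/2016 divided by months since labor market entry.

    entry_age is the (estimated) age of labor market entry; all names here
    are illustrative assumptions.
    """
    # Total contributions: aggregated pre-2007 years plus the monthly records.
    months_contributed = years_before_2007 * 12 + months_2007_2016
    # Exposure: months elapsed between estimated entry and dec/2016.
    months_in_market = (age_2016 - entry_age) * 12
    if months_in_market <= 0:
        raise ValueError("entry_age must be lower than age_2016")
    return months_contributed / months_in_market


# Example: 40 years old in 2016, estimated entry at 18,
# 5 aggregated years before 2007 and 90 months in the 2007-2016 window.
density = contribution_density(5, 90, 40, 18)
print(round(density, 3))
```

A ratio above 1 would mean the estimated entry age is later than the contribution history allows, so it could serve as a consistency check on whatever model is used for the imputation.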