
Thread: Best way to treat integer/float column with null values in logistic regression

  1. #1


    Hi,

    I was wondering if anyone can assist me with this issue.

    I am building a logistic regression model to predict whether a visitor purchases or not, based on website behaviour data.

    Two of the factors that I would like to include in the model are the visits to purchase and the days to purchase. The problem is that when the visitor has not purchased, both of these are null. My first approach was to fill the null values with 0, but the resulting model looks too good to be true: visits to purchase becomes the single biggest factor in the model.

    When I run this through Python/SciPy with the null values left in, I get the error "LinAlgError: SVD did not converge", so I expect that I need to give these a value. I know that this is not a Python forum, but my question is about logistic regression models in general rather than a code-related one.

    I would greatly appreciate any assistance that the experts on this forum could provide.

    Best

    Rod

  2. #2


    Have you found a workaround for this yet? I'm curious.

    By encoding NULL for non-purchases, you are essentially creating a category for those cases, but it is not going to be meaningful because this "predictor" is generated after the fact. Unless you already know the outcome, you wouldn't be able to build this predictor in the first place, which is why I'm also not sure how to deal with it.
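
    To make that concrete, here is a minimal sketch in plain Python. The rows, the field names and the `fill_zero` helper are made up for illustration, and it assumes every buyer has at least one visit. Once the nulls are filled with 0, the rule "value greater than 0" reproduces the outcome exactly, which is why the model looks too good to be true.

```python
# Hypothetical toy data: visits_to_purchase is None when no purchase happened.
rows = [
    {"purchased": 1, "visits_to_purchase": 3},
    {"purchased": 1, "visits_to_purchase": 1},
    {"purchased": 0, "visits_to_purchase": None},
    {"purchased": 0, "visits_to_purchase": None},
    {"purchased": 1, "visits_to_purchase": 7},
    {"purchased": 0, "visits_to_purchase": None},
]


def fill_zero(value):
    """Replace a missing value with 0, as in the original poster's first approach."""
    return 0 if value is None else value


# Pair each filled predictor value with the outcome it is supposed to predict.
filled = [(fill_zero(r["visits_to_purchase"]), r["purchased"]) for r in rows]

# A trivial rule, "predict purchase when the value is above 0", scored on the same rows.
rule_accuracy = sum((v > 0) == (y == 1) for v, y in filled) / len(filled)
print(rule_accuracy)
```

    On this toy data the rule is right for every row, because the filled column is just the outcome in disguise.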

    What if you try combining these two metrics into one, something like "average visits per day before purchase"? I would suspect that a person might check an item multiple times prior to buying it, so a high visit count leading up to a purchase would make sense. To measure that, I might count the total visits between the first visit and the purchase, then divide by the number of days in between. If there is no purchase, the value could be the count of visits from the first visit to the end of the data set's time period, divided by the number of days from the first visit to the end of that period. People who don't purchase would generally have similar values of (total visits in time period / # of days from first visit to end of time period), whereas people who do purchase might have a higher (total visits from first visit to purchase / # of days from first visit to purchase).
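
    A rough sketch of that combined metric, using only the Python standard library. The function name `visits_per_day`, the example dates and the choice to count a window shorter than one day as one day are my own assumptions, not something from your data.

```python
from datetime import date


def visits_per_day(visit_dates, purchase_date, period_end):
    """Visit rate from the first visit up to the purchase, or up to
    period_end when purchase_date is None (no purchase)."""
    first_visit = min(visit_dates)
    cutoff = purchase_date if purchase_date is not None else period_end
    n_visits = sum(1 for d in visit_dates if first_visit <= d <= cutoff)
    # Assumption: treat a same-day window as one day to avoid dividing by zero.
    n_days = max((cutoff - first_visit).days, 1)
    return n_visits / n_days


# Hypothetical visitors within a data set that ends on 31 January 2024.
period_end = date(2024, 1, 31)
buyer_visits = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 5)]
browser_visits = [date(2024, 1, 1), date(2024, 1, 16)]

buyer_rate = visits_per_day(buyer_visits, date(2024, 1, 5), period_end)
browser_rate = visits_per_day(browser_visits, None, period_end)
print(buyer_rate, browser_rate)
```

    Every visitor gets a number this way, so there are no nulls left to fill. Keep in mind that the cutoff still depends on whether a purchase happened, so the after-the-fact concern above does not go away completely.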
