The rarity score is calculated with the following formula:

Rarity score = 1/(% chance of occurrence)

Let's say I have a trait that has a 10% chance of occurring.

The rarity score for this trait will be:

1/(10%) = 1/0.10 = 10
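As a minimal sketch of the formula above in Python (the function name `rarity_score` is just an illustrative choice, and the chance is assumed to be passed as a fraction rather than a percentage):

```python
def rarity_score(chance: float) -> float:
    """Un-normalized rarity score: 1 / (chance of occurrence).

    `chance` is a fraction in (0, 1], e.g. 0.10 for 10%.
    """
    if not 0 < chance <= 1:
        raise ValueError("chance must be a fraction in (0, 1]")
    return 1.0 / chance


# A trait with a 10% chance of occurring.
ten_percent_score = rarity_score(0.10)
print(ten_percent_score)
```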

This score does not include any trait normalization.

What I am trying to find out is how the process of trait normalization (or rarity normalization) is done.

From my research, normalization takes into account the number of traits in a specific trait type.

Let's say we have two trait types:

Trait_Type: Hair-Color

Value: Green 1% Score: 100

Value: Blue 99% Score: ~1.01

Trait_Type: Shirt-Color

100 traits, each with a 1% chance of occurrence.

When we use the rarity formula above, every shirt-color value gets a score of 100, the same as the score of the green hair-color.
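To make the problem concrete, here is a small sketch that scores both trait types with the un-normalized formula. The shirt-color names (`Shirt-1` ... `Shirt-100`) are placeholders invented for this example; only the percentages come from the scenario above.

```python
# Chance of occurrence per value, expressed as fractions.
hair_color = {"Green": 0.01, "Blue": 0.99}

# 100 hypothetical shirt colors, each with a 1% chance.
shirt_color = {f"Shirt-{i}": 0.01 for i in range(1, 101)}

# Un-normalized rarity score: 1 / chance.
hair_scores = {value: 1.0 / chance for value, chance in hair_color.items()}
shirt_scores = {value: 1.0 / chance for value, chance in shirt_color.items()}

green_score = hair_scores["Green"]

# Every shirt color ends up with the same score as the rare green hair,
# even though no shirt color is rarer than any other.
all_shirts_match_green = all(
    abs(score - green_score) < 1e-9 for score in shirt_scores.values()
)
print(round(green_score, 2), all_shirts_match_green)
```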

This is not accurate. When a trait type has 100 values (or simply many values), each value will naturally have a low percentage, which gives every one of them a high score.

In reality, no single shirt-color is worth much, because they all have the same 1% chance of occurring.

On the other hand, the green hair-color really is valuable.

My goal is to capture this difference by taking the trait count of each trait_type into account, so that when we score these traits the green hair-color scores much higher than any shirt-color.

The information I have is:

The chance of a trait happening.

The rarity score of it.

All the data about trait counts (number of trait types, number of traits inside each trait type, etc.)

The farthest I got is:

Vanilla_score = 1/(% chance of trait occurring)

Normalized_score = (Vanilla_score * Avg number of traits per trait_type) / Number of traits in this trait_type

This does not give a score close enough to the one I am trying to reproduce.

For example, take the trait_type `Flair`:

Value: `hijab`

Avg Trait_count per category: `13.1875`

Trait_category_count: `16`

Trait_count_for_flair_category: `40`

The trait has a ~0.44% chance of occurring.

The vanilla score gives it a value of: `243.87`

With my normalization formula the score is: `80.4`

On the site I want to replicate, the score is: `35.87`
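For reference, the normalization attempted above can be reproduced with the `Flair` figures. This is only a sketch of my own formula, not the site's method; the function and variable names are illustrative, and the vanilla score of `243.87` is used directly because the percentage is only known approximately.

```python
def normalized_score(vanilla_score: float,
                     avg_traits_per_type: float,
                     traits_in_type: int) -> float:
    """Attempted normalization:
    (vanilla score * average traits per trait_type) / traits in this trait_type.
    """
    return vanilla_score * avg_traits_per_type / traits_in_type


vanilla = 243.87           # vanilla score of the `hijab` trait
avg_trait_count = 13.1875  # average trait count per trait_type
flair_trait_count = 40     # number of traits in the `Flair` trait_type
site_score = 35.87         # score shown on the site being replicated

attempt = normalized_score(vanilla, avg_trait_count, flair_trait_count)
print(round(attempt, 2))               # my formula's result
print(round(attempt - site_score, 2))  # gap to the site's score
```

The gap between the two printed values is the discrepancy I am trying to explain.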

What other calculations can be used to take the number of traits per trait_type into account?

(If any data is missing let me know and I will add it.)

Reference links:

Trait Normalization (at the end of website)

Explanation about current used formula