As with customers, we can use RFM to identify the key areas of our assortment. The classic approach is to compute percentiles for each metric and then build groups from them, picking out the ones that matter to us. An alternative is to use a model from the field of unsupervised learning, so that the algorithm finds the relevant clusters itself. We take the second route here and choose one of the most popular models of this type, K-Means, which assigns each point to the cluster whose centre (the mean of the points in that cluster) is nearest. We still have to decide how many clusters to define; for the K-Means algorithm this choice is crucial.
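The classic percentile approach mentioned above can be sketched as follows. This is a minimal illustration in plain Python, not the method used in the rest of this article: the feature names follow the article, while `percentile_scores`, `N_BINS`, the choice of quartiles, and all sample values are assumptions made for the example.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical product rows; the numbers are invented for illustration only.
products = [
    {"product_id": "A", "total_quantity": 10, "total_ordered": 5, "total_amount": 100.0},
    {"product_id": "B", "total_quantity": 20, "total_ordered": 9, "total_amount": 180.0},
    {"product_id": "C", "total_quantity": 35, "total_ordered": 12, "total_amount": 420.0},
    {"product_id": "D", "total_quantity": 50, "total_ordered": 30, "total_amount": 390.0},
    {"product_id": "E", "total_quantity": 80, "total_ordered": 25, "total_amount": 900.0},
    {"product_id": "F", "total_quantity": 120, "total_ordered": 40, "total_amount": 1500.0},
    {"product_id": "G", "total_quantity": 200, "total_ordered": 75, "total_amount": 2600.0},
    {"product_id": "H", "total_quantity": 400, "total_ordered": 150, "total_amount": 6100.0},
]

FEATURES = ("total_quantity", "total_ordered", "total_amount")
N_BINS = 4  # quartiles; an assumption, not a value taken from the article


def percentile_scores(values, n_bins=N_BINS):
    """Map each value to a score from 1 (lowest bin) to n_bins (highest bin)."""
    # quantiles() returns the n_bins - 1 cut points between the bins.
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [bisect_right(cuts, value) + 1 for value in values]


# Score every feature separately.
for feature in FEATURES:
    scores = percentile_scores([p[feature] for p in products])
    for product, score in zip(products, scores):
        product[f"{feature}_score"] = score

# Concatenate the three scores into one segment label, e.g. "444" for the top group.
for product in products:
    product["segment"] = "".join(str(product[f"{f}_score"]) for f in FEATURES)
    print(product["product_id"], product["segment"])
```

The groups still have to be interpreted by hand (for example, deciding that every segment starting with 4 counts as a key product), which is the step the clustering approach tries to automate.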
There are several ways to choose the number of clusters. We will use the fact that our value is determined by only three dimensions (total_quantity, total_ordered, total_amount), so the data can simply be plotted. If there were more dimensions, there are ways to reduce them, but this comes at some cost; in that case it is better to consider a different method of determining the number of clusters.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=...)
ax = fig.add_subplot(projection='3d')
fig.patch.set_facecolor('white')

# one point per product, positioned by its three features
ax.scatter(
    df_products['total_quantity'],
    df_products['total_ordered'],
    df_products['total_amount'],
    marker='+', s=..., cmap='RdBu',
)
ax.set_xlabel('total_quantity', fontsize=...)
ax.set_ylabel('total_ordered', fontsize=...)
ax.set_zlabel('total_amount', fontsize=...)
ax.legend()
plt.show()
```

Products described by three features can be visualized in 3D space

The plot shows clearly separated main clusters plus some additional points, which may be the so-called leftovers (outliers). These outliers deserve attention; there are several ways to deal with them, depending on the situation. For the sake of simplicity, we will let the algorithm assign these points to one of the two nearest clusters.
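One common way of dealing with such points, as an alternative to leaving them to the clustering algorithm, is to flag them before fitting. The sketch below applies the 1.5 × IQR rule in plain Python. It is only one option among several: the function names `iqr_bounds` and `flag_outliers`, the multiplier `k=1.5`, and the sample rows are all assumptions made for the example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(rows, features, k=1.5):
    """Return one boolean per row: True if any feature lies outside its fences."""
    bounds = {f: iqr_bounds([row[f] for row in rows], k) for f in features}
    return [
        any(not (bounds[f][0] <= row[f] <= bounds[f][1]) for f in features)
        for row in rows
    ]


# Hypothetical rows; the last one is deliberately extreme.
rows = [
    {"total_quantity": 10, "total_ordered": 5, "total_amount": 100.0},
    {"total_quantity": 11, "total_ordered": 6, "total_amount": 110.0},
    {"total_quantity": 12, "total_ordered": 6, "total_amount": 120.0},
    {"total_quantity": 13, "total_ordered": 7, "total_amount": 130.0},
    {"total_quantity": 14, "total_ordered": 7, "total_amount": 140.0},
    {"total_quantity": 15, "total_ordered": 8, "total_amount": 150.0},
    {"total_quantity": 16, "total_ordered": 8, "total_amount": 160.0},
    {"total_quantity": 300, "total_ordered": 90, "total_amount": 5000.0},
]

flags = flag_outliers(rows, ("total_quantity", "total_ordered", "total_amount"))
print(flags)
```

Flagged rows can then be inspected, clustered separately, or dropped, depending on whether they are data errors or genuinely unusual products.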
```python
from sklearn.cluster import KMeans

# model initialization
k_model = KMeans(n_clusters=..., init='k-means++', max_iter=..., n_init=..., random_state=RANDOM_SEED)

# fitting the model to the data
k_model.fit(df_products[['total_quantity', 'total_ordered', 'total_amount']])

# assigning to each product the cluster determined by the model
df_products['cluster'] = k_model.labels_

# basic descriptive statistics of our clusters at a glance
df_products[['cluster', 'total_quantity', 'total_ordered', 'total_amount']].groupby('cluster').agg(['count', 'median', 'sum'])
```

As you can see, the aggregated values of the products differ significantly between clusters. With the labels defined, we can proceed to constructing the model.
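To make clearer what fitting a K-Means model involves, here is a minimal sketch of the underlying iteration (often called Lloyd's algorithm) in plain Python. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it starts from randomly picked points rather than `k-means++`, performs a single run, and the function name `kmeans` and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster equal-length tuples into k groups; return (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # plain random start, not k-means++
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:  # nothing moved, so the result is stable
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical 3D points (quantity, orders, amount) forming two separated groups.
points = [
    (1, 1, 1), (1, 2, 1), (2, 1, 2),
    (10, 10, 10), (11, 10, 10), (10, 11, 12),
]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the result depends on the starting centroids, library implementations typically repeat the procedure from several starts and keep the best run, which is what the `n_init` parameter controls in scikit-learn.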