In part 4 of this series, we will examine the distribution of individual data points along the principal components (PCs) identified using factor analysis for mixed data (FAMD). This gives some idea of how distinct certain subpopulations are within the entire dataset, which is helpful to know before attempting any machine learning algorithms for clustering and/or prediction.
If you have not seen the first three parts, please find them under the "Files" tab, as things will make quite a bit more sense with those in mind. :)
PCAmixdata can plot all individual points in the dataset along any two given PCs and (most importantly) colour them according to a grouping variable, as can FactoMineR. So I will show only the FactoMineR implementation here.
In our Telco dataset example, I coloured the points by the target variable (whether or not the customer churned).
Individuals with similar profiles (in this case, in terms of customer behaviour) sit close to each other on the figure. Given the large overlap between the "Churn" and "No churn" populations of customers, any significant/meaningful differences between the two populations are likely to be complex and non-linear.
Nevertheless, we see more separation between the two populations along PC2 than PC1. Recall from part 2 of this series that, after varimax rotation, PC2 is most associated with two variables, one of which is Tenure. This suggests that these two properties may represent notable differences between the "Churn" and "No churn" customers, and that they are worth further investigation.
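If you would like to put a rough number on "more separation along PC2 than PC1" rather than judging it by eye, one option is a standardized mean difference (Cohen's d) between the two groups along each PC. Below is a minimal sketch in Python using only the standard library. It is not part of the analysis above: the function name `standardized_separation` is hypothetical, and the coordinates are toy values made up purely for illustration, not taken from the Telco data. In practice you would feed in the individual coordinates from your own FAMD result.

```python
from math import sqrt
from statistics import mean, variance


def standardized_separation(coords, labels, component):
    """Absolute standardized mean difference (Cohen's d) between two groups
    along one component.

    coords    -- list of per-individual coordinate tuples, e.g. (PC1, PC2)
    labels    -- group label for each individual (exactly two distinct labels)
    component -- zero-based index of the component to compare
    """
    groups = sorted(set(labels))
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")

    a = [c[component] for c, g in zip(coords, labels) if g == groups[0]]
    b = [c[component] for c, g in zip(coords, labels) if g == groups[1]]

    # Pooled sample variance of the two groups
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return abs(mean(a) - mean(b)) / sqrt(pooled_var)


# Toy (PC1, PC2) coordinates for illustration only; NOT the Telco results.
coords = [(-1.0, 0.0), (0.0, 1.0), (1.0, 2.0), (-1.0, 2.0), (0.0, 3.0), (1.0, 4.0)]
labels = ["No churn"] * 3 + ["Churn"] * 3

separation = {
    f"PC{i + 1}": standardized_separation(coords, labels, i) for i in range(2)
}
print(separation)
```

A larger value means the group means sit further apart relative to the spread within each group. Keep in mind that this only captures a shift in means along a single axis, so it will miss exactly the kind of complex, non-linear differences mentioned above.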
If you want to use Python, the prince package appears to have a similar function (famd.plot_row_coordinates()). However, as the package documentation is still in its early stages, I am not confident enough in my interpretation of the parameters involved to show it here. As more information becomes available, I will update this post.
See you in the next post! :)