High-quality, consistently formatted data is key to any successful data analysis project. In this ongoing series, I will build a personal knowledge base covering various aspects of data cleaning and transformation.
Where applicable, I will use the Telco data set as the example, though other data sets will appear as I demonstrate new techniques.
Let's get started!
We will get the dataset in its raw form:
Recall from our preliminary exploratory data analysis using the dataMaid package in a previous post that there are several issues with this data set that need to be fixed:
The customerID column needs to be removed
The name of the tenure column is not capitalized, while those of the other columns are
The TotalCharges column has some missing values and at least one row has Tenure=0
The rows where Tenure=0 are also the rows where TotalCharges values are missing, as it would be impossible to calculate the latter without a positive value of the former
The SeniorCitizen column is of integer type as it is encoded in 0s and 1s, while other categorical features are encoded as strings
We take a quick look at the data and column properties to confirm that these are indeed in need of addressing:
Additionally, we see that the TotalCharges column has the wrong data type (object rather than float64), which is also masking the missing values.
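To illustrate how a text-typed column can hide missing values, here is a minimal sketch on a toy data frame. The three rows and the blank-string placeholder are assumptions made for illustration, not values taken from the actual data set.

```python
import pandas as pd

# Toy stand-in for the Telco data (hypothetical values). The blank string
# mimics how a missing TotalCharges value might be stored in the raw file.
df = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# TotalCharges is stored as text here, not as a numeric column
print(df.dtypes)

# isna() does not flag the blank string, so the gap stays hidden
n_missing_reported = int(df["TotalCharges"].isna().sum())

# Counting blank strings explicitly reveals it
n_blank = int((df["TotalCharges"].str.strip() == "").sum())

print(n_missing_reported, n_blank)
```

As long as the column holds text, a blank entry is just another string, which is why the missing-value count looks clean until the column is converted.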
Let's start cleaning!
We first convert TotalCharges to the right column data type, float64:
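One way to do this conversion in pandas is `pd.to_numeric` with `errors="coerce"`, which turns anything that cannot be parsed as a number into NaN. This is a sketch on hypothetical toy values, and it may not be the exact call used for the original post.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values)
df = pd.DataFrame({
    "tenure": [1, 0, 34],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# Unparseable entries (such as the blank string) become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

n_missing = int(df["TotalCharges"].isna().sum())
print(df["TotalCharges"].dtype, n_missing)
```

After the conversion, the previously hidden gaps show up as ordinary missing values that `isna()` can count.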
Now we see that there are indeed 11 values missing in the TotalCharges column.
Let's check if these 11 rows also contain Tenure=0:
Indeed, the two sets of problematic data points are the same. As there are only 11 such data points out of >7,000 in total, we can remove them.
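A minimal sketch of this check and the subsequent removal, using a toy data frame with hypothetical values. The column name `Tenure` assumes the column has already been renamed with a capital letter.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical values): two rows have zero tenure
# and a missing TotalCharges value
df = pd.DataFrame({
    "Tenure": [1, 0, 34, 0],
    "TotalCharges": [29.85, np.nan, 1889.5, np.nan],
})

missing_charges = df["TotalCharges"].isna()
zero_tenure = df["Tenure"] == 0

# True only if both conditions flag exactly the same rows
same_rows = bool((missing_charges == zero_tenure).all())
print(same_rows)

# Drop the rows with missing TotalCharges and rebuild the index
df_clean = df.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(len(df), len(df_clean))
```

Comparing the two boolean masks element by element is stricter than comparing counts, since it confirms that the same rows are affected and not just the same number of rows.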
For the sake of consistency with the other categorical features, we will encode the SeniorCitizen column in "Yes"/"No":
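A simple way to do this re-encoding is `Series.map` with a dictionary. The sketch below uses a toy data frame with hypothetical rows.

```python
import pandas as pd

# Toy stand-in (hypothetical values)
df = pd.DataFrame({
    "SeniorCitizen": [0, 1, 0],
    "Partner": ["Yes", "No", "No"],
})

# Map the integer codes onto the same labels the other columns use
df["SeniorCitizen"] = df["SeniorCitizen"].map({0: "No", 1: "Yes"})

print(df["SeniorCitizen"].tolist())
```

Note that `map` returns NaN for any value not present in the dictionary, so an unexpected code would surface as a missing value instead of passing through silently.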
Also, we will consolidate the 'No internet service' levels into 'No' to reduce cardinality:
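A sketch of the consolidation with `DataFrame.replace`, on a toy data frame. The column names and the list `service_cols` are assumptions for illustration; the real data set may have more columns that carry this level.

```python
import pandas as pd

# Toy stand-in (hypothetical values and column selection)
df = pd.DataFrame({
    "InternetService": ["DSL", "No", "Fiber optic"],
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "StreamingTV": ["No", "No internet service", "Yes"],
})

# Columns assumed to contain the extra level
service_cols = ["OnlineSecurity", "StreamingTV"]

# Fold 'No internet service' into 'No' for those columns only
df[service_cols] = df[service_cols].replace("No internet service", "No")

print(df)
```

Restricting the replacement to an explicit list of columns keeps the InternetService column itself untouched.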
We quickly check that the appropriate category levels have been converted to "Yes"/"No" encoding:
Here I export the cleaned data set to CSV for later use.
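A minimal export sketch with `DataFrame.to_csv`. The file name `telco_cleaned.csv` and the temporary directory are hypothetical choices for this example, and the data frame holds toy values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values)
df = pd.DataFrame({
    "SeniorCitizen": ["No", "Yes"],
    "TotalCharges": [29.85, 1889.5],
})

# Hypothetical output location inside a fresh temporary directory
out_path = Path(tempfile.mkdtemp()) / "telco_cleaned.csv"

# index=False avoids writing the row index as an extra column
df.to_csv(out_path, index=False)

# Read the file back to confirm the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```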
This form of the dataframe is useful for survival analysis and feature engineering for machine learning.
To prepare the dataframe for factor analysis, we will rename levels of all categorical variables to reflect the column name.
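One possible way to do the renaming is to prefix each level with its column name. The `Column_Level` format, the list `categorical_cols`, and the toy values below are all assumptions for illustration; the naming scheme used in the original post may differ.

```python
import pandas as pd

# Toy stand-in (hypothetical values)
df = pd.DataFrame({
    "Partner": ["Yes", "No"],
    "Contract": ["Month-to-month", "Two year"],
    "TotalCharges": [29.85, 1889.5],
})

# Columns assumed to be categorical
categorical_cols = ["Partner", "Contract"]

# Work on a copy so the original encoding stays available
df_fa = df.copy()
for col in categorical_cols:
    # e.g. "Yes" in the Partner column becomes "Partner_Yes"
    df_fa[col] = col + "_" + df_fa[col].astype(str)

print(df_fa)
```

With the column name embedded in each level, a level such as "Yes" can be traced back to its variable when levels from different columns appear side by side.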
A quick check that all is well with the category level names:
Any comments or suggestions for improvements will be greatly appreciated!
Till next time! :)