Open In Colab

1. Introduction


Good quality and consistently formated data is absolutely key for any successful data analysis project. In this ongoing series, I will create a personal knowledge base of various aspects of data cleaning and transformation.

Where applicable, I will use the Telco data set as the example, though other data sets will appear as I demonstrate new techniques.

Let's get started!

2. Import data


We will get the dataset in its raw form:

3. Intial check


Recall from our preliminary exploratory data analysis using the dataMaid package in a previous post that there are several issues with this data set that need to be fixed:

  • The customerID column needs to be removed
  • Headers of the gender and tenure columns are not capitalized, while those of the other columns are
  • The TotalCharges column has some missing values and at least one row has Tenure=0
    • It is quite possible that the rows where Tenure=0 are also the rows where TotalCharges values are missing, as it would be impossible to calculate the latter without a positive value of the former
  • The SeniorCitizen column is of integer type as it is encoded in 0s and 1s, while other categorical features are encoded in strings as the object type
  • Several categorical features have three levels (e.g. 'Yes'/'No'/'No internet service') where the 'No internet service' level contains very few data points and can be consolidated into 'No'

We take a quick look at the data and column properties to confirm that these are indeed in need of addressing:

Loading output library...

Additionally, we see that the TotalCharges column has the wrong data type (object rather than float64), which is also masking the missing values.

Let's starting cleaning!

4. Data cleaning


4.2 Rename columns for consistent capitalization


We first convert TotalCharges to the right column data type, float64:

Now we see that indeed there are 11 values missing in the TotalCharges column.

Let's check if these 11 rows also contain Tenure=0:

Loading output library...

Indeed, the two sets of problematic data points are the same. As there are only 11 such data points out of >7,000 in total, we can remove them.

Loading output library...

For the sake of consistency with the other categorical features, we will encode the SeniorCitizen column in "Yes"/"No":

4.5 Combine sparse levels


Also, we will consolidate the 'No internet service' levels into 'No' to reduce cardinality:

5. Final check

Loading output library...

We quickly check that the appropriate category levels have been converted to "Yes"/"No" encoding:

6. Export to CSV file


Here I export the cleaned data set to CSV for later use.

6.1 Retain "Yes"/"No" encoding of categorical variables


This form of the dataframe is useful for survival analysis and feature engineering for machine learning.

6.2 Rename category levels to reflect variable name


To prepare the dataframe for factor analysis, we will rename levels of all categorical variables to reflect the column name.

Loading output library...

A quick check that all is well with the category level names:

Any comments or suggestions for improvements will be greatly appreciated!

Til next time! :)