Python data cleaning basics part 1 (EDA, missing values & incorrect information).

Carlos Sanchez
7 min read · Oct 25, 2021

When working with datasets, you may find some surprises from time to time. Missing data and wrongly formatted records are just a couple of the common messy-data errors you'll probably need to fix before starting your data analysis or machine learning model. Of course, there is a vast universe of different and precise techniques for very specific data-cleaning contexts; we'll just cover the most common ones.

For this article we will be working with the vehicles_messy.csv file, which you can download from the link, and here you can also find the Google Colab notebook with all the code.

This article is written with those who are beginning data analytics with Python in mind, and the explanations will be very detailed, so mid/advanced-level readers may find some of them obvious; keep that in mind.

Let’s start by importing the data from the CSV file. We’ll be working with pandas dataframes using the read_csv() function.
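A minimal sketch (assuming vehicles_messy.csv sits in your working directory):

    import pandas as pd

    # Load the CSV file into a dataframe called data
    data = pd.read_csv('vehicles_messy.csv')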

This is the simplest way to get data as a pandas dataframe. I create a variable called data to store the dataframe.

Data exploration.

Before we start using any method, it's always a good idea to do some data exploration. This way we gain some context about our dataset, and at the same time we can visually spot format errors or even missing values. To do this we will use the following methods: info(), describe() & head().

info() & describe() are the most common functions used for data exploration.

info( ):

data.info() will show a summary of all the columns of our dataframe, including the non-null record count and the data type of each one.
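The call itself is just:

    # Column names, non-null counts and dtypes for the whole dataframe
    data.info()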

If you pay attention to the info() output we can get some context. First of all, we now know that the dataframe has 37,843 rows, and looking at the "Non-Null Count" column we can conclude that some columns have missing values. Also, you can see in the Dtype (data type) column that some columns have the object type, which means those columns hold categorical/text information (a color or a shape, for example).

describe( ):

The describe() method is a bit more intimidating at first. I used the transposed view, obtained by adding ".T" after data.describe(), just because I feel more comfortable reading it this way. You can see it below, but I encourage you to also try it without the transpose and pick whichever is most comfortable for you.
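Something like this (the .T is optional):

    # Summary statistics; .T transposes so each dataframe column becomes a row
    data.describe().T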

Ok, let's go through this table column by column. As I said, it may seem intimidating but it's actually pretty simple: each row of the table corresponds to a column of the dataframe, and each column of the table gives a specific statistic about it.

The first column (count) shows the same information we saw with the .info() method: the number of non-null rows in each column. The second column (mean) gives the mean of the entire column. The std column shows the standard deviation, which is very useful when you're doing statistical analysis. The min & max columns tell you the minimum and maximum value of every column, and finally the 25%, 50% & 75% columns show the corresponding percentiles (the 50% one being the median).

There is another important comment to make here. As we saw with the info() method, some columns hold categorical information (object type); those columns will not appear in the describe() output because there is no numeric information to summarize.

head( ):

Last but not least, head(). This method shows the first 5 rows of all the columns by default, which is really useful: it's exactly how we see tables in Excel. The dataframe will be shown like this:
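In code, that's simply:

    # First 5 rows by default; pass a number to change that
    data.head()
    data.head(10)   # first 10 rows instead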

You can visualize a dataframe using data.head( ); by default, it shows the first 5 rows of all the columns (the output is just a small snap of the entire dataframe).

As I said, the head() function has a default value, which can be changed by passing an argument to the function (e.g., 10, or even fewer than 5!).

As a result, we can visually identify missing data and other errors. In the snap above we can see that the "guzzler" column is categorical (a string-valued column) and that it has some missing values. And that's it, we've covered the data exploration basics.

Missing values.

Ok, now let's get to work. Let's start with what is probably the most common reason for doing data cleaning: missing values. We saw in the last snap that there are some missing records; now let's see how prevalent missing values are in our data. We want a list of all the columns with NaNs and the number of missing records in each one.

Pandas provides another useful tool to check whether a record is a missing value or not: the isnull() method. It returns either True or False for each field of a dataframe. This is almost what we need; we can then use the sum() method to total up the number of True values per column, and finally use pandas filtering to show only those columns where the number of missing values is greater than zero. I know, this may sound super complex, but it's not.

First, we add up the missing values for every column by doing:
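Something along these lines:

    # Count the missing values in every column
    null_cols = data.isnull().sum()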

And then we use the following pandas filter:
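Spelled out, that filter is:

    # Keep only the columns that actually have missing values
    null_cols[null_cols > 0]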

This is very simple, but I'll explain it. The null_cols variable is actually a pandas Series with one entry per column of the original dataframe (83 in total). When we use square brackets after the name of a dataframe or Series we are applying a filter. This particular filter asks for just those entries with more than 0 null values, and the result is the list of columns with missing data and the number of missing values in each:

Ok, this brought us a new dilemma. We need to take time for reflection and introspection. What are we gonna do with the tens of thousands of missing records?

Usually, with real-life problems, you'll need to ask yourself whether those columns are significant for your analysis or not. For example, if the column 'c240bDscr', with 37,807 missing records, were a 'last name' column, we could simply delete it because of the little or no value it adds to the analysis.

For our purposes let's remove those columns with more than 10,000 null values. For this, we'll use the drop() function specifying the argument axis = 1, which indicates that we are dropping columns (axis = 0 is used for dropping rows).
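A sketch of that step (the drop_cols name is just my choice):

    # Names of the columns with more than 10,000 missing values
    drop_cols = list(null_cols[null_cols > 10000].index)

    # axis=1 -> drop columns; inplace=True -> modify data directly
    data.drop(drop_cols, axis=1, inplace=True)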

Again, we are using pandas filters, but here I am adding the .index attribute, which returns the names of the columns (if we wanted the number of missing values instead, we would use .values), and then we cast the returned values to a list. We also use the inplace parameter to drop the columns directly on the data dataframe; otherwise we would have to assign the result to a new variable. After all this we have just the following columns with missing values:
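For example, by re-running the same check:

    null_cols = data.isnull().sum()
    null_cols[null_cols > 0]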

As we can see, there are 2 columns with very similar null values: cylinders and displ. Maybe both columns have missing values for a similar reason. Let’s take a look.
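One way to take that look (the make and model column names are assumptions about this file):

    # Inspect a few of the rows where cylinders is missing
    data[data['cylinders'].isnull()][['make', 'model', 'fuelType', 'cylinders', 'displ']].head()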

Here is a very interesting insight hidden in the data. The records where cylinders and displ are null have a fuelType value of Electricity, which makes sense! The best option in this case is to replace the NaN values in both columns with zeros. Luckily, pandas has a function for this: fillna().
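For instance:

    # Replace the NaNs in both columns with 0
    data['cylinders'] = data['cylinders'].fillna(0)
    data['displ'] = data['displ'].fillna(0)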

Incorrect Values.

Additionally, sometimes there is data that makes no sense in its context. For example, in the previous section we learned that a vehicle without cylinders should not have displacement, and vice versa. Let's check whether any cases violate these rules.
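One possible check, assuming the NaNs were already filled with zeros as above:

    # Rows that have cylinders but no displacement, or displacement but no cylinders
    data[((data['cylinders'] == 0) & (data['displ'] != 0)) |
         ((data['displ'] == 0) & (data['cylinders'] != 0))]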

The returned record is a violation of that rule. You will surely face cases like this in the wild data of your company; my advice is always to ask for help from a team member with the expertise to give the right answer, and then correct all the violations or incorrect values you find.

Let's say that the chief automobile engineer tells you that the 1986 Mazda RX-7 has 4 cylinders. Then you can immediately correct the data using a loc statement:
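A sketch of that fix (the make, model and year column names are assumptions; adjust them to the actual file):

    # Set cylinders to 4 for the 1986 Mazda RX-7 records
    data.loc[(data['make'] == 'Mazda') &
             (data['model'] == 'RX-7') &
             (data['year'] == 1986), 'cylinders'] = 4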

Honestly, this last type of data cleaning is the most difficult to handle because it requires a good understanding of the data you are working with, and your first option should always be to fix it at the source, whether the error comes from the original database query, from the car registration platform, etc.

That's it for now. Next time we will review more advanced methods, like low-variance column analysis and how to deal with extreme values and outliers using the interquartile range method.

If you liked my content, let me know with a comment, and if you loved it, consider subscribing and following me on my social media: https://linktr.ee/theCarlos


Carlos Sanchez

Data Scientist passionate about data, innovation and technology. I love sharing knowledge, especially for complex problems with creative and funny solutions.