“Garbage in, garbage out” — a commonly used phrase when it comes to data-handling.
In our previous articles, we covered a few algorithms, viz., Logistic Regression and K-Nearest Neighbours (KNN), delving into the pros and cons of each model along with the situations where applying them would yield the best results.
While choosing and applying the right model has a significant impact on accuracy, having a clean and reliable dataset contributes even more to it.
In this article, we will dive deep into various data pre-processing techniques and understand their necessity in different situations.
When we look into a dataset, we often find discrepancies in it. These could be in the form of missing values, duplicate entries, incompatible formats, etc. We need to rectify these issues, and this process of resolving such discrepancies is called data cleaning.
Missing values are assigned NaN (Not a Number) in a dataset by default. However, not all NaN values indicate missing data; in some cases, NaN itself carries information. Therefore, to locate the missing values, we first count the number of NaN entries in every column. The command to find the NaN count in every column is:
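A minimal sketch with pandas (the small DataFrame here is made up purely for illustration):

```python
import pandas as pd

# Hypothetical dataset with a few missing entries
df = pd.DataFrame({"age": [24, None, 31], "city": ["Pune", "Delhi", None]})

# Count the NaN values in every column
nan_counts = df.isnull().sum()
print(nan_counts)
```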
Using this command will result in a table that shows the number of NaN values present in each column. After finding out the number of NaN present in every column, we need to understand the column and the various values in it.
As a rule of thumb :
1. Find the percentage of NaN values in each column. If it is more than 25–30%, look into the column and determine whether NaN is used as data or as a missing value. The code segment needed to find the percentage of NaN values in every column is:
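One way to compute this in pandas (the sample data is invented for the sketch):

```python
import pandas as pd

df = pd.DataFrame({"age": [24, None, 31, None], "city": ["Pune", "Delhi", None, "Goa"]})

# Percentage of NaN values in each column
nan_percent = df.isnull().sum() / len(df) * 100
print(nan_percent)
```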
2. If you think it is used as a missing value, find the correlation between the target column (the value you are trying to predict) and the column you are assessing. If the correlation is high, keep the column; otherwise, remove it. The code segment needed to find the correlation of each column with every other column in the dataset is:
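A sketch using pandas' built-in correlation matrix (the toy data below is constructed so the two columns are perfectly correlated):

```python
import pandas as pd

df = pd.DataFrame({
    "feature": [1, 2, 3, 4, 5],
    "target":  [2, 4, 6, 8, 10],  # perfectly correlated with "feature"
})

# Pairwise correlation of every numeric column with every other
corr_matrix = df.corr()
print(corr_matrix["target"])
```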
Noisy Data :
Noise is defined as the instance that obscures the relationship between attributes and class. Noise can be classified into 2 types:
(i) Class noise, (ii) Attribute Noise
Noisy data can be present in our dataset as a result of improper data entry, which leads to contradictory examples, erroneous values, etc. Let’s look into how to solve these problems :
1) Binning :
In this method, the whole dataset is first sorted, and the sorted values are then split into bins of a specific size. After splitting, the data inside each bin is smoothed using one of the three methods listed below :
- Smoothing by bin means: Each value in a bin is replaced by the mean value of the bin
- Smoothing by bin median: Each value in a bin is replaced by the median value of the bin
- Smoothing by bin boundary: Each value in a bin is replaced by the boundary value it is closest to (max and min value of bin are identified as bin boundaries)
Binning results in a smoothing effect on the input data and may also reduce the chances of overfitting in the case of small datasets. This method is mainly used to minimize the effects of small observation errors.
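As a sketch, smoothing by bin means can be done with pandas' `qcut` to form equal-frequency bins and a group-wise mean to replace each value (the sorted sample values are a classic textbook-style example, not real data):

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26])  # already sorted

# Split into 3 equal-frequency bins, then replace each value with its bin's mean
bins = pd.qcut(values, q=3, labels=False)
smoothed = values.groupby(bins).transform("mean")
print(smoothed.tolist())
```

Swapping `"mean"` for `"median"` gives smoothing by bin medians.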
2) Regression :
Regression can be defined as a data mining technique that is generally used to predict a range of continuous values (which can also be called “numeric values”) in a specific dataset. While we have used regression models like linear regression and logistic regression for prediction, we can also fit those models on the dataset and treat points that deviate strongly from the fitted curve as outliers.
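One simple way to sketch this idea: fit a straight line with least squares and flag points whose residual is unusually large. The data and the 2-standard-deviation threshold below are illustrative choices, not a fixed rule:

```python
import numpy as np

# Hypothetical data: y is roughly 2x, with one corrupted point at the end
x = np.arange(1, 11, dtype=float)
y = 2 * x
y[9] = 60.0  # inject an outlier

# Fit a simple linear regression with least squares
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Flag points whose residual lies far from the rest (> 2 standard deviations)
outliers = np.abs(residuals - residuals.mean()) > 2 * residuals.std()
print(np.where(outliers)[0])
```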
3) Clustering :
Similar to regression, clustering can also be used to detect outliers in the dataset. We can apply clustering methods such as K-means and flag points that do not fit well into any cluster as outliers.
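A sketch of one common variant: over-provision the cluster count so that an isolated point ends up in its own tiny cluster, then treat single-member clusters as outliers. The 2-D points and the cluster count are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two tight clusters plus one far-away point
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2],
              [50.0, 50.0]])  # the last row is an outlier

# Over-provision the cluster count so isolated points get their own cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Clusters with only a single member are treated as outliers
labels, counts = np.unique(kmeans.labels_, return_counts=True)
outlier_labels = labels[counts == 1]
outliers = np.isin(kmeans.labels_, outlier_labels)
print(np.where(outliers)[0])
```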
Data Transformation (Feature Engineering)
Feature engineering is the technique of improving machine learning model performance by transforming original features into new, more predictive ones. This helps our model provide more accurate results. Various activities fall under the umbrella of feature engineering.
1) Normalisation :
This operation changes the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information. Normalization proves to be of significant importance when models like KNN are applied, as they rely on distance measurements. Therefore, normalizing the data is, in general, always useful.
There are a few methods of normalization generally followed :
- Min-max normalization
Code snippet for Min-max normalization :
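A minimal pandas sketch (the `age` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 40, 60]})  # hypothetical numeric column

# Min-max normalization: rescale values into the range [0, 1]
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df["age_scaled"].tolist())
```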
- Z-score normalization
Code snippet for Z-score normalization :
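The same hypothetical column, centred and scaled (a sketch, not a fixed API):

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 40, 60]})  # hypothetical numeric column

# Z-score normalization: centre on the mean, scale by the standard deviation
df["age_z"] = (df["age"] - df["age"].mean()) / df["age"].std()
print(df["age_z"].tolist())
```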
- Mean normalization
Code snippet for Mean normalization :
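Mean normalization centres on the mean but scales by the value range instead of the standard deviation; a sketch on the same made-up column:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 25, 40, 60]})  # hypothetical numeric column

# Mean normalization: centre on the mean, scale by the range (max - min)
df["age_norm"] = (df["age"] - df["age"].mean()) / (df["age"].max() - df["age"].min())
print(df["age_norm"].tolist())
```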
2) Attribute Selection :
Attribute Selection is the task of choosing a small subset of features/attributes that is sufficient to predict the target labels well. While attribute selection and dimensionality reduction share the aim of reducing the number of attributes, attribute selection restricts itself to choosing a subset of the given attributes, whereas dimensionality reduction may create synthetic features.
3) Discretization :
Discretization is the process through which we can transform continuous variables, models, or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of our desired variable/model/function.
There are various types of discretization :
- Equal Width Discretizer: All bins in each feature have identical widths.
Code Snippet :
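A sketch using scikit-learn's `KBinsDiscretizer` with the `uniform` strategy (the 1-D feature values are made up):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])  # hypothetical 1-D feature

# Equal-width discretization: every bin spans the same value range
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
X_binned = disc.fit_transform(X)
print(X_binned.ravel())
```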
- Equal Frequency Discretizer: All bins in each feature have the same number of points.
Code Snippet :
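The same estimator with the `quantile` strategy; with two bins, the split falls at the median of the (invented) data:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [9.0], [10.0]])  # hypothetical 1-D feature

# Equal-frequency discretization: every bin holds the same number of points
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)
print(X_binned.ravel())
```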
- K-Means Discretizer: Values in each bin have the same nearest center of a 1D k-means cluster.
Code Snippet :
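And with the `kmeans` strategy, where bin edges come from a 1-D k-means clustering of the (made-up) values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])  # hypothetical 1-D feature

# K-means discretization: each bin groups values around a 1-D k-means centre
disc = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="kmeans")
X_binned = disc.fit_transform(X)
print(X_binned.ravel())
```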
Discretization is mainly done to reduce the noise in data. By discretizing continuous variables into small bins, we ‘smooth’ the data by removing small fluctuations. This can make a significant difference in the accuracy of the model. While discretization and binning look very similar, the difference lies in how the values inside the bins are dealt with.
4) Concept Hierarchy Generation:
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
Consider a concept hierarchy for the column location. City values for a location can be mapped to the province or state to which it belongs. The provinces and states can in turn be mapped to the country. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
There are multiple ways to do this :
- Partial ordering manually :
In the case of the location example discussed above, we already know the order of the hierarchy, so it can be ordered manually.
- Partial ordering by data grouping:
In places where numerical data doesn’t convey as much information as expected, you can group that data and classify it in other forms while maintaining its correlation with the target. For example, for a retail dataset containing product prices, you can label products as expensive, cheap, etc. based on a scale.
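A sketch of such a grouping with pandas' `cut`; the prices, bin edges, and labels are all invented for illustration:

```python
import pandas as pd

prices = pd.Series([5, 20, 90, 250, 700])  # hypothetical product prices

# Map numeric prices onto an ordered concept hierarchy
labels = ["cheap", "moderate", "expensive"]
price_level = pd.cut(prices, bins=[0, 50, 300, 1000], labels=labels)
print(price_level.tolist())
```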
Concept hierarchy is mainly useful when it comes to exploratory data analysis where understanding the significance of each column is a major factor.
5) Categorical Encoding :
So far, we have dealt with converting numerical data to categorical data and making numerical data more suitable for the model to use when predicting the target variable. Now we will look into converting categorical data to numerical data. This is done because not all models can take categorical data as input. There are several methods through which you can do this :
- Using map function:
In this method, you first find the various values that a column contains, after which you write a map function to map those categorical values to numerical values.
Code Snippet :
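A minimal sketch with pandas' `Series.map` (the `size` column and its mapping are made up):

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})  # hypothetical column

# Map each categorical value to a chosen numerical value
size_map = {"S": 0, "M": 1, "L": 2}
df["size_encoded"] = df["size"].map(size_map)
print(df["size_encoded"].tolist())
```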
- Label Encoding
In this method, we change the column's data type from ‘object’ to ‘category’, following which we create a new column that contains the encoded data. If we overwrote the original column with the encoded data, we would no longer be able to recover the actual category each value belongs to.
Code Snippet :
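A sketch using pandas category codes (the `city` column is invented; note `cat.codes` numbers the categories in sorted order):

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})

# Change the column's dtype from 'object' to 'category'
df["city"] = df["city"].astype("category")

# Store the codes in a new column so the original categories stay recoverable
df["city_encoded"] = df["city"].cat.codes
print(df["city_encoded"].tolist())
```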
- One-Hot Encoding
In this method, we replace the existing column with new columns, one per unique value, each holding 0 or 1 to indicate the presence of that value in a row. This method is used because, in the case of label encoding, the numerical values assigned to the unique entries can be misinterpreted by algorithms as having an order. The disadvantage of this method is that it adds more columns to the dataset.
Code Snippet :
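A sketch with pandas' `get_dummies` (the `color` column is made up; the generated indicator columns are named `color_<value>`):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Replace the column with one indicator (0/1) column per unique value
df_encoded = pd.get_dummies(df, columns=["color"])
print(list(df_encoded.columns))
```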
Let’s say a column named ‘col_i’ has N_i unique values. Then the number of columns in the dataset after one-hot encoding is N_1 + N_2 + … + N_n.
As the size of datasets increases, it becomes difficult to come up with reliable solutions. Large datasets bring various problems, such as storage costs, missing values, and mixed data types. The process of decreasing the amount of storage space required is known as data reduction. Data reduction can increase storage efficiency and reduce costs.
Techniques for Data Reduction:
1) Data Cube Aggregation:
This technique is used to aggregate data in a simpler form.
For example, consider data collected for a study from 2012 to 2014 that contains your company’s sales every three months. If your task is based on yearly sales rather than quarterly figures, you can summarise the data so that it highlights total sales per year instead of per quarter. This is data aggregation.
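The yearly roll-up above can be sketched with a pandas group-by; the sales figures are invented for the example:

```python
import pandas as pd

# Hypothetical quarterly sales for 2012-2014
sales = pd.DataFrame({
    "year":    [2012] * 4 + [2013] * 4 + [2014] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [10, 12, 11, 13, 14, 15, 16, 15, 18, 17, 19, 20],
})

# Aggregate quarterly figures up to yearly totals
yearly = sales.groupby("year")["sales"].sum()
print(yearly.to_dict())
```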
2) Attribute Subset Selection:
Attribute subset selection reduces the data by removing irrelevant or redundant attributes. Closely related is attribute construction, where new features are built from the given ones; for example, features like gender and student can be combined to form a feature male student/female student.
3) Dimensionality reduction:
Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset. High-dimensionality statistics and dimensionality reduction techniques are also used for data visualization.
4) Numerosity Reduction:
There are two types of methods for numerosity reduction:
- Parametric methods: These methods assume a model for the data (such as a regression model) and store only the model parameters rather than the full data.
- Non Parametric methods: These methods allow storage of reduced representation through Histograms, Data sampling, and Data cube aggregation.
Exploratory Data Analysis(EDA):
Exploratory Data Analysis is the approach used for analyzing datasets and summarizing the main characteristics using statistical graphics and data visualization techniques. EDA was originally developed by John Tukey in the 1970s.
The primary goal of EDA is to help in the analysis of data before making any assumptions. It can aid in the detection of evident errors and a better understanding of data patterns, the detection of outliers or unusual events, and identifying relationships between variables.
Types of EDA:
1) Univariate non-graphical:
Analyzing data consisting of only one variable. It basically focuses on finding patterns within the data.
2) Univariate graphical:
Graphical methods provide a better insight into the data as compared to non-graphical methods. Common methods are:
- Stem-and-leaf plots: A method for showing the frequency with which certain classes of values occur.
- Histograms: A histogram provides a visual interpretation of numerical data by showing the number of data points that fall within a specified range of values (called “bins”).
- Box Plots: A box and whisker plot, also called a box plot, displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.
a. Minimum: the lowest data point excluding outliers (the lower whisker, often drawn at Q1 - 1.5*IQR)
b. First quartile (Q1): the central point that falls between the smallest value of the dataset and the median
c. Median: the middle value of the dataset
d. Third quartile (Q3): the central point that lies between the median and the highest value of the distribution
e. Maximum: the highest data point excluding outliers (the upper whisker, often drawn at Q3 + 1.5*IQR)
The Box Plot is a very useful tool when showing a statistical distribution.
3) Multivariate non-graphical:
This EDA technique is used for multiple variables. It uses cross-tabulation and statistics to find relationships within variables.
4) Multivariate graphical:
Graphics are used to demonstrate relationships between two or more sets of data in multivariate data. Common techniques are:
- Bar plots: A bar chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data. We can see which groups are highest or most common, and how the groups compare against each other.
- Scatter plots: Scatter plots’ primary uses are to observe and show relationships between two numeric variables. A scatter plot can also be useful for identifying patterns in data.
- Multivariate charts and Run charts: These are used to study collected data for trends or patterns over a specific period.
- Bubble charts: A bubble chart, like a scatter plot, is used to illustrate and show correlations between numeric data. The addition of marker size as a dimension enables comparison of three variables rather than just two.
- Heatmaps: Heat maps use color coding to represent values. They are primarily used to show the density of features/events within a dataset and to guide users to the most important parts of a visualization.
Data preprocessing and Exploratory Data Analysis are essential tasks for any data science project. They are distinct terms but have many overlapping subtasks that are used interchangeably.
Having the highest accuracy in prediction or classification is our ultimate goal. To achieve that, we have to understand how to make sure the data we provide to our model is accurate and compatible with the model that we are deploying. This is where Data preprocessing and Exploratory Data Analysis comes in.
Having gone through this article, you should have gained an overall understanding of where and how to employ different methods to make the dataset more accurate and interpretable. Now you have a powerful tool in your hands.
So, revise, code, and watch out for our next article!