Data Cleaning

Data cleaning is the process of modifying data using processes such as:

  • selection

  • correction

  • alteration

  • rearrangement

  • regrouping

  • removal

for purposes such as:

  • consistency with other data

  • consistency with an intended use

  • elimination of incorrect data

  • elimination of unwanted data

Benefits of Data Cleaning using SQL Code over Python Pandas Code

Very Large Datasets Handling

The size of datasets Python Pandas can handle depends on the memory of the processor. SQL code is designed to handle datasets of any size.

Ease of Handling Simple Query Logic

If only simple data cleaning logic is being applied, SQL code can be simpler than Python Pandas code.

Benefits of Data Cleaning using Python Pandas Code over SQL Code

In general, simple data cleaning operations might be more easily handled using SQL code. As data cleaning complexity moves from simple to increasing complexity, the advantages of using Python Pandas becomes significant.

Logic Implementation

Python Pandas has significantly more capabilities and flexibility in implementing code cleaning logic than SQL code. As SQL code logic grows in complexity, it becomes unwieldy and difficult to debug and maintain.

Code Reflection Capability

Python Pandas code has the capability for reflection, that is, to modify itself depending on execution time requirements.

Programming Constructs

The Python language has a very large number of programming constructs compared to SQL.

Code Modification and Testing

Code logic can be compartmentalized and documented for relatively easy modification and testing compared to SQL.

Python Pandas DataFrame Functions

Using Python Pandas is one effective way to clean data. It provides a large set of data manipulation functions.

Below are select Pandas DataFrame functions useful for data cleaning:

  • abs( ) - returns absolute numeric value of each element

  • add( ) - adds additional data

  • add_prefix( ) - adds string prefix

  • add_suffix( ) - adds string suffix

  • aggregate( ) - aggregates data

  • align( ) - aligns two objects with a specified join method

  • all( ) - returns whether all elements are True

  • any( ) - returns whether any element is True

  • append( ) - appends rows of other to the end of caller, returning a new object

  • apply( ) - applies a function along an axis (columns or rows) of the DataFrame

  • applymap( ) - applies a function to DataFrame elements

  • asfreq( ) - converts a time series to a specified frequency.

  • assign( ) - assigns new columns to a DataFrame.

  • astype( ) - casts an object to a specified data type

  • at_time( ) - selects values at particular time of day

  • between_time( ) - selects values between particular times of the day

  • bool( ) - returns the boolean of a single element

  • clip( ) - trim values at a given threshold

  • columns( ) - the column names of the DataFrame

  • combine( ) - performs column-wise combine with another DataFrame

  • combine_first( ) - updates null elements with value in the same location in other

  • copy( ) - makes a copy of an object’s indices and data

  • count( ) - counts non-NA cells

  • cummax( ) - return a cumulative maximum over a DataFrame or Series axis

  • cummin( ) - returns a cumulative minimum over a DataFrame or Series axis

  • cumprod( ) - returns a cumulative product over a DataFrame or Series axis

  • cumsum( ) - returns a cumulative sum over a DataFrame or Series axis.

  • describe( ) - generates descriptive statistics

  • diff( ) - first discrete difference of element

  • divide( ) - gets Floating division of DataFrame and other elements

  • drop( ) - drops specified labels from rows or columns

  • drop_duplicates( ) - returns DataFrame with duplicate rows removed

  • droplevel( ) - returns DataFrame with requested index / column level(s) removed

  • dropna( ) - removes missing values

  • duplicated( ) - returns boolean Series denoting duplicate rows

  • eq( ) - gets Equal to of DataFrame and other, element-wise

  • equals( ) - test whether two objects contain the same elements

  • explode( ) - transforms each element of a list-like to a row, replicating index values

  • fillna( ) - fills NA/NaN values using the specified method

  • filter( ) - subsets the DataFrame rows or columns according to the specified index labels

  • first_valid_index( ) - returns index for first non-NA/null value

  • from_dict( ) - constructs a DataFrame from a dictionary

  • from_records( ) - converts records to a DataFrame

  • get( ) - gets an item from object for given key

  • groupby( ) - groups DataFrame using a mapper or by a series of columns

  • gt( ) - gets greater than of a DataFrame

  • head( ) - returns the first n rows

  • idxmax( ) - returns the index of first occurrence of maximum over requested axis

  • idxmin( ) - returns the index of first occurrence of minimum over requested axis

  • infer_objects( ) - attempts to infer better types for object columns

  • insert( ) - inserts a column into DataFrame at specified location

  • interpolate( ) - interpolates values according to specified methods

  • isin( ) - determines whether each element in the DataFrame is contained in specified values

  • isna( ) - detects missing values.

  • isnull( ) - detects null values

  • items( ) - iterates over (column name, Series) pairs

  • iteritems( ) - iterates over (column name, Series) pairs.

  • iterrows( ) - iterates over DataFrame rows as (index, Series) pairs

  • itertuples( ) - iterates over DataFrame rows as named tuples

  • join( ) - joins columns

  • last( ) - subsets final periods of time series data based on a date offset

  • last_valid_index( ) - returns the index for last non-NA/null value

  • le( ) - gets less than or equal to of DataFrame

  • loc( ) - selects a range of rows

  • lookup( ) - label-based indexing function for DataFrames

  • lt( ) - gets less than of DataFrame

  • mad( ) - returns the mean absolute deviation of values

  • mask( ) - replaces values where a specified condition is True

  • max( ) - returns the maximum of the values for specified data

  • mean( ) - returns the mean of the values for specified data

  • median( ) - returns the median of the values for specified data

  • memory_usage( ) - returns the memory usage of each column in bytes

  • merge( ) - merges DataFrame or named Series objects with a database-style join

  • min( ) - returns the minimum of the values for specified data

  • mode( ) - gets the mode(s) of each element for specified data.

  • multiply( ) - gets multiplication of DataFrame and other data

  • ne( ) - gets not equal to of DataFrame and other data

  • nlargest( ) - returns the first n rows ordered by columns in descending order

  • notna( ) - detects existing (non-missing) values

  • notnull( ) - detects existing (non-missing) values.

  • nsmallest( ) - returns the first n rows ordered by columns in ascending order

  • nunique( ) - counts distinct observations over specified data

  • pct_change( ) - percentage change between the current and a prior element

  • pivot_table( ) - creates a spreadsheet-style pivot table as a DataFrame

  • pop( ) - returns item and drops it from DataFrame

  • prod( ) - returns the product of the values for the specified data

  • product( ) - returns the product of the values for the specified data

  • query( ) - queries the columns of a DataFrame with a boolean expression

  • radd( ) - gets addition of DataFrame and specified data

  • read_gbq( ) - reads Google BigQuery data into a DataFrame

  • read_json( ) - reads JSON data into a DataFrame

  • read_pickle( ) - reads pickled file data into a DataFrame

  • read_sql_query( ) - reads an SQL DB table into a DataFrame using an SQL query

  • read_sql_table( ) - reads an SQL DB table into a DataFrame

  • rdiv( ) - gets floating point division of DataFrame and other specified data

  • reindex( ) - conforms DataFrame to new index with optional filling logic

  • reindex_like( ) - returns an object with matching indices as other object

  • rename( ) - alters axes labels

  • rename_axis( ) - sets the name of the axis for the index or columns

  • replace( ) - replaces values with new values in a DataFrame

  • resample( ) - resamples time-series data

  • reset_index( ) - resets an index

  • rfloordiv( ) - gets Integer division of a DataFrame and specified data

  • rmod( ) - gets modulo of a DataFrame and specified data

  • rmul( ) - gets Multiplication of a DataFrame and specified

  • round( ) - rounds a DataFrame to a variable number of decimal places

  • rsub( ) - gets subtraction of a DataFrame and specified data

  • rtruediv( ) - gets floating point division of a DataFrame and specified data

  • sample( ) - returns a random sample of selected DataFrame elements

  • select_dtypes( ) - returns a subset of the DataFrame’s columns based on column types

  • sem( ) - returns the unbiased standard error of the mean over specified DataFrame data

  • set_axis( ) - assigns a desired index to a given DataFrame axis

  • set_index( ) - sets a DataFrame index using existing columns

  • shift( ) - shifts an index by a specified number of periods

  • size( ) - returns an int representing the number of elements in the object

  • skew( ) - returns an unbiased skew over a requested DataFrame axis

  • sort_index( ) - sorts by labels along a DataFrame axis

  • sort_values( ) - sorts by values along a DataFrame axis

  • squeeze( ) - squeezes a 1 dimensional axis objects into scalars

  • std( ) - returns sample standard deviation over a specified DataFrame axis

  • sub( ) - gets subtraction of a DataFrame and specified data

  • subtract( ) - gets subtraction of a DataFrame and element-wise data (binary operator sub)

  • sum( ) - returns the sum of the values for the specified axis

  • tail( ) - returns the last n rows of data

  • take( ) - returns the elements in the given positional indices along an axis

  • to_clipboard( ) - copies objects to the system clipboard

  • to_csv( ) - writes object to a comma-separated values (csv) file

  • to_datetime( ) - converts fields to datetime based on supplied specifications

  • to_dict( ) - converts a DataFrame to a dictionary

  • to_excel( ) - writes objects to an Excel spreadsheet

  • to_gbq( ) - writes a DataFrame to a Google BigQuery table

  • to_hdf( ) - writes data to an HDF5 file using HDFStore

  • to_html( ) - renders a DataFrame as an HTML table

  • to_json( ) - converts an object to a JSON string

  • to_latex( ) - renders an object to a LaTeX object

  • to_markdown( ) - prints DataFrame in Markdown-friendly format

  • to_numpy( ) - converts a DataFrame to a NumPy array

  • to_parquet( ) - writes a DataFrame to the binary parquet format

  • to_period( ) - converts a DataFrame from DatetimeIndex to PeriodIndex

  • to_pickle( ) - writes an object to Pickle (serialized) object file

  • to_records( ) - converts a DataFrame to a NumPy record array

  • to_sql( ) - writes records stored in a DataFrame to a SQL database

  • to_stata( ) - exports a DataFrame object to Stata (statistics) dta format

  • to_string( ) - renders a DataFrame to a console-friendly tabular output

  • to_timestamp( ) - casts to DatetimeIndex of timestamps, at beginning of period

  • to_xarray( ) - returns an xarray (multi-dimensional) object from the pandas object

  • transform( ) - produces a DataFrame using a transform function

  • transpose( ) - transposes index and columns

  • truediv( ) - gets floating point division of a DataFrame and specified element-wise data

  • truncate( ) - truncates a Series or DataFrame before and after a specified index value

  • tshift( ) - shift a time index

  • tz_convert( ) - converts tz-aware axis to target time zone

  • tz_localize( ) - localizes a tz-naive index of a Series or DataFrame to target time zone

  • update( ) - modifies in place using non-NA values from another DataFrame

  • value_counts( ) - returns a Series containing counts of unique rows in the DataFrame

  • var( ) - returns unbiased variance over the specified axis

  • where( ) - replaces values where the condition is False

Python Example using Pandas

To download the code below click here.

"""
data_cleaning_using_pandas.py
provides examples of using a Python Pandas DataFrame and its methods
"""

# Import needed libraries.
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import pprint

# Set parameters.
selected_row_count = 20

# Load test data.
iris_data = load_iris()

# Create a DataFrame from the test data.
data_frame = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
data_frame_subset = data_frame.head(selected_row_count)

# Print the data.
print("Data Subset of First " + str(selected_row_count) + " Rows")
print(data_frame_subset.to_string())

# Get a count of non-NA cells.
non_na_cell_count = data_frame_subset.count()

# Print the count.
print("Non Empty Cell count:")
print(non_na_cell_count)

# Get the mean of cell rows.
mean_of_rows = data_frame_subset.mean(axis=1)

# Print the means.
print("Means of Rows:")
print(mean_of_rows)

# Get the mean of cell columns.
mean_of_columns = data_frame_subset.mean(axis=0)

# Print the means.
print("Means of Columns:")
print(mean_of_columns)

# Round values to 0 decimal places.
data_frame_with_rounded_values = data_frame_subset.round(decimals=0)

# Print the DataFrame.
print("Data Subset with Rounded Values:")
print(data_frame_with_rounded_values.to_string())

# Drop duplicate rows.
data_frame_rounded_without_duplicates = data_frame_with_rounded_values.drop_duplicates()

# Print the means.
print("Data Subset with Rounded Values without Duplicates:")
print(data_frame_rounded_without_duplicates.to_string())

Output is below:
Data Subset of First 20 Rows
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
5                 5.4               3.9                1.7               0.4
6                 4.6               3.4                1.4               0.3
7                 5.0               3.4                1.5               0.2
8                 4.4               2.9                1.4               0.2
9                 4.9               3.1                1.5               0.1
10                5.4               3.7                1.5               0.2
11                4.8               3.4                1.6               0.2
12                4.8               3.0                1.4               0.1
13                4.3               3.0                1.1               0.1
14                5.8               4.0                1.2               0.2
15                5.7               4.4                1.5               0.4
16                5.4               3.9                1.3               0.4
17                5.1               3.5                1.4               0.3
18                5.7               3.8                1.7               0.3
19                5.1               3.8                1.5               0.3
Non Empty Cell count:
sepal length (cm)    20
sepal width (cm)     20
petal length (cm)    20
petal width (cm)     20
dtype: int64
Means of Rows:
0     2.550
1     2.375
2     2.350
3     2.350
4     2.550
5     2.850
6     2.425
7     2.525
8     2.225
9     2.400
10    2.700
11    2.500
12    2.325
13    2.125
14    2.800
15    3.000
16    2.750
17    2.575
18    2.875
19    2.675
dtype: float64
Means of Columns:
sepal length (cm)    5.035
sepal width (cm)     3.480
petal length (cm)    1.435
petal width (cm)     0.235
dtype: float64
Data Subset with Rounded Values:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
2                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
4                 5.0               4.0                1.0               0.0
5                 5.0               4.0                2.0               0.0
6                 5.0               3.0                1.0               0.0
7                 5.0               3.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
9                 5.0               3.0                2.0               0.0
10                5.0               4.0                2.0               0.0
11                5.0               3.0                2.0               0.0
12                5.0               3.0                1.0               0.0
13                4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0
16                5.0               4.0                1.0               0.0
17                5.0               4.0                1.0               0.0
18                6.0               4.0                2.0               0.0
19                5.0               4.0                2.0               0.0
Data Subset with Rounded Values without Duplicates:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
5                 5.0               4.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0