Data Management: Data Cleaning

Resources for managing your data

What is data cleaning?

Data cleaning is the process of correcting or removing inaccurate or non-uniform entries from a data set. Data can be cleaned manually, but more often a program is used to make the process easier and more efficient.

Several tools are available to make data cleaning more efficient. If you are comfortable with programming, data cleaning can also be done in SAS or R.

Software

Types of Data Cleaning

Split cells based on pattern matching of the contents. Every cell in a column can be split at the nth character or between specified words. The text before the split is placed in one column and the text after the split is placed in another column.

Cut removes characters that match a pattern from your data.

Extract portions of cells based on a pattern that you specify and place those portions in a new column.

Edit the values in a specific cell.

Fill copies values from adjacent cells that match the pattern you specify.

Delete rows or columns. (Drop)

Merge the values from separate columns into a single column.

Move the values from a row into the names of the columns. (Promote)

Convert columns into extra rows and vice versa. (Fold/Unfold)

Transpose the values using special rules.

Shift cells in any direction. (Translate)

Cluster cells that have similar entries so that you can make all cells in a cluster have the same text.

Create a column based on information that is looked up on the internet. (URL or Freebase)
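
Several of the operations above can be sketched in a few lines of code. The following is a minimal illustration in Python using only the standard library; it is not the behavior of any particular cleaning tool. The function names (split_cell, cut, extract, fill_down, fold, cluster) and the "variable"/"value" column names are hypothetical. fill_down assumes that "fill" means copying the value from the cell above into empty cells, and cluster assumes a simple matching rule that ignores case, punctuation, and word order.

```python
import re
from collections import Counter, defaultdict


def split_cell(value, pattern):
    """Split: divide one cell into two at the first match of a pattern."""
    match = re.search(pattern, value)
    if match is None:
        return value, ""          # nothing to split on: right column stays empty
    return value[:match.start()], value[match.end():]


def cut(value, pattern):
    """Cut: remove every run of characters that matches the pattern."""
    return re.sub(pattern, "", value)


def extract(value, pattern):
    """Extract: return the first matching portion, for use in a new column."""
    match = re.search(pattern, value)
    return match.group(0) if match else None


def fill_down(column):
    """Fill (assumed variant): copy the value above into empty cells."""
    filled, last = [], None
    for cell in column:
        if cell in ("", None):
            filled.append(last)
        else:
            last = cell
            filled.append(cell)
    return filled


def fold(rows, key, columns):
    """Fold: turn the named columns of each row into extra rows."""
    folded = []
    for row in rows:
        for col in columns:
            folded.append({key: row[key], "variable": col, "value": row[col]})
    return folded


def fingerprint(value):
    """Matching key (an assumption): lowercase, no punctuation, sorted words."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster(values):
    """Cluster: give every cell in a group the group's most common spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    canonical = {
        key: Counter(members).most_common(1)[0][0]
        for key, members in groups.items()
    }
    return [canonical[fingerprint(value)] for value in values]
```

For example, split_cell("Smith, Jane", r",\s*") returns ("Smith", "Jane"), and cut("(555) 123-4567", r"\D") returns "5551234567". Each function takes plain strings or lists, so the same logic can be applied to one column of a spreadsheet export at a time.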

Books