The importance and process of cleaning data
Once the data has been acquired, it will need to be cleaned. Frequently, the data will contain errors, duplicate entries, or inconsistencies. It often needs to be converted to a simpler format, such as plain text. Data cleaning is often referred to as data wrangling, reshaping, or munging; the terms are effectively synonymous.
When data is cleaned, there are several tasks that often need to be performed, including checking its validity, accuracy, completeness, consistency, and uniformity. For example, when the data is incomplete, it may be necessary to provide substitute values.
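As a minimal sketch of substituting values, the following loop fills in missing entries with a placeholder. The records array, its layout, and the "N/A" marker are all hypothetical:
String[][] records = {
    {"Smith", "23"},
    {"Jones", ""}        // the second field is missing
};
for (String[] record : records) {
    if (record[1].isEmpty()) {
        record[1] = "N/A";  // substitute a placeholder value
    }
}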
Consider CSV data. It can be handled in one of several ways. We can use simple Java techniques such as the String class' split method. In the following sequence, a string array, csvArray, is assumed to hold lines of comma-delimited data. The split method populates a second, two-dimensional array, tokenArray, in which each row holds the tokens of the corresponding line.
// Split each line of CSV text into its individual tokens
for (int i = 0; i < csvArray.length; i++) {
    tokenArray[i] = csvArray[i].split(",");
}
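As a usage sketch, the arrays might be declared as follows before the loop runs; the sample data is illustrative:
String[] csvArray = {"Smith,Peter,8475552222", "Jones,Mary,8475551111"};
String[][] tokenArray = new String[csvArray.length][];
// After the loop runs, tokenArray[0] holds {"Smith", "Peter", "8475552222"}
Note that split does not honor quoted fields containing embedded commas; a dedicated CSV library is a better choice for such data.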
More complex data formats require APIs to retrieve the data. For example, in Chapter 3, Data Cleaning, we will use the Jackson Project (https://github.com/FasterXML/jackson) to retrieve fields from a JSON file. The example uses a file containing a JSON-formatted representation of a person, as shown next:
{
    "firstname":"Smith",
    "lastname":"Peter",
    "phone":8475552222,
    "address":["100 Main Street","Corpus","Oklahoma"]
}
The code sequence that follows shows how to extract the values of a person's fields. A parser is created, which uses the getCurrentName method to retrieve a field name. If the name is firstname, then the getText method returns the value for that field. The other fields are handled in a similar manner.
try {
    // Create a streaming parser for the JSON file
    JsonFactory jsonfactory = new JsonFactory();
    JsonParser parser = jsonfactory.createParser(new File("Person.json"));
    // Process tokens until the end of the object is reached
    while (parser.nextToken() != JsonToken.END_OBJECT) {
        String token = parser.getCurrentName();
        if ("firstname".equals(token)) {
            // Advance to the field's value and read it as text
            parser.nextToken();
            String fname = parser.getText();
            out.println("firstname : " + fname);
        }
        // ... the other fields are handled in a similar manner
    }
    parser.close();
} catch (IOException ex) {
    // Handle exceptions
}
The output of this example is as follows:
firstname : Smith
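Note that JsonParser is a streaming parser: it visits the document one token at a time rather than loading it into memory all at once, which makes this approach suitable for large files.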
Simple data cleaning may involve converting the text to lowercase, replacing certain text with blanks, and replacing runs of whitespace characters with a single blank. One way of doing this is shown next, where a combination of the String class' toLowerCase, replaceAll, and trim methods is used. Here, a string containing dirty text is processed:
dirtyText = dirtyText
    .toLowerCase()
    // Replace digits and punctuation with a blank
    .replaceAll("[\\d[^\\w\\s]]+", " ")
    .trim();
// Collapse any remaining double blanks into single blanks
while (dirtyText.contains("  ")) {
    dirtyText = dirtyText.replaceAll("  ", " ");
}
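As an illustration, the following made-up string would be reduced to lowercase words separated by single blanks; the digits and punctuation are replaced and the double blanks collapsed:
String dirtyText = "Call me Ishmael,  some 15 years ago!";
// After the cleaning sequence above, dirtyText holds:
// "call me ishmael some years ago"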
Stop words are common words, such as "the", "and", or "but", that do not always contribute to the analysis of text. Removing these stop words can often improve the results and speed up processing.
The LingPipe API can be used to remove stop words. In the next code sequence, a TokenizerFactory class instance is used to tokenize the text. Tokenization is the process of breaking text into individual words, or tokens. The EnglishStopTokenizerFactory class wraps another tokenizer factory and removes common English stop words.
text = text.toLowerCase().trim();
// Start with a basic Indo-European tokenizer
TokenizerFactory fact = IndoEuropeanTokenizerFactory.INSTANCE;
// Wrap it so that English stop words are filtered out
fact = new EnglishStopTokenizerFactory(fact);
Tokenizer tok = fact.tokenizer(text.toCharArray(), 0, text.length());
for (String word : tok) {
    out.print(word + " ");
}
Consider the following text, which is drawn from the book Moby Dick:
Call me Ishmael. Some years ago- never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.
The output will be as follows:
call me ishmael . years ago - never mind how long precisely - having little money my purse , nothing particular interest me shore , i thought i sail little see watery part world .
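Note that punctuation marks survive in the output because the stop word filter removes only the words in its stop list; the punctuation could be stripped beforehand using the cleaning techniques shown earlier.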
These are just a couple of the data cleaning tasks discussed in Chapter 3, Data Cleaning.