How to Replace Ambiguous Values

Recently I had a friend tell me they were working on a project where they had to hand scrub the data in their database. In particular that day they were manually scrubbing values like 'Not Applicable', 'NA' and 'N/A' which all meant the same thing. But in order for the data to be consistent they need to go in by hand and change all these values to be the same. Now I want to say they were not working on SQL Server, in fact they were not even getting to work in a relational database at all, it was some obscure Linux based database with a very expensive support contract.

I told them they could use one of several methods in my world to remedy their data real quick. The first is a simple SQL Update statement, which in all fairness is all they really needed, if they were working in a real database. the next option I had was to use a Regular Expression to replace the values and restore them. This option could have been viable for them because to actually work with the data they had to export to a delimited file anyway (don't even ask).

So to use a SQL Update statement is it pretty easy, update the value of the field to the desired value where it equals the other possible values. It is a good idea first to see what possible values are available by running a select distinct on the column, sorting alphabetically.

select distinct Contact_Status from Customer

distinct Contact_Status from Customer

order by Contact_Status ASC

by Contact_Status ASC

This way you know you will have a full list of possible values that need to be updated. Now we can create and execute the replacing Update statement.

Update Customer

Customer

Set Contact_Status = 'N/A'

Contact_Status = 'N/A'

Where Contact_Status = 'Not Applicable' or Contact_Status = 'NA'

Contact_Status = 'Not Applicable' or Contact_Status = 'NA'

This will do the trick for a standard SQL database or Access. But what if we have a delimited file, how do you do the same thing. Well it is pretty easy to run through the rows of the file, examining a field and then updating that field.

Public

Sub

UpdateFields(

ByVal

sFile

As

String

,   

ByVal

sTargetValue

As

String

,

ByVal

sBadValues()

As

String

)

 

       

If

sBadValues.Length = 0

Then

           

Return

       

End

If

 

       

Dim

i

As

Integer

= 0

       

Dim

sRegEx

As

String

=

String

.Empty

 

       

For

i = 0

To

sBadValues.Length - 1

           

If

i > 0

Then

                sRegEx = sRegEx &

String

.Format(

'|\b{0}\b'

, sBadValues(i))

           

Else

                sRegEx = sRegEx &

String

.Format(

'\b{0}\b'

, sBadValues(i))

           

End

If

       

Next

 

       

Dim

_Regex

As

New

Regex(sRegEx)

 

       

Dim

line

As

String

       

Dim

newLine

As

String

 

       

If

File.Exists(sFile &

'.temp'

)

Then

            File.Delete(sFile &

'.temp'

)

       

End

If

 

       

Dim

sr

As

StreamReader = File.OpenText(sFile)

        line = sr.ReadLine

       

While

Not

line

Is

Nothing

            newLine = _Regex.Replace(line, sTargetValue)

           

My

.Computer.FileSystem.WriteAllText(sFile &

'.temp'

,

'Text'

,

True

)

            line = sr.ReadLine

       

End

While

        sr.Close()

 

       

If

File.Exists(sFile &

'.temp'

)

Then

           

File.Delete(sFile)

            File.Move(sFile &

'.temp'

, sFile)

       

End

If

 

   

End

Sub

This method first checks to make sure we have at least one replacement value. It would be good to also validate the remaining parameters too. The next step is to build the RegularExpression to be used to find the value to replace. In this case it will be an expression looking for the words passed in the sBadValues array. The '\b' is the RegularExpression escape for a word break.

Next declare an instance of the RegEx object using the RegularExpression we created. Then I check to see if there is a temporary file to hold our new values. Then I open the file with the questionable data and loop through it, running each line through the RegEx object. The Replace method will replace any matching values with the desired or sTargetValue.

After looping through the file it is closed. Then I simply move the temporary file over the initial file.

Public Sub ReplaceNAValues()

Dim sBadValues() As String = {'Not Applicable', 'NA'}

UpdateFields('c:\BadData.tsv', 'N\A', sBadValues)

End Sub

This is an example method that calls our standard replace method. In this example I show how to setup the variables used the base method to replace the undesired values.

Share This Article With Your Friends!

Googles Ads Facebook Pixel Bing Pixel LinkedIn Pixel