Data is one of the most important types of information in our world today. It’s collected, shared, traded, analyzed, and visualized. The strategic use of data helps organizations understand and improve their business processes, reducing wasted money and time. Not all data is created equal, however. Unorganized data, or “dirty data,” doesn’t provide much value and represents missed opportunities. One of the best examples of dirty data in digital asset management (DAM) is disorganized, unstructured metadata.
As a DAM service provider, Stacks is often brought in to help solve dirty metadata problems. For example, we recently worked with a client to organize over 1,000,000 digital assets with non-compliant metadata. The company’s new marketing strategy relied on having additional metadata fields and clean, clear data to study and build reports on. Because of the importance of the project, we had only two months to sanitize the data. Accordingly, we had to focus our work on the most valuable data.
The Data Highway: How Do We Get from Point A to Point B?
When prioritizing assets to organize, it helps to break them into smaller groups before building out the logic for the entire project. Below are some common ways to do this:
- File or folder path - File or folder path information is usually very helpful when cleaning data. Assets are often organized in folders the way they should be tagged. You can often scrape a TON of information from a simple file path and use mapping to get a solid start on building out basic tags for your assets.
- File Type - File type usually has a direct relationship to key data which can be applied to assets. This means it can be easily used as a filtering tool. For example, MP4 files in a folder named “Campaign” can quickly be tagged as “Campaign Videos.”
- Synonyms - Is the metadata just not on brand or in your controlled vocabulary (CV)? If so, are there synonym matches you can utilize, or branching matches that align with the proper metadata?
- Filename - Utilizing the information in filenames to clean up data is one of the easiest ways to take back control of a set of unstructured data.
Using these strategies, you can quickly develop a set of data to apply to your assets to clean them up and make them easy to maintain.
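As a minimal sketch of the first two strategies, the snippet below scrapes candidate tags from a folder path and a file extension. The folder-to-tag mapping here is purely hypothetical; a real project would build it from the library’s actual folder structure.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from folder names to tags (illustrative only).
FOLDER_TAGS = {
    "campaign": "Campaign",
    "print": "Print",
    "social": "Social",
}

def tags_from_path(path: str) -> list[str]:
    """Scrape candidate tags from a file's folder path and extension."""
    p = PurePosixPath(path)
    tags = []
    for part in p.parts[:-1]:            # folder names only, not the filename
        tag = FOLDER_TAGS.get(part.lower())
        if tag:
            tags.append(tag)
    if p.suffix.lower() == ".mp4":       # file type has a direct tag relationship
        tags.append("Video")
    return sorted(set(tags))

print(tags_from_path("assets/Campaign/hero_loop.mp4"))  # → ['Campaign', 'Video']
```

Run across an asset inventory, a function like this gives you a solid base layer of tags before any manual review begins.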
What Tools Can We Use?
Cleaning dirty data can be an expensive and time-consuming task. If you can avoid manual, file-by-file work, the savings are immense. Below are some tools to better understand the problem (data mines) and help determine whether you can solve it efficiently by using scripting, CSV files, etc.
1) Data Mine / Data Visualization
A data mine is a comprehensive report about the data in your library. It will tell you the number and type of assets in the library as well as other important information which can help you develop a strategy for cleaning your dirty data.
The report is usually not very expensive, makes the project much easier to visualize, and is the best weapon in your arsenal.
2) Scripting / App Development
Once you’ve uncovered your data set and understand the mapping that needs to be done, figuring out whether you can develop a script or simple app to help is KEY. The heart of this question is: Is there a logical set of rules that can be written to fix your dirty data?
For instance, when a large amount of metadata located in a file path needs to be applied to the metadata fields, you can create a rule list of IF-THEN statements to perform the task. For example, IF “SPR20” is in the file path, THEN apply “Spring 2020 Campaign” as a keyword.
This kind of scripting and app development may seem daunting, but it's much more cost- and time-effective than making the changes manually. You can also develop reporting tools, error tracking, etc. to help scale and control the quality even more.
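A rule list like the one described can be sketched in a few lines. The triggers and keywords below are hypothetical placeholders (only the “SPR20” rule comes from the example above); a real script would load the rule list from a maintained mapping file.

```python
# Hypothetical IF-THEN rule list: (substring to find in the path, keyword to apply).
RULES = [
    ("SPR20", "Spring 2020 Campaign"),
    ("FALL19", "Fall 2019 Campaign"),   # assumed additional rule for illustration
]

def keywords_for(path: str) -> list[str]:
    """Apply every IF-THEN rule whose trigger appears in the file path."""
    return [keyword for trigger, keyword in RULES if trigger in path]

print(keywords_for("brand/SPR20/banner_01.png"))  # → ['Spring 2020 Campaign']
```

Because the rules are plain data rather than hard-coded logic, the same script scales from a handful of rules to hundreds, and the rule list itself can be reviewed by non-developers.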
3) CSV Files
When in doubt, try to capture all your data in a CSV file. Doing massive clean-up inside a DAM platform is typically not the best process. Managing data this way is often difficult and can be confusing.
By exporting data from your DAM or server into CSV files, you can usually make massive changes in them by using pivot tables and other Excel tools.
For example, Stacks once did a project that required over 20 million total changes to the core metadata set and used Excel tools to accomplish it in six weeks, far faster than anticipated.
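The same bulk find-and-replace that pivot tables enable in Excel can also be scripted directly against the exported CSV. The sketch below assumes a hypothetical export with a semicolon-delimited keywords column and an assumed fix-up mapping; it is an illustration of the pattern, not the actual Stacks workflow.

```python
import csv
import io

# Hypothetical DAM export: one row per asset, keywords in one messy column.
exported = """filename,keywords
hero.mp4,spr20;video
banner.png,SPR 20; print
"""

# Assumed normalization mapping from dirty values to clean keywords.
FIXES = {"spr20": "Spring 2020 Campaign", "spr 20": "Spring 2020 Campaign"}

cleaned = []
for row in csv.DictReader(io.StringIO(exported)):
    # Split, trim, and lowercase each keyword, then swap in the clean value.
    keywords = [k.strip().lower() for k in row["keywords"].split(";")]
    row["keywords"] = "; ".join(FIXES.get(k, k) for k in keywords)
    cleaned.append(row)

print(cleaned[0]["keywords"])  # → Spring 2020 Campaign; video
```

Once the cleaned rows are written back out with `csv.DictWriter`, the file can be re-imported into the DAM in one pass instead of touching assets individually.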
Before cleaning your dirty data, it’s important to learn more about your library. How many assets do you have? What type? Can they be divided into smaller groups? What tools can you use to organize your library in an efficient, logic-driven way? Sometimes manual labor is the only answer; however, you'll be surprised how often you can use some level of scripting technology to save both time and money.