Data processing provides the connection between data capture and data output. In this article, we look at 11 key things everyone needs to know about scientific data processing.
1. Data Needs Cleaning
All data sets contain errors, and if data is not cleaned properly those errors can produce severely inaccurate results. The bigger the data set, the more errors creep in. Errors can come from faulty experiments, faulty capture, or faulty entry. User entries in particular introduce complexities such as abbreviations, misspellings, and varied terms for the same entity. Data containing errors can’t be used effectively. One of the key jobs of the data processing scientist is to clean data: to effectively and efficiently remove errors so that the data is useful. Cleaning is not as exciting or glamorous as building analytical models, but it is vitally important for those models to work, and it must be done right.
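A minimal sketch of what this kind of cleaning can look like in practice (the synonym map and the example values are hypothetical):

```python
# Normalize varied user entries for the same entity. The canonical
# terms and the synonym/misspelling map here are hypothetical.
CANONICAL = {
    "n.y.": "new york",
    "ny": "new york",
    "new yrok": "new york",   # a known misspelling
    "nyc": "new york",
}

def clean_city(raw):
    """Lowercase, trim whitespace, and map known variants to one term."""
    value = raw.strip().lower()
    return CANONICAL.get(value, value)

entries = ["NY", " New Yrok ", "new york", "Boston"]
cleaned = [clean_city(e) for e in entries]
# three variants collapse to "new york"; unknown entries pass through
```

Real cleaning pipelines add fuzzy matching and validation on top, but the principle is the same: varied terms for non-varied entities must be collapsed before analysis.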
2. No Model is Perfect
When you’ve invested a lot of time developing a model through strict analytical procedures, a vested interest in the process tends to give you an overly positive view of the results. It’s important to remember that all models are approximations and estimations. A model might carry an estimated 1% chance of error, but that figure is itself only an estimate. Modeling is a predictive, theoretical tool for exploring possible outcomes, not a scientific absolute.
3. Approximate Can Sometimes be Good Enough
A recent study detailed how a major entertainment company dramatically sped up its data processing by using an APPROX_DISTINCT(x) function in place of COUNT(DISTINCT x). The change is as simple as it looks, even to non-programmers: one is approximate, the other exact. On data sets in the millions of rows, the replacement introduced only about a 4% error, which was quite acceptable for the purpose, while reducing computing time from hours to minutes.
A reduction in computing time of this magnitude can save a company huge amounts on hardware, and increase profits by delivering valuable data where it’s needed far more quickly, at the cost of only a small error rate.
As data scientists, we can get bogged down trying to guarantee 99.9% accuracy when, even in straightforward descriptive applications, approximations are often good enough.
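Behind a function like APPROX_DISTINCT sits a sketch data structure (Presto, for instance, backs it with HyperLogLog). As a rough illustration of the idea, not the engine’s actual implementation, here is a simpler sketch from the same family, K-minimum-values, which trades a few percent of accuracy for bounded memory:

```python
import hashlib
import heapq

def approx_distinct(items, k=1024):
    """Estimate the distinct count with a K-minimum-values sketch:
    keep only the k smallest hash values ever observed."""
    heap = []        # max-heap (negated) holding the k smallest hashes
    members = set()  # hashes currently in the heap
    for item in items:
        h = int.from_bytes(hashlib.sha1(repr(item).encode()).digest()[:8], "big")
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: exact
    kth_smallest = -heap[0] / float(1 << 64)  # normalized to (0, 1)
    return int((k - 1) / kth_smallest)

# exact answer is 20,000 distinct values; the sketch lands close
data = [i % 20000 for i in range(60000)]
estimate = approx_distinct(data)
```

The sketch stores at most k hashes no matter how large the input grows, which is exactly the hours-to-minutes trade the article describes.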
4. Big Data is a Tool, not a Solution
So much focus is put on “Big Data” that we sometimes forget it is merely a raw material to work with, not a solution. Big Data needs processing like any other data, and often more of it: however many insights it promises, it still has to be cleaned and transformed before those insights can be drawn. It’s vitally important to remember Big Data’s place in data science.
5. Transforming Data Should Not Rely on Any One Technique
While it’s tempting to chase the latest shiny object, data science, as the name implies, relies on a scientific approach. When any one technique is favored over a comprehensive approach, the scientific method breaks down. In scientific data processing, the analyst draws conclusions using a variety of approaches; the more comprehensive the approach, the sounder the results will be.
6. Break Down Data Into Subsets
An important facet of scientific data processing is breaking data down: only when data is split into subsets can relationships start to be seen. Once subsets are developed, new trends can be observed and new hypotheses formed.
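A minimal sketch of subsetting, using hypothetical trial records — grouping by one field immediately exposes a per-group difference that the flat list hides:

```python
from collections import defaultdict

def split_by(records, key):
    """Partition records into subsets keyed by one field."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return dict(subsets)

# hypothetical trial data
records = [
    {"group": "control", "score": 71},
    {"group": "control", "score": 69},
    {"group": "treated", "score": 82},
    {"group": "treated", "score": 80},
]

for group, rows in split_by(records, "group").items():
    avg = sum(r["score"] for r in rows) / len(rows)
    print(group, round(avg, 1))  # per-subset averages reveal the gap
```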
7. Relationships are Important
Recognizing relationships in data is an essential skill when processing it. Relationships are the key to developing data models and to discovering the truths behind the data that benefit management and users; without a relationship, the data is meaningless. Relational databases have categories that relate, but to draw new discoveries there needs to be a method of finding new relationships between data sets.
8. Non-relational Databases are Also Important
Non-relational databases are harder to query, but they have the power to section, model, and analyze both structured and unstructured data. Where large relational databases carry the potential for errors such as object-relational impedance mismatch, non-relational databases maintain purer sets of data in which objects can overlap easily. They are an important concept in scientific data processing, especially as Big Data keeps growing.
9. Correlation and Causation are not the Same Things
While establishing relationships, categorizing variables, and breaking data down into subsets are all important, each of these steps can tempt us into mistaking correlation for causation. The same result doesn’t indicate the same cause. When we draw parallels between two sets of data to establish a pattern, we may include variables that do not in fact match. This is where an analyst’s skill in scientific data processing is required.
Two sets of data with the same trend don’t imply the same cause. A simple yet humorous example: most people who floss are not obese, therefore flossing causes weight loss.
To avoid spurious correlations, data processing scientists need to consider data from as many angles as possible, and, most importantly, it is often human input that sees what a machine cannot. Human logic recognizes that flossing doesn’t make you thin; the data processor’s domain knowledge must be used to review data sets and add qualifiers where needed when presenting the facts.
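The flossing fallacy is easy to reproduce with made-up numbers: any two series that merely grow over time will correlate almost perfectly, causal link or not. A minimal sketch with hypothetical figures:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical counts: both simply rise year over year, no causal link
years = range(10)
flossers        = [1000 + 52 * t for t in years]
gym_memberships = [800 + 31 * t for t in years]

r = pearson(flossers, gym_memberships)
# r is essentially 1.0 — a perfect correlation between unrelated series
```

A shared upward drift (here, time itself) is the lurking variable; the correlation is real, the causation is not.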
10. To be Scientific, Data Processing needs Bayesian Logic
Data sets and statistics can be used, or more correctly misused, to prove almost anything. A Bayesian approach helps guard against this.
What does Bayesian logic mean? Bayesian logic is based on Bayes’ theorem, and it means trying to disprove as well as prove. Proving a theory is fine, but it remains only a theory. We need to consider the chance of error across all possible outcomes, which means weighing the probability of alternative theories.
A simple illustration of the power of Bayesian logic is an experiment that involves blindly drawing colored balls from a basket. Although improbable, it’s entirely possible that the same color is drawn every time. Without Bayesian reasoning, we might wrongly conclude that no other colors exist.
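A minimal sketch of the Bayesian update this implies, assuming just two hypothetical hypotheses (the basket is all red, or half red and half blue). Each red draw raises the probability of "all red", but it never reaches certainty:

```python
def posterior_all_red(n_red_draws, prior_all_red=0.5):
    """P(basket is all red | n consecutive red draws), comparing two
    hypothetical hypotheses: all red vs. half red / half blue."""
    # likelihood of n red draws under each hypothesis
    like_all  = 1.0                  # all-red basket always yields red
    like_half = 0.5 ** n_red_draws   # half-red basket: (1/2)^n
    num = prior_all_red * like_all
    denom = num + (1 - prior_all_red) * like_half
    return num / denom

# confidence grows with each draw but never hits 1.0 exactly
for n in (1, 5, 10):
    print(n, round(posterior_all_red(n), 5))
```

This is the article’s point in miniature: the alternative theory keeps a nonzero probability, so "all balls are red" stays a strongly supported theory, never an absolute.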
11. A Good Data Scientist Asks Lots of Questions
The last and most important point to remember about scientific data processing is that questions need to be asked: don’t assume! A mental model, developed through good question-and-answer tactics, is the starting point of all good data analysis. If you don’t ask the right questions, you won’t get the right answers, and time spent on a data set is wasted if you start from the wrong assumptions because you never obtained all the facts. Only good quality questions yield good quality answers. A good data scientist is one who asks them.
Everyone can benefit from scientific data processing, but we need to remember that it is only a science when it adheres strictly to scientific principles. Work should be quantitative, qualitative, and fact based. Keep asking questions, keep validating, and use all the resources in your toolbox.