The third and final part of 17 new must-know Data Science interview questions and answers covers A/B testing, data visualization, Twitter influence evaluation, and Big Data quality.
This post contains answers to:
- Q13. What makes a good data visualization?
- Q14. What are some of the common data quality issues when dealing with Big Data? What can be done to avoid them or to mitigate their impact?
- Q15. In an A/B test, how can we ensure that assignment to the various buckets is truly random?
- Q16. How would you conduct an A/B test on an opt-in feature?
- Q17. How to determine the influence of a Twitter user?
Q13. What makes a good data visualization?
Gregory Piatetsky answers:
Note: This answer contains excerpts from the recent post What makes a good data visualization – a Data Scientist perspective.
Data Science is more than just building predictive models – it is also about explaining the models and using them to help people to understand data and make decisions. Data visualization is an integral part of presenting data in a convincing way.
There is a great deal of research on good data visualization and how people best perceive information – see work by Stephen Few and many others.
Guidelines on improving human perception include:
- position data along a common scale
- bars are more effective than circles or squares in communicating size
- color is more discernible than shape in scatterplots
- avoid pie charts unless they are showing proportions
- avoid 3D charts and reduce chartjunk
- Sunburst visualization is more effective for hierarchical plots
- use small multiples (even though animation looks cool, it is less effective for understanding changing data.)
See 39 studies about human perception, by Washington Post graphics editor for a lot more detail.
From a Data Science point of view, what makes a visualization important is that it highlights the key aspects of the data – what the most important variables are, what their relative importance is, and what the changes and trends are.
Data visualization should be visually appealing but not at the expense of loading a chart with unnecessary junk, like in this extreme example on the right.
How do we make a good data visualization?
To do that, choose the right type of chart for your data:
- Line Charts to track changes or trends over time and show the relationship between two or more variables.
- Bar Charts to compare quantities of different categories.
- Scatter Plots to show joint variation of two data items.
- Pie Charts to compare parts of a whole – use them sparingly since people have a hard time comparing the area of pie slices
- You can show additional variables on a 2-D plot using color, shape, and size
- Use interactive dashboards to allow experiments with key variables
Here is an example of a visualization of US Presidential Elections, 1976-2016, that shows multiple variables at once: the electoral college vote difference (y-axis), the % popular vote difference (x-axis), the size of the popular vote (circle area), the winning party (color), and the winner's name and year (label). See my post on What makes a good data visualization for more details.
US Presidential Elections, 1976-2016
- What makes a good visualization, David McCandless, Information is Beautiful
- 5 Data Visualization Best Practices, GoodData
- 39 studies about human perception in 30 minutes, Kenn Elliott
- Data Visualization for Human Perception, landmark work by Stephen Few (key ideas summarized here)
Q14. What are some of the common data quality issues when dealing with Big Data? What can be done to avoid them or to mitigate their impact?
Anmol Rajpurohit answers:
The most common data quality issues observed when dealing with Big Data can be best understood in terms of the key characteristics of Big Data – Volume, Velocity, Variety, Veracity, and Value.
In the traditional data warehouse environment, comprehensive data quality assessment and reporting was at least possible (if not ideal). However, in Big Data projects the sheer scale of data (Volume) makes it impossible. Thus, data quality measurements can at best be approximations (i.e. they need to be described in terms of probabilities and confidence intervals, not absolute values). We also need to re-define most of the data quality metrics based on the specific characteristics of the Big Data project, so that those metrics have a clear meaning, can be measured (to a good approximation), and can be used to evaluate alternative strategies for data quality improvement.
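As a rough illustration of reporting a quality metric as an interval rather than an absolute value, the sketch below estimates the share of records with a missing field from a random sample and attaches a 95% Wilson score interval. The function names, the `email` field in the usage note, and the choice of the Wilson interval are assumptions made for this example, not part of the original answer.

```python
import math
import random


def wilson_interval(defects, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("sample must not be empty")
    p = defects / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def estimate_missing_rate(records, field, sample_size, seed=0):
    """Estimate the fraction of records where `field` is None from a sample.

    `records` is assumed to be a list of dicts (hypothetical layout).
    Returns (point_estimate, interval_low, interval_high).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    missing = sum(1 for r in sample if r.get(field) is None)
    low, high = wilson_interval(missing, len(sample))
    return missing / len(sample), low, high
```

Calling `estimate_missing_rate(records, "email", 500)` would return a point estimate together with its lower and upper bounds, so the result can be reported as "between low and high with about 95% confidence" instead of as a single exact figure.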
Despite the great volume of underlying data, it is not uncommon to find that some desired data was not captured or is not available for other reasons (such as high cost, delay in getting it, etc.). It is ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.
The tremendous pace of data generation and collection (Velocity) makes it incredibly hard to monitor data quality within a reasonable overhead of time and resources (storage, compute, human effort, etc.). By the time a data quality assessment completes, its output might be outdated and of little use, particularly if the Big Data project serves real-time or near real-time business needs. In such scenarios, you would need to re-define data quality metrics so that they are both relevant and feasible in the real-time context.
Sampling can speed up data quality efforts, but at the cost of bias (which makes the end result less useful), because samples are rarely an accurate representation of the entire data. Smaller samples run faster but are less representative.
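One way to limit the bias that comes from how a sample is drawn (as opposed to how large it is) is to sample uniformly at random. A minimal sketch, assuming the records arrive as a Python iterable of unknown length, is reservoir sampling (Algorithm R); the function name and parameters are illustrative and not taken from the original answer.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items from a stream.

    Each item seen so far has the same chance of being in the
    reservoir, and only k items are held in memory at any time.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Uniform sampling removes selection bias from the sampling step itself, but the sampling error still grows as `k` shrinks, so the speed versus representativeness trade-off described above remains.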
Another impact of velocity is that you might have to do data quality assessments on the fly, i.e. plugged in somewhere within the data collection/transfer/storage processes, because the tight time constraint does not give you the luxury of copying a selected data subset, storing it elsewhere, and running data quality assessments on it.
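As a sketch of what an on-the-fly assessment could look like, the generator below passes records through unchanged while tallying failed checks, so no separate copy of the data is made. The record layout (`user_id`, `amount`) and the two example checks are hypothetical and only serve the illustration.

```python
from collections import Counter


def monitor_quality(records, checks, stats):
    """Yield each record unchanged while counting failed checks in `stats`.

    `checks` maps an issue name to a function returning True when the
    record is valid; `stats` is a Counter updated in place.
    """
    for record in records:
        stats["total"] += 1
        for issue, is_valid in checks.items():
            if not is_valid(record):
                stats[issue] += 1
        yield record


# Hypothetical checks for an assumed record layout.
EXAMPLE_CHECKS = {
    "missing_user_id": lambda r: r.get("user_id") is not None,
    "invalid_amount": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}
```

Because `monitor_quality` is a generator, it can be placed between a reader and a writer in an existing pipeline, and the counters in `stats` can be inspected at any point while the data keeps flowing.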
One of the biggest data quality issues in Big Data is that the data includes several data types (structured, semi-structured, and unstructured) coming in from different data sources (Variety). Thus, a single data quality metric will often not be applicable to the entire data, and you would need to define data quality metrics separately for each data type. Moreover, assessing and improving the data quality of unstructured or semi-structured data is much trickier and more complex than for structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language (and the grammar) is the same, the meaning might be very different due to local dialects and slang. This leads to low data interpretability, another data quality measure.
Data from different sources often has serious semantic differences. For example, “profit” can have widely varied definitions across the business units of an organization or external agencies. Thus, fields with identical names may not mean the same thing. This problem is made worse by the lack of adequate and consistent metadata from each data source. In order to make sense of data, you need reliable metadata (for example, to make sense of sales numbers from a store, you need other information such as date-time, items purchased, coupons used, etc.). Usually, many of these data sources are outside the organization, so it is very hard to ensure good metadata for such data.
Another common issue is syntactic inconsistencies. For example, “time-stamp” values from different sources would be incompatible unless they are captured along with the time zone information.
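The time-zone problem can be illustrated with a small sketch that normalizes ISO 8601 timestamps to UTC and rejects values that carry no offset. The function name `to_utc` and the sample timestamps in the usage note are assumptions for this example.

```python
from datetime import datetime, timezone


def to_utc(timestamp_text):
    """Parse an ISO 8601 timestamp with a UTC offset and convert it to UTC.

    Timestamps without time zone information are rejected, because
    they cannot be compared reliably across sources.
    """
    parsed = datetime.fromisoformat(timestamp_text)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no time zone: {timestamp_text!r}")
    return parsed.astimezone(timezone.utc)
```

With this helper, `to_utc("2016-03-01T09:00:00-05:00")` and `to_utc("2016-03-01T15:00:00+01:00")` compare as equal, since both denote the same instant, whereas the bare string `"2016-03-01T09:00:00"` raises an error instead of being silently misread.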
Veracity, one of the most overlooked Big Data characteristics, is directly related to data quality, as it refers to the inherent biases, noise, and abnormality in data. Because of these, the data values might not be exact real values but approximations. In other words, the data might have some inherent imprecision and uncertainty. Besides data inaccuracies, Veracity also includes data consistency (defined by the statistical reliability of data) and data trustworthiness (based on data origin, data collection and processing methods, security infrastructure, etc.). These data quality issues in turn impact data integrity and data accountability.
While the other V’s are relatively well-defined and can be easily measured, Veracity is a complex theoretical construct with no standard approach for measurement. In a way this reflects how complex the topic of “data quality” is within the Big Data context.
Data users and data providers are often different organizations with very different goals and operational procedures. Thus, it is no surprise that their notions of data quality are very different. In many cases, the data providers have no clue about the business use cases of data users (data providers might not even care about it, unless they are getting paid for the data). This disconnect between data source and data use is one of the prime reasons behind the data quality issues symbolized by Veracity.
The Value characteristic connects directly to the end purpose. Organizations are harnessing Big Data for many diverse business pursuits, and those pursuits are the real drivers of how data quality is defined, measured, and improved.
A common and long-standing definition of data quality is that it is the “fitness for use” for the data consumer. This means that data quality depends on what you plan to do with the data. Thus, for a given dataset, two organizations with different business goals will most likely have widely different measurements of data quality. This nuance is often not well understood – data quality is a “relative” term. A Big Data project might involve incomplete and inconsistent data; however, it is possible that those data quality issues do not affect the utility of the data for the business goal. In such a case, the business would say that the data quality is great (and will not be interested in investing in data quality improvements). For example, for a producer of canned mashed potatoes, a batch of small potatoes would be of the same quality as a batch of big potatoes. However, for a fast food restaurant making fries, the quality of the two batches would be radically different.
The Value aspect also brings the “cost-benefit” perspective to data quality – whether it would be worth resolving a given data quality issue, which issues should be resolved first, etc.
Putting it all together:
Data quality in Big Data projects is a very complex topic, where theory and practice often differ. I haven’t yet come across any standard theory that is widely accepted. Rather, I see little interest in the industry in working towards this goal. In practice, data quality does play an important role in the design of a Big Data architecture. All data quality efforts must start from a solid understanding of the high-priority business use cases, and use that insight to navigate various trade-offs (samples given below) to optimize the quality of the final output.
Sample trade-offs related to data quality:
- Is it worth improving the timeliness of data at the expense of data completeness and/or inadequate assessment of accuracy?
- Should we select data for cleaning based on cost of cleaning effort or based on how frequently the data is used or based on its relative importance within the data models consuming it? Or, a combination of those factors? What sort of combination?
- Is it a good idea to improve data accuracy through getting rid of incomplete or erroneous data? While removing some data, how do we ensure that no bias is getting introduced?
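The last question above can be made concrete with a quick check: compare a statistic before and after dropping incomplete rows. The sketch below does this for the mean of one field; the function name and the `amount` and `zip` fields in the usage note are hypothetical.

```python
from statistics import mean


def mean_shift_after_dropping(records, value_field, required_fields):
    """Compare the mean of `value_field` before and after dropping rows
    that have a missing (None) value in any of `required_fields`.

    Returns (mean_before, mean_after, rows_dropped). Raises
    statistics.StatisticsError if no complete rows remain.
    """
    complete = [
        r for r in records
        if all(r.get(f) is not None for f in required_fields)
    ]
    before = mean(r[value_field] for r in records)
    after = mean(r[value_field] for r in complete)
    return before, after, len(records) - len(complete)
```

If, say, small orders are the ones that tend to lack a `zip` value, dropping incomplete rows raises the average `amount`; a large gap between the two means is a hint that the removal is not neutral with respect to that variable.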
Given the enormous scope of work and (relatively!) very limited resources, one common way to organize data quality efforts on Big Data projects is to adopt the baseline approach, in which the data users are surveyed to identify and document the bare minimum data quality needed to ensure that the business processes they support are not disrupted. These minimum satisfactory levels of data quality are referred to as the baseline, and the data quality efforts focus on ensuring that the quality of each dataset does not fall below its baseline level. This is a good starting point, and you may later move on to more advanced endeavors (based on business needs and available budget).
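A minimal sketch of a baseline check is shown below, assuming each quality metric is a score between 0 and 1 where higher is better. The metric names and threshold values in the usage note are made up for illustration and would in practice come from the survey of data users.

```python
def below_baseline(measured, baseline):
    """Return {metric: (measured_value, required_value)} for every metric
    that is missing or falls below its documented baseline level."""
    failures = {}
    for metric, required in baseline.items():
        value = measured.get(metric)
        if value is None or value < required:
            failures[metric] = (value, required)
    return failures
```

For example, with a baseline of `{"completeness": 0.95, "timeliness": 0.90}` and measured values of `{"completeness": 0.97, "timeliness": 0.80}`, only `timeliness` is reported, which tells the team where to focus first.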
Summary of Recommendations to improve data quality in Big Data projects:
- Identify and prioritize the business use cases (then, use them to define data quality metrics, measurement methodology, improvement goals, etc.)
- Based on a strong understanding of the business use cases and the Big Data architecture implemented to achieve them, design and implement an optimal layer of data governance (data definitions, metadata requirements, data ownership, data flow diagrams, etc.)
- Document baseline quality levels for key data (think of “critical-path” diagram and “throughput-bottleneck” assessment)
- Define ROI for data quality efforts (in order to create feedback loop on the ROI metric to improve efficiency and to sustain funding for data quality efforts)
- Integrate data quality efforts (to achieve efficiency through minimizing redundancy)
- Automate data quality monitoring (to reduce cost as well as to let employees stay focused on complex tasks)
- Do not rely on machine learning to automatically take care of poor data quality (machine learning is science and not magic!)