Data quality is one of the most challenging, yet most important elements of the software testing process. When it comes to testing big data, the bigger the data, the bigger the challenge.
As big data testing becomes more integral to enterprise app quality, testers must ensure that data is collected smoothly. Similarly, technologies that support big data become more important, including inexpensive storage, different types of databases and powerful — and readily available — computing.
Let’s examine how performance testing of big data applications can play a major role in a tester’s day-to-day tasks.
What is big data?
Generally, big data refers to data that exceeds the in-memory capability of traditional databases. Additionally, big data typically involves the collection of large amounts of disparate information on customers, transactions, site visits, network performance and more. Organizations must store all that data — perhaps over a long duration.
However, big data is more than just size. The most significant aspects of big data can be broken down into the six V’s:
- Volume: the sheer amount of data;
- Velocity: how quickly a system can create and transport the data;
- Variety: how many different types of data;
- Veracity: the accuracy and quality of the data, which can be difficult to ensure given its numerous sources and the effort needed to link, transform and cleanse it;
- Variability: how data flows vary and often change; and
- Value: the usefulness the data ultimately delivers to the business.
Big data and the business
Big data meets critical business needs and generates value because it provides organizations with critical information about their business trends, customers and competitors. The data enables analytics — which typically expresses results in statistical terms — such as trends, likelihoods or distributions; i.e., statistics an enterprise’s decision-makers might find helpful. Big data applications are all about analytics as opposed to data queries.
Big data is usually unstructured and doesn’t fit into a defined data model with organized columns and rows. The data can come in audio and visual formats like phone calls, instant messages, voicemails, pictures, videos, PDFs, geospatial data and slide shares. Data can also take the form of social media posts. The format and origins of a batch of big data can require special QA considerations. For example, to test big data collected from social media, testers might need to examine each separate social media channel to make sure displayed advertisements correspond to user buying behavior.
Testing big data applications
While testers don’t generally test the data itself, they will need an underlying knowledge of the database type, data architecture and how to access that database to successfully test the big data application. It’s unlikely that testers will use live data, so they must maintain their own test environment version of the database and enough data to make tests realistic.
Applications that rely on analytical output aren’t all the same. Rather than query the database for a specific result, a user is more likely to run statistical and sensitivity analyses. This likelihood means that the correct output — the answer — depends on distributions, probabilities or time-series trends. It’s impossible for testers to know answers ahead of time, because they’re often trends and complicated calculations, not simple fields in a database. And once testers find those answers, they won’t obviously be correct or incorrect, which adds another layer of uncertainty for testers who design test cases and analyze the results.
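To make this concrete, a test assertion against an analytical output might bound a statistic rather than match an exact value. Here is a minimal Python sketch; the simulated data and the thresholds are hypothetical:

```python
import random
import statistics

# Hypothetical example: instead of asserting one exact answer, a big data
# test checks that an analytical output falls within a statistical band.
random.seed(42)

# Simulated daily order totals, standing in for an analytics job's output.
daily_totals = [random.gauss(mu=1000, sigma=50) for _ in range(365)]

mean = statistics.mean(daily_totals)
stdev = statistics.stdev(daily_totals)

# The test passes if the computed trend stays inside an expected range,
# not if it equals a single known value.
assert 950 < mean < 1050, f"mean {mean:.1f} outside expected band"
assert stdev < 100, f"unexpectedly high variance {stdev:.1f}"
```

Designing the band itself is part of the test case: too narrow and the test is flaky, too wide and it catches nothing.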
However, if testers view big data testing like a certain aquatic animal, it can help.
ELT testing the jellyfish
Big data testing is like testing a jellyfish: Because of the sheer amount of data and its unstructured nature (like how a jellyfish is a nebulous undefined shape), the test process is difficult to define. Testing will need automation, and although many tools exist, they are complex and require technical skills for troubleshooting.
At the highest level, the big data testing approach involves both functional and nonfunctional components. Functional testing validates data quality and data processing. Most big data testing strategies are based on the extract, load, transform (ELT) process: they validate data quality in the source databases, the transformation of the data and the load into the data warehouse.
ELT testing has three phases:
- data staging
- MapReduce validation
- output validation
Data staging is validated by comparing the data coming from the source systems to the data in the staged location.
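A staging check of this kind can be sketched in plain Python by comparing record counts and per-record checksums; the record layout here is invented for illustration:

```python
import hashlib

# Illustrative sketch: validate staged data against the source extract by
# comparing record counts and per-record checksums. The records are hypothetical.
source_records = [
    {"id": 1, "name": "Ada", "total": 120.50},
    {"id": 2, "name": "Grace", "total": 88.00},
]
staged_records = [
    {"id": 1, "name": "Ada", "total": 120.50},
    {"id": 2, "name": "Grace", "total": 88.00},
]

def checksum(record):
    """Stable hash of a record's canonical field=value form."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Counts must match before any field-level comparison is meaningful.
assert len(source_records) == len(staged_records)

# Compare checksums keyed by id to catch truncated or altered fields.
source_sums = {r["id"]: checksum(r) for r in source_records}
staged_sums = {r["id"]: checksum(r) for r in staged_records}
mismatches = [rid for rid in source_sums if source_sums[rid] != staged_sums.get(rid)]
assert not mismatches, f"staging mismatch for ids: {mismatches}"
```

At real volumes this comparison would run as a distributed job, but the logic is the same: count first, then checksum.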
The next phase is the MapReduce validation, or validation of the data transformation. MapReduce is the programming model for unstructured data, with the Hadoop implementation most commonly used in the testing community. This testing ensures that the business rules an application uses to aggregate and segregate the data work properly.
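The map-shuffle-reduce logic that such a business rule relies on can be mimicked outside the cluster to validate expected results. A minimal plain-Python sketch, with a hypothetical sum-sales-per-region rule:

```python
from itertools import groupby
from operator import itemgetter

# Sketch of validating a MapReduce-style aggregation rule in plain Python.
# The business rule (sum sales per region) and the events are hypothetical.
raw_events = [
    {"region": "east", "sale": 10},
    {"region": "west", "sale": 5},
    {"region": "east", "sale": 7},
]

# Map: emit (key, value) pairs.
mapped = [(e["region"], e["sale"]) for e in raw_events]

# Shuffle/sort: group the pairs by key, as the framework would.
mapped.sort(key=itemgetter(0))

# Reduce: aggregate values per key.
reduced = {key: sum(v for _, v in group)
           for key, group in groupby(mapped, key=itemgetter(0))}

# Validate the rule against an independently computed expectation.
assert reduced == {"east": 17, "west": 5}
```

Running the same rule against a small, hand-checked data set like this gives testers a known-good baseline before validating the full Hadoop job.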
The final ELT phase is output validation, in which the output files from the MapReduce jobs are ready to move to the data warehouse. By the time data reaches this stage, data integrity and transformation should be complete and correct.
Data ingestion is the load — i.e., how the data enters the application. Performance testing for this component should focus on stress and load testing of that process. Such tests should also check that an application processes queries and messages efficiently.
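An ingestion load test can be sketched by timing how fast a burst of messages moves through a parsing-and-validation step. The message shape and volume below are invented, and the stand-in `ingest` function is hypothetical:

```python
import json
import time

# Hypothetical load-test sketch: measure how quickly the ingestion step can
# parse and accept a burst of messages.
messages = [json.dumps({"id": i, "payload": "x" * 100}) for i in range(50_000)]

def ingest(raw):
    """Stand-in for the application's ingestion path: parse and validate."""
    record = json.loads(raw)
    assert "id" in record and "payload" in record
    return record

start = time.perf_counter()
accepted = sum(1 for m in messages if ingest(m))
elapsed = time.perf_counter() - start

throughput = accepted / elapsed
print(f"ingested {accepted} messages at {throughput:,.0f} msg/s")
```

A real stress test would push this past the expected peak rate and watch for dropped or delayed messages rather than just measuring throughput.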
Assessing the performance of the data processing is critical to the overall test. Validate the speed of the MapReduce jobs and consider building a data profile to test the entire end-to-end process.
The analytics that are used to process big data should be performance tested. This is where the algorithms and throughput are validated.
Finally, ensure that parameters, including the size of commit logs, concurrency, caching and timeouts, are included in the performance test strategy.
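One way to fold those parameters into a strategy is to enumerate the test matrix up front. A sketch, with placeholder values:

```python
from itertools import product

# Sketch of enumerating performance-test runs across the tunable parameters
# mentioned above. All values are hypothetical placeholders.
parameters = {
    "commit_log_size_mb": [32, 128],
    "concurrency": [8, 64],
    "cache_enabled": [True, False],
    "timeout_s": [5, 30],
}

# Build every combination so each run exercises one controlled configuration.
runs = [dict(zip(parameters, combo)) for combo in product(*parameters.values())]

print(f"{len(runs)} performance-test configurations")  # 2 * 2 * 2 * 2 = 16
```

Even a small matrix like this grows quickly, which is why teams often prune it to the combinations their production workloads actually use.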
Challenges with big data performance testing
As with functional testing, the volume and variety — specifically, the unstructured nature — of big data create potential performance testing issues. Coupled with the velocity of big data processing, these characteristics open the door to a multitude of issues testers need to be aware of.
But these very considerations make performance testing even more essential. Testers should validate the load, response times, network bandwidth, memory capacity and other analytical components, because issues in any of these areas are magnified by the sheer volume and speed of big data.
Data processing comprises three activities: extract, transform and load. The performance testing strategy must address each of those activities, as well as the end-to-end data flow. At a high level, the main components of performance testing for big data are ingestion, processing and analytics.
Open the tool chest
Many types of tools support big data applications, including tooling for storage, processing and querying. Here are a few commonly used options.
Hadoop Distributed File System stores data across multiple machines, while Hadoop MapReduce provides parallel processing for queries. Also, Apache released Hadoop Ozone, a scalable distributed object store for Hadoop.
Another Apache offering is Hive, an open source data warehouse system that lets data scientists and developers make queries with an SQL-like language. Pig Latin, the query language for Apache Pig, helps teams analyze large data sets and can handle complex data structures, including the unstructured data often stored in NoSQL databases.
From a big data analytics perspective, some of the most powerful tools are Tableau, Zoho Analytics and Splunk. Tableau offers an engine that can blend multiple data points and doesn’t require users to know coding to create data queries. Zoho Analytics is user friendly and provides a wide variety of detailed reports. Splunk’s most important feature is its scalability; it can process up to 100 TB per day.
The maturation of NoSQL databases — such as MongoDB and Couchbase — allows for more effective big data mining for analytics. Specialized databases can accommodate specific uses, such as in-memory databases for high performance and time-series databases for data trends over time.
Testers should throw out test case conventions when they test big data setups. Instead of looking for a specific and known answer, testers should look for a statistical result, so the test cases have to reflect that. For example, if you test the big data a retail website collects, you must design test cases that allow the team to draw inferences on buying potential from all the information regarding customers, their searches, products added to carts, abandonments and purchase histories.
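Such a test case might compute an inferred rate from session data and assert it falls in a plausible band rather than equaling a known value. A hypothetical sketch, with invented session records and an invented expected band:

```python
# Hypothetical sketch of a statistical test case for retail big data: rather
# than asserting one known answer, assert that an inferred rate falls in a
# plausible band. The session data is invented.
sessions = [
    {"added_to_cart": True,  "purchased": True},
    {"added_to_cart": True,  "purchased": False},
    {"added_to_cart": True,  "purchased": False},
    {"added_to_cart": False, "purchased": False},
    {"added_to_cart": True,  "purchased": True},
]

carted = [s for s in sessions if s["added_to_cart"]]
abandonment_rate = sum(1 for s in carted if not s["purchased"]) / len(carted)

# The test case encodes an expected range (hypothetical here), not an exact value.
assert 0.3 <= abandonment_rate <= 0.7, f"rate {abandonment_rate:.0%} out of band"
print(f"cart abandonment rate: {abandonment_rate:.0%}")  # 2 of 4 carted -> 50%
```

The inference — buying potential, abandonment trends — comes from aggregating many such sessions, so the test data set must be large and varied enough for the statistic to be meaningful.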
Lastly, testers shouldn’t evaluate test results on simple pass/fail correctness, because there’s often no easy way to determine it. You might have to break the problem into smaller pieces and analyze the tests for each piece. Use technical skills and problem-solving creativity to determine how to interpret test results.
Where testers fit in
As testers, we often have a love-hate relationship with data. Processing data is our applications’ main reason to exist, and without data, we cannot test. Data is often the root cause of testing issues; we don’t always have the data we need, which causes blocked test cases, and defects get returned as “data issues.”
Data has grown exponentially over the last few years and continues to grow. We began testing with megabytes and gigabytes, followed by terabytes and petabytes; now exabytes, zettabytes and yottabytes have joined the data landscape. Welcome to the brave new world of big data testing.