10 successful big data sandbox strategies
Being able to experiment with big data and queries in a safe, secure “sandbox” test environment is important to both IT and business users as companies get going with big data. However, setting up a big data sandbox is different from establishing traditional test environments for transactional data and reports. Here are ten key strategies to keep in mind when building and managing big data sandboxes:
1. Data mart or master data repository?
The database administrator needs to decide early on whether test sandboxes will use data directly from the master data repository that production uses, or whether sections of that data should be replicated and split off into separate data marts reserved for testing only. The advantage of the full repository is that tests run against the same data production uses, so results will be more accurate. The disadvantage is that testing can contend with production for that data. With the data mart strategy, you avoid contention with production, but the data will need to be refreshed periodically so it stays reasonably synchronized with production and continues to approximate the production environment.
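If you take the data mart route, the periodic refresh can be a small scheduled job that samples production data into the sandbox. The sketch below uses PySpark; the table names (prod.events, sandbox.events) and the 10 percent sample rate are placeholders, not a recommended configuration.

```python
# Minimal sketch of a "data mart" refresh, assuming a Spark warehouse with
# a production table prod.events and a sandbox database (names are placeholders).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-refresh").getOrCreate()

# Pull a sample of production data rather than the full repository,
# so sandbox queries never contend with production workloads.
prod_events = spark.read.table("prod.events")
sample = prod_events.sample(fraction=0.10, seed=42)

# Overwrite the sandbox copy on each scheduled refresh to keep it
# loosely synchronized with what production is using.
sample.write.mode("overwrite").saveAsTable("sandbox.events")
```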
2. Work out scheduling
Scheduling is one of the most important big data sandbox activities because it ensures that sandbox workloads run as efficiently as possible. It usually achieves this by scheduling a group of smaller jobs concurrently so they can complete while a longer job is running; in this way, resources are allocated to as many jobs as possible. The key is for IT to sit down with the user areas that rely on sandboxes so everyone understands the schedule up front, the rationale behind it, and when their jobs can be expected to run.
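To make the idea concrete, the sketch below packs smaller jobs alongside a long-running one using a simple first-fit approach. The job names and capacity figures are hypothetical; in practice a cluster's resource manager or workload scheduler would handle this.

```python
# Sketch: group smaller jobs so they run concurrently with a long job,
# keeping as much of the sandbox capacity busy as possible.
def plan_schedule(jobs, capacity):
    """jobs: list of (name, estimated_slots); returns batches of jobs to run together."""
    remaining = sorted(jobs, key=lambda j: j[1], reverse=True)
    batches = []
    while remaining:
        batch, used = [], 0
        for job in list(remaining):
            if used + job[1] <= capacity:
                batch.append(job)
                used += job[1]
                remaining.remove(job)
        if not batch:                      # a job larger than capacity runs alone
            batch.append(remaining.pop(0))
        batches.append(batch)
    return batches

# Hypothetical sandbox jobs and a cluster capacity of 10 slots.
jobs = [("nightly-model-train", 8), ("ad-hoc-query-a", 2),
        ("ad-hoc-query-b", 1), ("report-refresh", 3)]
for i, batch in enumerate(plan_schedule(jobs, capacity=10), start=1):
    print(f"Batch {i}: {[name for name, _ in batch]}")
```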
3. Set limits
If months go by without a specific data mart or sandbox being used, business users and IT should have mutually agreed policies for purging those resources and returning them to a pool that can be re-provisioned for other activities. The test environment should be managed as carefully as its production counterpart, with resources called into play only when they are actively being used.
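A lightweight sweep can help enforce such a policy. The following sketch assumes a catalog of sandboxes with last-used timestamps and a 90-day idle threshold; both are illustrative assumptions, not prescribed values.

```python
# Sketch: flag sandboxes that have sat idle past an agreed threshold.
from datetime import datetime, timedelta

IDLE_LIMIT = timedelta(days=90)   # assumed policy threshold

# In practice, last-used timestamps would come from catalog metadata or audit logs.
sandboxes = {
    "marketing_mart": datetime(2024, 1, 5),
    "finance_mart": datetime(2024, 6, 20),
}

def idle_sandboxes(catalog, now):
    return [name for name, last_used in catalog.items()
            if now - last_used > IDLE_LIMIT]

for name in idle_sandboxes(sandboxes, now=datetime(2024, 7, 1)):
    print(f"{name}: flag for review and reclamation")
```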
4. Use clean data
One of the preliminary big data pipeline jobs should be preparing and cleaning data so that it is of reasonable quality for testing, especially if you are using the “data mart” approach. It is a bad habit, dating back to testing for standard reports and transactions, to fill test regions with data that is incomplete, inaccurate, or even broken simply because it was never cleaned up before being loaded. Resist this temptation with big data.
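As an illustration, a cleaning step ahead of a sandbox load might look like the PySpark sketch below. The table and column names (events_raw, event_id, event_ts, amount) are placeholders for whatever your extract actually contains.

```python
# Sketch: clean a raw production extract before it lands in the test region.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sandbox-clean").getOrCreate()

raw = spark.read.table("sandbox.events_raw")

clean = (raw
         .dropDuplicates()                          # remove repeated records
         .dropna(subset=["event_id", "event_ts"])   # drop incomplete rows
         .filter("amount >= 0"))                    # discard obviously broken values

clean.write.mode("overwrite").saveAsTable("sandbox.events_clean")
```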
5. Monitor resources
Assuming big data resources are centralized in the data center, IT should set resource allowances and monitor sandbox utilization. One area that often requires close attention is the tendency to over-provision resources as more end-user departments engage in sandbox activities.
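One simple way to keep watch is to compare each department's sandbox usage against its allowance. The department names and figures in the sketch below are hypothetical; in practice the usage numbers would come from the cluster's resource manager or storage reports.

```python
# Sketch: compare sandbox utilization against per-department allowances.
allowances_gb = {"marketing": 500, "finance": 750, "analytics": 1000}   # assumed quotas
usage_gb = {"marketing": 620, "finance": 300, "analytics": 950}          # assumed usage

for dept, quota in allowances_gb.items():
    used = usage_gb.get(dept, 0)
    pct = used / quota * 100
    status = "OVER ALLOWANCE" if used > quota else "ok"
    print(f"{dept}: {used} GB of {quota} GB ({pct:.0f}%) - {status}")
```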