- Inability to process large volumes of data
- Inability to scale
- Difficulty in operationalization
Did you know that Teradata enables you to solve these problems with your language of choice by processing data directly within a Vantage system? Taking the analytic processing into the Teradata Vantage platform has tremendous processing and performance benefits, namely:
- R/Python code can leverage Vantage’s massively parallel platform (MPP) for performance and scalability,
- Data resident in the Vantage ecosystem need not be moved to another analytic processing platform, and
- It provides a platform for operationalizing analytics.
Why is this important?
Teradata Vantage was built from the ground up for efficient analytics. Over the last few decades, facilities have been added to the evolving core platform that enable users to bring analytics to the data. The shared nothing architecture* of Vantage allows much more than the simple storage and reporting of voluminous detailed data records. Additionally, Teradata Vantage now includes a Machine Learning Engine that adds multivariate statistical, machine learning and graph functions to the existing core capabilities. Together, these engines allow advanced analytical tasks to run directly on the data without data movement, a processing capability Teradata pioneered in the late 1990s and continues to evolve today.
How is this done in Vantage?
2. Languages are running directly on the Vantage platform.
Let’s take a look at both of these approaches.
Client-Side Language and Packages
- Context, connection and database management interfaces.**
- R/Python interfaces for Vantage Machine Learning, Graph and Advanced SQL Engine functions.
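The idea behind these client-side interfaces, a data frame that looks local but actually points to a table in the database, can be illustrated with a toy example. The `LazyTable` class and the `sales` table below are hypothetical and are not the tdplyr or teradataml API. The sketch only shows how operations can be recorded on the client and translated into SQL, so that the data itself never leaves the database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyTable:
    """Hypothetical stand-in for a client-side frame that points to a database table."""

    table: str
    columns: tuple = ("*",)
    conditions: tuple = ()

    def select(self, *cols):
        # Record the projection; no data is fetched.
        return LazyTable(self.table, tuple(cols), self.conditions)

    def filter(self, condition):
        # Record the predicate; no data is fetched.
        return LazyTable(self.table, self.columns, self.conditions + (condition,))

    def to_sql(self):
        # Translate the recorded operations into a single SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql


sales = LazyTable("sales")  # hypothetical table name
query = sales.filter("region = 'EMEA'").select("store_id", "amount")
print(query.to_sql())
```

Only the generated SQL would travel to the database in such a design; rows come back to the client only when the programmer explicitly asks for results.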
Server-Side Languages & Packages
For approach (2), it is important to understand Vantage’s shared nothing MPP architecture. On the Vantage system, data is evenly distributed across all its virtual units of parallelism, known as Access Module Processors (AMPs). To enable R/Python processing, the respective interpreters, base packages and any desired add-on packages need to be installed on every node.*** Vantage’s Table Operator mechanism drives execution, and the processing is simultaneously performed on every AMP against the data available to that unit of parallelism. The number of AMPs per node varies from 30 to 45 units, based on the version of the Intel CPU and overall node performance. Thus, a ten-node system with 40 AMPs per node will run 400 parallel processes of R/Python.
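The arithmetic in the paragraph above can be written down as a small sketch. The function names are illustrative, and the row split assumes an idealized, perfectly even distribution; a real system may show some skew.

```python
def parallel_instances(nodes: int, amps_per_node: int) -> int:
    """Number of R/Python processes when one instance runs per AMP."""
    return nodes * amps_per_node


def rows_per_amp(total_rows: int, amps: int) -> list:
    """Idealized even split of rows across AMPs (assumption: no skew)."""
    base, extra = divmod(total_rows, amps)
    # The first `extra` AMPs receive one additional row each.
    return [base + 1 if i < extra else base for i in range(amps)]


amps = parallel_instances(nodes=10, amps_per_node=40)
share = rows_per_amp(total_rows=1_000_000, amps=amps)
print(amps, share[0])
```

With these hypothetical numbers, each of the 400 script instances would see about 2,500 of the one million rows.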
Notably, each instance of R/Python is operating independently, with no inter-process communication across nodes or AMPs within a node. Hence, data scientists must make a script cognizant of the data available to its running instance. On this basis, the following use case types can be addressed:
- Row-Independent Processing – Depends only on the input from individual data rows on a single AMP.
- Partition-Independent Processing – Depends on the input from individual data partitions on a single AMP. Examples include model fitting for a given location, time period or product.
- System-wide processing – Based upon the entire input table which is evenly spread across every system AMP. In this situation, additional design or programming may be needed. Examples include calculating a global average or building an attrition model for the entire customer base.
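The first two patterns can be sketched as a plain Python simulation. Everything here is hypothetical: the `amp_rows` data, the function names, and the per-store mean, which merely stands in for fitting a model. The sketch also assumes that all rows of a partition are located on the same AMP.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: each inner list holds the rows visible to one AMP.
amp_rows = [
    [{"store": "A", "amount": 10.0}, {"store": "A", "amount": 14.0}],
    [{"store": "B", "amount": 3.0}, {"store": "B", "amount": 5.0},
     {"store": "B", "amount": 7.0}],
]


def score_rows(rows, threshold=6.0):
    """Row-independent: each output depends on one input row only."""
    return [{**r, "large": r["amount"] > threshold} for r in rows]


def fit_per_partition(rows, key="store"):
    """Partition-independent: one result per partition held on this AMP."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r["amount"])
    # The mean is a placeholder for a real per-partition model fit.
    return {k: mean(v) for k, v in groups.items()}


# Every AMP runs the same function on its own rows, with no communication.
scored = [score_rows(rows) for rows in amp_rows]
models = [fit_per_partition(rows) for rows in amp_rows]
print(models)
```

Neither function needs to know what any other AMP holds, which is why these two patterns fit the independent-instance execution model.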
Out of the box, Vantage provides support for the first two processing patterns. For system-wide processing, the data scientist must construct a master processing level to combine and appropriately process the partial results returned from every AMP process.
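As an illustration of such a master processing level, a global average can be assembled from per-AMP partial results. This is a minimal simulation with hypothetical data and function names, not a Vantage API: each AMP returns a (sum, count) pair, and a final step combines them.

```python
# Hypothetical values held by three AMPs; one AMP happens to hold no rows.
amp_values = [[10.0, 14.0], [3.0, 5.0, 7.0], []]


def partial_result(values):
    """Runs independently on each AMP: return (sum, count) for local rows."""
    return (sum(values), len(values))


def combine(partials):
    """Master step: merge the partial results into one global average."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    if count == 0:
        raise ValueError("no rows on any AMP")
    return total / count


partials = [partial_result(v) for v in amp_values]
global_avg = combine(partials)
print(partials, global_avg)
```

Returning sums and counts rather than local averages matters: averaging the per-AMP averages would weight every AMP equally regardless of how many rows it holds, and would generally give a different answer.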
Start Optimizing Your Data Science Process
Whichever method you prefer, Teradata Vantage lets you use R and Python while taking advantage of its massively parallel platform (MPP) for performance and scalability. If you’re on a previous version and curious about upgrading to Vantage, contact us today.
* In a shared nothing distributed computing architecture, each processing node is independent and self-sufficient. The nodes share no memory or disk storage, and there is no single point of contention across the system, so that maximum performance and scalability are achieved.
** Both tdplyr and teradataml present R data frames and pandas DataFrames that appear local to the programmer but in fact point to tables or views in Vantage.
*** Vantage makes the installation process easier by providing bundles of R/Python base and add-on packages that have been tested against the base operating system and vetted for security and legal constraints.