Pyplan is a platform designed for users without programming skills to build and share Data Analytics and Planning applications.
Building an application starts with data input, which can be entered manually or read from an external data source.
The manual data input is constructed by dragging a node of the type:
This type of node, once a title has been defined, opens a wizard that allows you to define the type of manual data to be entered.
This is the input used to enter a single parameter. Once the node title has been defined, it is represented in the diagram as follows:
It is used to enter the different choice alternatives that will be presented in a drop-down selector. Once created, it appears as follows in the diagram:
The form is the most powerful and versatile manual data entry tool since it allows combining data entry columns together with calculated columns that serve as a reference or guide for the data being entered. For example, if we are creating a tool to enter data for a sales budget, it may be helpful for the person who is going to enter that data to have the previous year’s sales as a reference. The form is stored in a database, thus allowing multiple users to enter data simultaneously.
When dragging a data entry node, and after having chosen Form as input element, the following wizard for its creation is displayed:
Defining a title for each input field generates a suggested field name; the column type is then chosen from the options shown in the drop-down box.
A data cube is a particular input object that allows entering a single parameter for every combination of the cube's dimensions. For this reason, its definition requires indicating the dimensions of the input data cube. Its use is recommended when you want to emphasize entering data for every element of the breakdown dimensions.
Another way to enter data into Pyplan is by connecting to external data sources, which are read when the corresponding code is executed.
For this purpose, drag a node named data reading that will display, after defining its title, a dialog box like the following one:
The most frequently used options (csv, Excel) also have a specific wizard that allows you to configure all the reading parameters.
Other, less frequently used options are initialized with base code that, once the necessary parameters are completed, performs the corresponding data reading.
Once the data entries have been generated, the next step is their analysis and processing; for this purpose, Variable type nodes are used. This node type is the most general of all and allows any kind of code to be included in its definition. When dragging and dropping a “Variable” node, we will be asked to define its title, and we will then obtain a node with a definition like the following one:
result = 0
To link this new node to another node that serves as its data source, we can either write that node’s identifier (Id) in the definition, or place the cursor where the call should be inserted, hold the <Alt> key, and click on the node we want to link; this inserts that node’s Id into the definition.
Accepting the changes, we will see an arrow appear indicating the link between the nodes, and the variable node’s color change to gray, indicating that this process has no output beyond the node itself.
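A Variable node definition is ordinary Python that assigns the node’s output to `result`, with upstream node Ids resolving to their evaluated results. A minimal sketch, in which the upstream node id `data` and its values are hypothetical stand-ins:

```python
# Stand-in for an upstream node whose Id is `data` (hypothetical values).
# In Pyplan this name would resolve to that node's evaluated result.
data = [120, 95, 143]

# Definition of the downstream Variable node: scale the source values by 10%.
result = [value * 1.1 for value in data]
```

Editing the definition this way is what creates the arrow in the diagram: referencing `data` makes that node an input of this one.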
The Variable type node allows writing Python code freely in its definition. However, Pyplan provides a series of wizards that help to perform data transformation operations through the use of interfaces prepared for this purpose.
These wizards depend on the data structure (object) we are working with, so the node must be evaluated first for Pyplan to determine which wizards to present.
These wizards are identified as “Wizards” and are displayed once the node has been executed, by clicking on the icon shown in the following figure:
When using these wizards, you will be able to observe how the node definition code changes with the appropriate instructions to generate the desired process. This procedure, equivalent to recording Macros in a spreadsheet, allows the user who does not know the Python language to be introduced to its functions and syntax.
The indices or dimensions are the way in which the data is structured. That is, they are the row and column headings of a table that describe what the value we see at their intersection is about. Examples of indexes are the list of products, regions, time periods, etc. They serve to characterize the data or facts with which we operate.
In Pyplan, the indexes are generated by dragging a specific node type for this purpose identified as
After entering a title for the Index, a wizard is displayed to define the elements of the index.
Allows manual entry of the index elements. Elements can also be copied from a table and pasted at a chosen starting position; the element list is extended if the pasted range exceeds the number of elements displayed.
Generates the index automatically from the parameters of a range, which can be of type text, number, or date.
Indexes can also have hierarchies, i.e. higher levels of aggregation. For example, the natural hierarchy of a Country index is Region or Continent, or that of a Month index is Quarter, Semester or Year.
The correspondence between an index and its upper hierarchy is established through a table where, for each element of the lower hierarchy, its corresponding element in the upper hierarchy is indicated. The following image illustrates the case for an index “Countries” and its upper hierarchy “Regions”.
By right-clicking on the lower hierarchy index, its properties indicate which table contains the correspondence relation with the upper hierarchy, following the example:
Any index containing a hierarchical relationship is identified with an icon inside the node, as shown for the Countries node in the following image.
The diagram or “workflow” is the way the code is organized in Pyplan. A general convention to aid readability is to keep the direction of the arrows / flow of information from left to right and from top to bottom. In addition to headings that summarize the process or information housed in a node, it is possible to include text boxes to help interpret a set of nodes.
As a general rule, it is desirable not to have more than 20 nodes in a diagram. When a diagram grows beyond this, it is recommended to use “Modules” to group nodes whose processes share a specific purpose and can therefore be grouped together.
In the above diagram we could create a module called “Outputs” (1) that groups the 3 output nodes. And then cut (2) and paste (3) the output nodes into the new module.
These 3 steps are described in the following image:
Nodes are automatically colored to facilitate the understanding of their purpose and function.
All nodes keep their original color, which is the one displayed in the menu from where they are dragged, except for the Variable node. This node can take three colors according to its function:
A node can have two states: Not Calculated or Calculated.
When opening an application, all nodes are pending execution until some command triggers it. When a node is commanded to compute (execute), Pyplan recursively traverses the influence diagram, asking whether the nodes feeding the node to be executed have been computed; if not, it steps one node back in the computation process and asks the same question.
Once it reaches the application boundary, or the boundary of already-computed nodes, it computes downstream until it finally presents the result of the queried node.
This process guarantees that the result of a node when calculated is always the same and that its value does not depend on the execution sequence of the preceding nodes.
This mechanism is also computationally efficient: when any intermediate variable in the calculation changes, only the nodes whose values are affected by that change are recalculated.
The status Calculated / Not calculated can be inspected when selecting a node. In the result view it will show the output of the node if it is calculated and otherwise a message indicating that the node is not calculated.
Pyplan natively interprets some data structures, such as Tables and Cubes, coming from the most widespread Python libraries (Pandas, NumPy, etc.).
Data tables are the typical structure of a database, with attributes defined by columns, where each row corresponds to a record.
Data cubes can have any number of dimensions. These dimensions, in turn, can be named or left undefined.
Most commonly used data structures
A table is similar to a database table, i.e. it is a data structure where each column represents an attribute or measure and where each row corresponds to a particular record of those attributes or measures.
The data tables in Pyplan correspond to the Dataframe object of the Pandas library, one of the most used libraries in Data Science. Some basic functionalities of operations with Dataframes are provided by the Pyplan wizards. There are however many other operations that can be performed through Python coding using the Pandas library.
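As a minimal illustration of the underlying object (column names and values here are purely illustrative), a Pyplan Table is simply a pandas DataFrame, so any pandas operation can appear in a node definition:

```python
import pandas as pd

# A small DataFrame playing the role of a Pyplan Table node result.
# Column names are illustrative only, not part of any real dataset.
result = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Surface": ["Clay", "Hard", "Grass"],
    "Wins": [14, 21, 9],
})

# Any pandas operation can be used in a node definition, for example:
total_wins = result["Wins"].sum()
```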
Pandas Quick Introduction
A quick introductory guide to Pandas functionalities can be found here.
A node that, when evaluated, returns a Table (DataFrame) as a result will present the following wizard options:
Continuing with the example, if we create a variable “first variable”, we change its definition by linking it to the node data such that:
result = data
After executing this node so that Pyplan can identify the resulting object, we open the wizards and choose sort “by Year”.
We will see that the result is the table sorted by year, and that its final code is defined as follows:
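In pandas terms, the code such a sort wizard writes amounts to a `sort_values` call on the upstream node. A sketch, where the node id `data` and its contents are hypothetical stand-ins for the example:

```python
import pandas as pd

# Stand-in for the upstream `data` node (illustrative contents).
data = pd.DataFrame({"Year": [2022, 2020, 2021], "Wins": [9, 14, 21]})

# Code of the kind the "sort by Year" wizard writes into the node definition:
result = data.sort_values(by="Year")
```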
The user could continue to interact with the data object and analyze the changes it causes in the node definition and thus learn the Python language.
The data cube is also an object natively supported by Pyplan; the underlying object is the DataArray from the XArray library. A named data cube is a data structure containing values indexed by n named dimensions. In Pyplan these dimensions are called Indexes and are identified by this type of node
in the diagram.
For example, we could think of a “Count of Win” data cube indexed by the dimensions [Tournament Country, Surface, Year]
This cube, following the example developed so far, would be:
Data cubes are created by transformation of tables (Dataframes) into data cubes, by direct inputs (Input Table), or by operations between cubes.
Right-clicking on a node that is evaluated as a data table displays the following menu:
Clicking on the “Create Cube” option displays the following dialog box:
It lists all the columns of the dataframe, their data types, the index (if it exists) that will collect the values of each column, an option to create the index in case it does not exist, and finally the aggregation function that will be used to group the values of the fact or variable to be represented.
To create a data cube in Pyplan, the dimensions (Indexes) that characterize the cube must already exist. This is why the wizard offers to create the indexes from the column data when they do not exist.
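Under the hood, this conversion can be expressed with pandas and xarray: group the records by the chosen dimension columns, aggregate the fact with the selected function, and turn the result into a DataArray. A sketch with hypothetical column names and values:

```python
import pandas as pd

# Records with two dimension columns and one fact (names are illustrative).
df = pd.DataFrame({
    "Surface": ["Clay", "Clay", "Hard", "Hard"],
    "Year": [2020, 2021, 2020, 2021],
    "Wins": [3, 5, 7, 2],
})

# Group by the dimension columns, aggregate the fact with `sum`,
# then convert the resulting Series into an xarray DataArray (the cube).
cube = df.groupby(["Surface", "Year"])["Wins"].sum().to_xarray()
```

The resulting `cube` is indexed by the dimensions Surface and Year, mirroring what the Create Cube dialog produces.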
As with Table type objects, Pyplan provides wizards to operate with Data Cubes, which are automatically deployed in the same place, when the resulting object is a Data Cube of type XArray.
Additionally, unlike Tables, Data Cubes allow mathematical operations between them that result in new Cubes. It is important to understand how these operations between data cubes work in order to build the desired calculation process.
When adding, subtracting, multiplying, or applying any other mathematical operation between a scalar and a data cube, the operation is performed between the scalar and each element of the data cube.
The following image shows the node definition that establishes it:
result = cube + scalar
The value of the scalar, 5 in this case, is added to each element of the original cube giving as a result:
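With xarray DataArrays this is exactly elementwise broadcasting of the scalar; a sketch with illustrative dimensions and values:

```python
import xarray as xr

# A 2x2 cube with named dimensions (illustrative coordinates and values).
cube = xr.DataArray(
    [[1, 2], [3, 4]],
    coords={"Surface": ["Clay", "Hard"], "Year": [2020, 2021]},
    dims=["Surface", "Year"],
)
scalar = 5

# The scalar is added to every cell of the cube.
result = cube + scalar
```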
If the cubes have the same dimensions, the indicated operation is performed between the elements in the same cells of both cubes.
In this case we can see how the first cell of Cube C is the result of the sum of the first cell of Cube A plus the first cell of Cube B.
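In xarray terms, cells with matching coordinate labels are combined; a sketch with illustrative dimensions and values:

```python
import xarray as xr

# Two cubes sharing the same dimensions and coordinates (illustrative).
coords = {"Surface": ["Clay", "Hard"], "Year": [2020, 2021]}
cube_a = xr.DataArray([[1, 2], [3, 4]], coords=coords, dims=["Surface", "Year"])
cube_b = xr.DataArray([[10, 20], [30, 40]], coords=coords, dims=["Surface", "Year"])

# Cells with matching coordinate labels are added pairwise.
cube_c = cube_a + cube_b
```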
When operating with cubes of different dimensions, the dimension missing from one of the cubes operates scalarly over the other cube. An example illustrates this behavior:
In the example above you can see how Cube B, which does not have the dimension Region, is used in a scalar way with respect to this dimension when operating with Cube A.
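This is xarray's broadcasting by dimension name; a sketch with illustrative dimensions and values, where the cube lacking the Region dimension is applied to every Region slice:

```python
import xarray as xr

# Cube A has dimensions Region and Year; Cube B only has Year.
cube_a = xr.DataArray(
    [[100, 200], [300, 400]],
    coords={"Region": ["North", "South"], "Year": [2020, 2021]},
    dims=["Region", "Year"],
)
cube_b = xr.DataArray([1, 2], coords={"Year": [2020, 2021]}, dims=["Year"])

# Cube B is broadcast along Region: each Region slice of A is
# multiplied by the same per-Year values of B.
result = cube_a * cube_b
```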
The complete list of operations with Cubes can be found in the `documentation of the XArray library <https://docs.xarray.dev/en/stable/user-guide/computation.html>`_.
To visualize the dataframe manipulation wizards, with the node executed, open the node menu and select the "Handling data" option.
This will display a new widget with the dataframe result in a paginated table format. The table headers will show both the column name and its data type.
The wizards are divided into two main groups. On one hand, there are wizards that impact the entire dataframe:
And, on the other hand, there are wizards that only impact at the column level:
We will first look at the wizards that impact the entire dataframe.
The first available wizard in the list is Select columns, which will show us the list of all columns present in the dataframe along with the data type and a column indicating whether it is an index or not.
Once the desired columns are selected, we confirm and the changes will be reflected in the new widget table.
These changes are not applied to the original node until confirmed. So, if you wish to cancel the changes, simply click on cancel to close the widget and the node will remain with the original values.
On the other hand, if you wish to confirm the changes, they can be applied to the original node or to a new node that will be created as the output of the original node with the modified values.
When we confirm on the original node, the widget closes, and the original executed node is selected with the new changes applied to a copy of the dataframe.
When we confirm to generate a new node, the widget also closes, but this time the newly created node is selected and executed. The code generated by the new node will be as shown in the image.
The next wizard is the index selection wizard, which, as its name suggests, allows us to choose one or more indices from the dataframe columns, as shown in the following sequence:
Generated code and view of the generated table by index selection wizard:
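The generated code typically amounts to a pandas `set_index` call over the chosen columns. A sketch, assuming a hypothetical `data` node and illustrative columns:

```python
import pandas as pd

# Stand-in for the node's dataframe (illustrative columns and values).
data = pd.DataFrame({
    "Year": [2020, 2021],
    "Surface": ["Clay", "Hard"],
    "Wins": [14, 21],
})

# Code of the kind the index selection wizard generates:
result = data.set_index(["Year", "Surface"])
```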
Next is the filtering wizard. This wizard allows filtering one or many columns with the chosen conditions and values. Each row in the table will be an additional filter added to the dataframe, separated by the "Or" or "And" condition, depending on the selected value.
Generated code by the dataframe filtering wizard:
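The filtering code it produces is a pandas boolean selection, with each row of conditions joined by `&` ("And") or `|` ("Or"). A sketch with hypothetical columns:

```python
import pandas as pd

# Stand-in dataframe (illustrative columns and values).
data = pd.DataFrame({"Surface": ["Clay", "Hard", "Clay"], "Wins": [14, 21, 3]})

# Two conditions combined with "And" (&); "Or" would use | instead.
result = data[(data["Surface"] == "Clay") & (data["Wins"] > 5)]
```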
Continuing with the list, we have the group/aggregate wizard. As its name indicates, it opens a list of the columns available for aggregation and grouping. Columns with a numeric data type are the only ones that can be aggregated; the rest can only be selected for grouping.
Generated code by the dataframe group/aggregate wizard:
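The group/aggregate step corresponds to a pandas `groupby` followed by `agg`; a sketch with hypothetical columns, grouping by the non-numeric column and summing the numeric one:

```python
import pandas as pd

# Stand-in dataframe (illustrative columns and values).
data = pd.DataFrame({"Surface": ["Clay", "Clay", "Hard"], "Wins": [3, 5, 7]})

# Group by the non-numeric column and aggregate the numeric one with sum.
result = data.groupby("Surface", as_index=False).agg({"Wins": "sum"})
```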
We continue with the wizard for adding a calculated field as a new column in the dataframe. For this wizard, it is necessary to add a new column name and the value for each of its rows given by the calculation between other columns or any entered calculation. The image shows the total lost points as a new column calculated using the value of each row from the column "player1 total points total" minus "player1 total points won". The new column is generated at the end of the table.
Generated code by the calculated field wizard:
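The calculated-field code reduces to assigning a new column computed from existing ones. A sketch of the example's "total points lost" column (the row values here are invented for illustration):

```python
import pandas as pd

# Illustrative data for the two source columns from the example.
data = pd.DataFrame({
    "player1 total points total": [100, 80],
    "player1 total points won": [60, 50],
})

# New calculated column, appended at the end of the table.
data["total points lost"] = (
    data["player1 total points total"] - data["player1 total points won"]
)
```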
Advancing in the list of wizards that impact the entire dataframe, we find the sort dataframe wizard. As illustrated, it allows selecting one or more columns from the dataframe and defining the order (ascending or descending) by which the data is to be viewed.
Generated code by the sort dataframe wizard:
The last available wizard for this group is the one for dropping duplicate values in the dataframe rows. Just like previous ones, it shows the list of all columns in the table on which the pandas drop_duplicates function is to be applied.
Generated code by the drop duplicates wizard:
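The generated code is a direct call to pandas `drop_duplicates` with the selected columns as `subset`. A sketch with hypothetical columns:

```python
import pandas as pd

# Stand-in dataframe with one duplicated row (illustrative values).
data = pd.DataFrame({
    "Surface": ["Clay", "Clay", "Hard"],
    "Year": [2020, 2020, 2021],
})

# pandas drop_duplicates over the selected columns.
result = data.drop_duplicates(subset=["Surface", "Year"])
```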
Now we continue with the wizards that have an impact only at the column level. These wizards are available for each column of the table and have a local impact, meaning they do not affect data in other columns.
We start with the first on the list, which is the column renaming wizard. This wizard allows you to enter a new name for the selected column, as shown in the sequence:
The generated code for renaming the column is as follows:
Next is the column delete wizard, which is very straightforward. Simply confirming the action deletes the column from the table:
The generated code for deleting the column is as follows:
Next, we present the wizard for duplicating a column. This wizard creates an exact copy of the current column but with the entered name:
Generated code for duplicating a column:
The next wizard in the list allows changing the data type of the column. The data type is shown below the column title.
The options for changing the data type are:
When the date type is chosen, there is the possibility to add a format to the date type to be displayed.
Generated code for changing the data type:
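In pandas terms, the type change is an `astype` call for numeric types, or `pd.to_datetime` (optionally with an explicit format) for dates. A sketch with hypothetical columns:

```python
import pandas as pd

# Stand-in dataframe with string-typed columns (illustrative values).
data = pd.DataFrame({
    "Year": ["2020", "2021"],
    "Date": ["2020-01-15", "2021-03-02"],
})

# Numeric conversion with astype; date conversion with to_datetime,
# where an explicit format string can be supplied.
data["Year"] = data["Year"].astype(int)
data["Date"] = pd.to_datetime(data["Date"], format="%Y-%m-%d")
```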
We continue with the column sorting wizard. This wizard is a simplified version of sorting at the dataframe level, where only the order (ascending or descending) of the selected column is chosen.
Generated code by the sort column wizard:
Another wizard in the list is the column sanitization wizard. It applies one or more actions, chosen from the list of options, to the entire column so that its data is displayed and processed more cleanly. You can choose to apply the changes in a new column, or keep them in the current one, by checking the checkbox at the bottom of the wizard.
Generated code by the column sanitization wizard:
The last wizard in the list is the fill empty values wizard. This wizard, as its name suggests, will fill in the values found empty in the column with a default value. This default value can be chosen in different ways:
The first way is by directly writing the desired value in the field called "Fill value".
Then there are two fill methods that complement the previous case: forward fill (“ffill”) and backward fill (“bfill”). Ffill traverses the column from top to bottom and replaces each null value with the last non-null value found before it. Bfill traverses the column in the opposite direction, replacing each null value with the next non-null value found after it, so the filled values generally differ between the two methods.
Generated code by the fill empty values wizard:
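The three filling strategies map to pandas `fillna`, `ffill`, and `bfill`. A sketch with an illustrative column containing nulls:

```python
import pandas as pd

# Stand-in column containing empty (NaN) values (illustrative).
data = pd.DataFrame({"Wins": [14.0, None, 21.0, None]})

# Fixed value: replace every null with the value entered in "Fill value".
fixed = data["Wins"].fillna(0)

# Forward fill: propagate the last valid value downward.
forward = data["Wins"].ffill()

# Backward fill: propagate the next valid value upward
# (a trailing null has no later value, so it stays null).
backward = data["Wins"].bfill()
```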
When a wizard is confirmed and the table changes its content, the history of changes made is stored internally. This allows for undoing a change and redoing another change.
One functionality that is included, but is not a wizard, is the ability to move the position of columns. This is done by selecting the desired columns and dragging them to the required position. This has a direct impact on the node definition and does not require any additional confirmation.