Let’s take a break from cloud providers and see this great tool, available right on your desktop!
Again, this is the problem to solve…
The Titanic dataset is a classic classification problem: the dataset contains several pieces of information about the passengers (age, sex, ticket class and so on) and the survival (yes/no) target value.
The goal is to train a model to predict if a given passenger survived or not. The data is not ready “as is”, so we’ll need to preprocess it a bit.
Orange is “an open source machine learning and data visualization for novice and expert. Interactive data analysis workflows with a large toolbox.”
It can be downloaded from the Orange website or installed directly using Anaconda Navigator (the graphical manager for conda packages and environments).
Let’s see it in action!
First, let’s create a new project.
This is the interface. Let’s dig in a little more.
On the left, there is an accordion, with various sections (Data, Visualize, Model, Evaluate, Unsupervised), each containing “blocks” that can be used and linked together to perform complex actions.
In our case, we want to load the dataset, perform some analysis and cleaning, and then choose a couple of algorithms to evaluate their performance (as usual, we focus on the tool, not on how to obtain better results).
Let’s start by loading the file, using the “File” block. Once the file is selected, the data types are detected automatically and you can pick the target feature (in our case, ‘Survived’).
Let’s take a look at the data: to do this, just drag and drop “Data Table” and connect it to the previous block. Some summary info is available and it is possible to sort by column.
Let’s visualize something, for example distributions. To do so, let’s drag and drop “Distributions” from the Visualize section, connect it to the Data Table block and select Age.
A distribution curve is shown, and it is possible to group by a categorical feature to perform further analysis. Let’s look at survival by sex.
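For readers who like to see the idea in code, here is a rough sketch of what “detect the data types and pick a target” means. It is not how Orange does it internally: the sample rows are made up, and `infer_type` is a hypothetical helper that only tells numeric columns from categorical ones.

```python
import csv
import io

# Made-up sample in the shape of the Titanic CSV (illustration only).
SAMPLE = """Survived,Pclass,Sex,Age
0,3,male,22
1,1,female,38
1,3,female,26
"""

def infer_type(values):
    """Hypothetical helper: 'numeric' if every value parses as a float."""
    try:
        for value in values:
            float(value)
        return "numeric"
    except ValueError:
        return "categorical"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
column_types = {name: infer_type([row[name] for row in rows]) for name in rows[0]}

# The target is chosen by hand, just like in the "File" block.
target = "Survived"
features = [name for name in rows[0] if name != target]
```

A real detector is smarter than this (a 0/1 column such as ‘Survived’ is better treated as categorical), which is exactly the kind of thing the File block lets you adjust by hand.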
Super easy! Now let’s remove some columns and convert the categorical features to numeric ones.
As you can see, it’s very intuitive, and all the steps can be grasped at a glance. Time to train and evaluate a model: let’s split the data into train and test sets and build a Logistic Regression model.
Every step can have an associated form to configure some parameters: for example, the Data Sampler lets you choose a method to sample the data (fixed, cross-validation, etc.).
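In plain Python, these two cleaning steps could look roughly like the sketch below. The rows are made up, `drop_columns` and `encode_categorical` are hypothetical helpers, and the simple integer encoding used here is an assumption: Orange offers several encoding options and may produce different columns.

```python
# Made-up rows mimicking a few Titanic columns (illustration only).
rows = [
    {"Name": "A", "Sex": "male", "Age": 22.0, "Pclass": 3, "Survived": 0},
    {"Name": "B", "Sex": "female", "Age": 38.0, "Pclass": 1, "Survived": 1},
]

def drop_columns(rows, columns):
    """Return copies of the rows without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]

def encode_categorical(rows, column):
    """Replace each category with an integer index (categories sorted by name)."""
    categories = sorted({row[column] for row in rows})
    mapping = {category: index for index, category in enumerate(categories)}
    encoded = [{**row, column: mapping[row[column]]} for row in rows]
    return encoded, mapping

cleaned = drop_columns(rows, {"Name"})
encoded, sex_mapping = encode_categorical(cleaned, "Sex")
```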
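To make the “fixed” sampling option concrete, here is a minimal sketch of a shuffled train/test split. The 70/30 ratio, the seed and the function name `train_test_split` are assumptions chosen for the example, not Orange’s defaults.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in data: ten row ids instead of real passengers.
train, test = train_test_split(range(10))
```

Fixing the seed plays the same role as a “replicable sampling” checkbox: running the split again gives the same two sets.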
Now it’s time to evaluate the model: we’ll use “Test and Score”, “Predictions” and “Confusion Matrix” to see how the model is performing…
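If you are wondering what the Confusion Matrix block is counting, the sketch below does the same thing by hand for a binary target. The labels are made up (1 = survived, 0 = did not survive) and `confusion_matrix` is a hypothetical helper, not an Orange function.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count the four outcomes of a binary classifier (1 is the positive class)."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # survived, predicted survived
        "fp": counts[(0, 1)],  # did not survive, predicted survived
        "fn": counts[(1, 0)],  # survived, predicted not survived
        "tn": counts[(0, 0)],  # did not survive, predicted not survived
    }

# Made-up labels for illustration.
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
```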
Great, everything at your fingertips. And if you change something at any point, for example selecting a subset of data or changing some parameters, everything is automatically updated! This is really useful to do fast experiments, trying for example different models and hyperparameters.
Let’s add another algorithm, a Tree, and find out whether it performs better than Logistic Regression (LR).
Performance is similar, as we can see from the F1 score and the ROC Analysis.
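As a reminder of what the F1 score measures, here is a small sketch that computes it for one class from confusion-matrix counts. The counts in the example are made up, and the number Orange reports may differ because it can be averaged across classes.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 2 true positives, 1 false positive, 1 false negative.
score = f1_score(tp=2, fp=1, fn=1)
```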
I really love Orange and you should too :)
It’s very easy to use, excellent for visualizing concepts and data (it can be used as a training tool) and can be extended with specific add-ons.
So, if you have to experiment on some data, give it a try!
See you next time