How can I use web-based data in RapidMiner?


RapidMiner is a data science platform that provides an integrated environment for data preparation, machine learning and predictive analysis. It can also visualize the contents of internal corporate data sources.
But what can we do if we want to use external data (available on the Internet) within RapidMiner?
The solution is Precognox’s TAS Data Collector service, which collects unstructured data from the Internet and organizes it into a structured MySQL database that can be used directly in RapidMiner.

The process is described below step-by-step:

1. Selecting and downloading a particular web page (as a data source)


In the first step, we download the selected web data (in our example, the content of the Keresővilág Blog website) using the Data Collector service. With unstructured web data such as text content, however, downloading comes with a number of additional tasks (data cleaning, validation), which are carried out by our data specialists. The result of these workflows is a structured database that is updated continuously, so the data is always available and usable.
The user gets access to the structured database (server details, user name and password) through a secured, password-protected channel.
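
To make the idea of turning unstructured web content into a structured database more concrete, here is a minimal Python sketch. It is not the Data Collector's actual implementation: the HTML snippet, the PostParser class, the posts table and its columns are all hypothetical, and the standard-library sqlite3 module stands in for MySQL so that the example runs without a server.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample of downloaded blog markup (not real site content).
SAMPLE_HTML = """
<article><h2> First post </h2><p>Some   text
about jobs.</p></article>
<article><h2>Second post</h2><p>More text.</p></article>
"""


class PostParser(HTMLParser):
    """Collects one {'title', 'body'} record per <article> element."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.posts.append({"title": "", "body": ""})
        elif tag == "h2":
            self._field = "title"
        elif tag == "p":
            self._field = "body"

    def handle_endtag(self, tag):
        if tag in ("h2", "p"):
            self._field = None

    def handle_data(self, data):
        # Text outside <h2>/<p> (e.g. whitespace between tags) is ignored.
        if self._field and self.posts:
            self.posts[-1][self._field] += data


def clean(text):
    """Basic cleaning step: collapse runs of whitespace."""
    return " ".join(text.split())


def build_table(html):
    """Parse the markup and store the cleaned records in a table."""
    parser = PostParser()
    parser.feed(html)
    conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
    conn.execute("CREATE TABLE posts (title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO posts VALUES (?, ?)",
        [(clean(p["title"]), clean(p["body"])) for p in parser.posts],
    )
    conn.commit()
    return conn


conn = build_table(SAMPLE_HTML)
rows = conn.execute("SELECT title, body FROM posts ORDER BY rowid").fetchall()
```

Each article ends up as one row with cleaned title and body columns, which is the kind of tabular shape a tool like RapidMiner can work with.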

2. To load the data into RapidMiner, click Import Data, then click Database:


Click New Connection

Here you can enter the connection details for the database (which we provide to you beforehand), then click OK
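
RapidMiner is a Java application and talks to databases over JDBC, so the details entered in the dialog correspond to a MySQL JDBC URL. The short Python sketch below only illustrates how the pieces fit together; the ConnectionDetails class and the host, database and user values are hypothetical placeholders, not the values you will receive.

```python
from dataclasses import dataclass


@dataclass
class ConnectionDetails:
    host: str
    database: str
    user: str
    port: int = 3306  # MySQL's default port


def jdbc_url(details):
    """Build a MySQL JDBC URL of the form jdbc:mysql://host:port/database."""
    return f"jdbc:mysql://{details.host}:{details.port}/{details.database}"


# Hypothetical values; use the ones supplied with your access.
details = ConnectionDetails(host="db.example.com", database="blog_data", user="analyst")
print(jdbc_url(details))  # jdbc:mysql://db.example.com:3306/blog_data
```

The password is deliberately left out of the sketch: it should be typed into the connection dialog rather than stored in a script.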

3. Select the database, then click Next

4. Once the data is loaded, it can be managed and visualized as usual

Example of a completed visualization
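
A chart like this usually starts from an aggregation over the structured table. As a rough sketch, assuming a hypothetical posts table with a published date column (the real schema may differ), the query below counts posts per month. Python's sqlite3 module is used as a stand-in so the example runs without a MySQL server, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.execute("CREATE TABLE posts (title TEXT, published TEXT)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [
        # Invented sample rows, dates in ISO format (YYYY-MM-DD).
        ("A", "2019-01-05"),
        ("B", "2019-01-20"),
        ("C", "2019-02-11"),
    ],
)

# Take the first seven characters of the date (YYYY-MM) as the month key.
query = """
SELECT SUBSTR(published, 1, 7) AS month, COUNT(*) AS n_posts
FROM posts
GROUP BY month
ORDER BY month
"""
per_month = conn.execute(query).fetchall()
print(per_month)  # [('2019-01', 2), ('2019-02', 1)]
```

The resulting month and count pairs are the kind of input a bar or line chart needs, whether the chart is drawn in RapidMiner or in another tool.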

In addition to providing insight into the content of the web-based data source (the chosen website), the visualization is a valuable business asset: it can be used in presentations, business reports, evaluations, or even competitive analyses. In this way, the potential of the huge amounts of data on the Internet can be exploited.

Thanks to this integration, the structured databases provided by Data Collector can also be visualized with most well-known business intelligence tools (Tableau, Power BI, Google Data Studio, IBM SPSS).


