How can we get quality data from the web?

, Precognox


In most cases, content on the web, the largest data source available, cannot be fed directly into further workflows, because websites publish information in many different formats. Additional, more complex processing is needed to arrive at data of the required quality at the end of the process.

What criteria must the data meet, and which processes are needed to obtain data of the right quality once it has been collected from the web?

The quality of data always depends on the extent to which it meets the requirements. These requirements concern the data’s
– content
– format
– quantity.

These criteria are always determined by how the data will be used. However, content on the web often fails to meet them because it:
– comes in different formats
– comes from different data sources (different websites)
– is constantly expanding and changing
– is not compatible (cannot be integrated) with further steps of data processing.
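
As an illustration, the three criteria (content, format, quantity) can be written down as explicit checks against a set of collected records. The sketch below is only an assumption about how such a check might look; the `Requirements` class, the field names and the sample records are hypothetical and do not describe any Precognox tool.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    required_fields: tuple  # content: fields every record must carry
    field_types: dict       # format: expected Python type per field
    min_records: int        # quantity: minimum number of usable records


def quality_report(records, req):
    """Count the records that satisfy the content and format criteria,
    and report whether enough of them remain to meet the quantity criterion."""

    def content_ok(record):
        # A field counts as missing when it is absent, None or an empty string.
        return all(record.get(f) not in (None, "") for f in req.required_fields)

    def format_ok(record):
        return all(isinstance(record.get(f), t) for f, t in req.field_types.items())

    usable = [r for r in records if content_ok(r) and format_ok(r)]
    return {
        "total": len(records),
        "usable": len(usable),
        "meets_quantity": len(usable) >= req.min_records,
    }


# Hypothetical sample: one good record, one with missing content,
# one with the price in the wrong format (text instead of a number).
sample = [
    {"title": "A", "price": 1.0},
    {"title": "", "price": 2.0},
    {"title": "B", "price": "3"},
]
requirements = Requirements(("title", "price"), {"price": float}, 2)
report = quality_report(sample, requirements)
```

Since the criteria depend on how the data will be used, the requirements are passed in as a parameter rather than hard-coded.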




It is easy to see that several processing steps are required to make web content suitable for further work.

These tasks include:

– data collection – collecting Internet content from specific websites
– data cleaning – removing unnecessary, irrelevant or incorrect data
– data enrichment – extending the data with information that is not available on the given website
– validation
– annotation
– data conversion – organizing and transforming the data into the required format (e.g. JSON or a MySQL table)
– data transmission – delivering the right data to the user in a secure way (e.g. password-protected access).
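
To make the chain of tasks more concrete, here is a minimal Python sketch of the cleaning, enrichment, validation and conversion steps. It is not how the Data Collector is implemented: the sample records, the `CATEGORY_BY_KEYWORD` lookup table and all function names are hypothetical, and the collection step is replaced by an in-memory list so that the example runs without network access.

```python
import json
import re

# Hypothetical raw records, standing in for content collected from web pages.
RAW_RECORDS = [
    {"title": "  Laptop <b>Pro</b> 15 ", "price": "1 299,00 EUR", "url": "https://example.com/p/1"},
    {"title": "", "price": "n/a", "url": "https://example.com/p/2"},
    {"title": "Phone Mini", "price": "399,50 EUR", "url": "https://example.com/p/3"},
]

# Hypothetical lookup table: information that is not present on the website itself.
CATEGORY_BY_KEYWORD = {"laptop": "computers", "phone": "mobile"}


def clean(record):
    """Data cleaning: strip HTML tags and surplus whitespace, parse the price."""
    title = re.sub(r"<[^>]+>", "", record["title"])
    title = re.sub(r"\s+", " ", title).strip()
    match = re.search(r"[\d\s]+,\d{2}", record["price"])
    price = None
    if match:
        # "1 299,00" -> "1299.00" -> 1299.0
        price = float(re.sub(r"\s", "", match.group()).replace(",", "."))
    return {"title": title, "price": price, "url": record["url"]}


def enrich(record):
    """Data enrichment: add a category taken from the external lookup table."""
    category = None
    for keyword, name in CATEGORY_BY_KEYWORD.items():
        if keyword in record["title"].lower():
            category = name
            break
    return {**record, "category": category}


def is_valid(record):
    """Validation: keep only records with a title and a positive price."""
    return bool(record["title"]) and record["price"] is not None and record["price"] > 0


def to_json(records):
    """Data conversion: serialize the records to JSON."""
    return json.dumps(records, indent=2)


def run_pipeline(raw_records):
    cleaned = (clean(r) for r in raw_records)
    enriched = (enrich(r) for r in cleaned)
    return [r for r in enriched if is_valid(r)]


result = run_pipeline(RAW_RECORDS)
output = to_json(result)
```

In this sketch the second sample record is dropped at the validation step because it has neither a title nor a parsable price. Annotation and secure transmission are left out, since they depend heavily on the use case and the infrastructure.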

Producing quality data from Internet content is a very complex process that requires knowledge of a number of fields (programming, software development, data science, linguistics or even artificial intelligence). It is worth choosing a comprehensive solution that covers every step of the task, so that the user has quality data at the end of the process.

The Data Collector (and the Precognox TAS text analytics platform) has been developed to integrate all of the workflows described above.
The partner only has to specify the relevant webpages and contents (inputs) and the criteria for the output; the Precognox team performs the entire process. That way you get quality data from the web.


Pictures: Pixabay
