Data can be collected from the Internet in several ways. We can find and download the data ourselves, or we can choose a ready-made solution. The decision is influenced by several factors.
The amount of content to be collected largely determines the method we choose.
If we need only a small amount of data, we can collect it ourselves. Consider the case where we only want the winning lottery numbers of the last month, organized into a database. This task can be solved with basic technical knowledge. But what if thousands of articles on a leading science site are to form the basis of a study or a development project? Such cases require mass collection, which needs a targeted solution.
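To illustrate how small the first kind of task can be, here is a minimal sketch in Python. The HTML fragment, the class names and the table layout are invented for the example, so a real results page would need its own pattern (and for anything less regular, a proper HTML parser rather than a regular expression).

```python
import re
import sqlite3

# Hypothetical HTML fragment standing in for a downloaded results page.
SAMPLE_HTML = """
<table>
  <tr><td class="date">2024-05-04</td><td class="numbers">3, 17, 24, 56, 81</td></tr>
  <tr><td class="date">2024-05-11</td><td class="numbers">9, 12, 40, 63, 77</td></tr>
</table>
"""

# Assumed row layout: a date cell immediately followed by a numbers cell.
ROW_PATTERN = re.compile(
    r'<td class="date">([^<]+)</td><td class="numbers">([^<]+)</td>'
)


def extract_draws(html):
    """Return (date, numbers) tuples found in the HTML fragment."""
    return [(date.strip(), numbers.strip())
            for date, numbers in ROW_PATTERN.findall(html)]


def store_draws(connection, draws):
    """Create the table if needed and insert (or overwrite) the draws."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS draws (draw_date TEXT PRIMARY KEY, numbers TEXT)"
    )
    connection.executemany("INSERT OR REPLACE INTO draws VALUES (?, ?)", draws)
    connection.commit()


# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
store_draws(connection, extract_draws(SAMPLE_HTML))
rows = connection.execute(
    "SELECT draw_date, numbers FROM draws ORDER BY draw_date"
).fetchall()
print(rows)
```

A script like this is enough for a handful of pages with a stable layout. It does not scale to thousands of articles with varying structure, which is where the targeted solutions discussed here come in.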
Data quality is also an important factor in data collection and affects the kind of solution we choose.
Data quality shows to what extent the data meets the requirements.
Such requirements may include the following:
– each record must include the date of publication
– every article should be about taxation
– there should be no duplicates
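Requirements like these can be expressed as simple checks on each collected record. The sketch below is only an illustration: the field names (`url`, `title`, `body`, `published`) are assumptions, and the topic test is a naive keyword match, whereas a real project would more likely use a classifier or a curated term list.

```python
def meets_requirements(record, seen_urls, topic_keyword="tax"):
    """Check one record against the three example requirements."""
    # 1. The record must include the date of publication.
    if not record.get("published"):
        return False
    # 2. The article should be about the topic (naive keyword match).
    text = (record.get("title", "") + " " + record.get("body", "")).lower()
    if topic_keyword not in text:
        return False
    # 3. No duplicates: here a duplicate is a record with an already seen URL.
    url = record.get("url")
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True


def filter_records(records):
    """Keep only the records that pass all checks, in their original order."""
    seen_urls = set()
    return [r for r in records if meets_requirements(r, seen_urls)]


# Invented sample records for demonstration only.
records = [
    {"url": "https://example.com/a", "title": "New tax rules", "body": "", "published": "2024-01-10"},
    {"url": "https://example.com/b", "title": "Tax changes", "body": "", "published": ""},
    {"url": "https://example.com/c", "title": "Football results", "body": "", "published": "2024-01-11"},
    {"url": "https://example.com/a", "title": "New tax rules", "body": "", "published": "2024-01-10"},
]
clean = filter_records(records)
print([r["url"] for r in clean])
```

Of the four sample records, the second lacks a publication date, the third is off-topic and the fourth repeats the first, so only the first one is kept.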
In many cases, collection alone is not enough, and further processing (data cleaning, data enrichment) is needed to meet the expectations. Typical tasks include delivering the file extension or format required by the next step of the workflow (e.g. visualization with a business intelligence or statistical tool such as Tableau, RapidMiner, Power BI, Google Data Studio or IBM SPSS) or standardizing date formats that differ between sources.
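Standardizing dates from different sources is a typical cleaning step. The following sketch tries a short list of formats and returns an ISO 8601 date. The list of formats is an assumption made for the example, and ambiguous inputs such as 05/03/2024 are read day-first here, which would have to be decided per source in practice.

```python
from datetime import datetime

# Assumed input formats; extend the tuple for each new source.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def to_iso_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


for value in ("2024-03-05", "05/03/2024", "5.3.2024", "next Tuesday"):
    print(value, "->", to_iso_date(value))
```

Returning `None` for unrecognized values, rather than guessing, makes it easy to count and inspect the records that still need attention.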
In some cases, the downloaded data becomes valuable only after information from another source is appended (for example, financial data may have to be converted at current foreign exchange rates if the original website does not report amounts in the expected currency).
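As a rough sketch of such an enrichment step, the function below adds a converted amount to each record. The exchange rates are made-up placeholders, not real market data, and the field names are assumptions; in practice the rates would come from a second data source.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical rates to EUR, for illustration only.
RATES_TO_EUR = {
    "EUR": Decimal("1"),
    "USD": Decimal("0.90"),
    "HUF": Decimal("0.0025"),
}


def add_eur_amount(record, rates=RATES_TO_EUR):
    """Return a copy of the record with an added 'amount_eur' field."""
    rate = rates[record["currency"]]  # raises KeyError for unknown currencies
    amount_eur = (Decimal(record["amount"]) * rate).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return {**record, "amount_eur": amount_eur}


enriched = add_eur_amount({"amount": "200.00", "currency": "USD"})
print(enriched["amount_eur"])
```

`Decimal` is used instead of floating-point numbers so that rounding to two decimal places behaves predictably for monetary values.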
The complexity of accessing the content also limits the range of usable data acquisition tools.
The structure of web pages (dynamic and scrolling pages, pages that require log-in, or otherwise tricky pages) may make data collection very difficult. Collection is also harder when the data can only be obtained by downloading from multiple sources simultaneously.
In addition, the rules that a site publishes in its robots.txt file about which pages and content may be crawled should be respected.
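Checking robots.txt can be automated with Python's standard library. In the sketch below, the robots.txt content, the user agent name `example-collector` and the URLs are all invented for the example. A real collector would download the file from the target site (for instance with the parser's `set_url` and `read` methods) instead of parsing a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used instead of a network request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("example-collector", "https://example.com/articles/1")
blocked = parser.can_fetch("example-collector", "https://example.com/private/report")
print(allowed, blocked)
```

Calling `can_fetch` before each request keeps the collector away from paths the site owner has excluded.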
Today, there are open source tools such as Scrapy or MechanicalSoup. These are useful, free tools, but on their own they are not well suited to collecting content from dynamic sites, which is a serious problem, because such sites currently make up almost 30% of the web. Thus, these solutions do not guarantee outstanding results, and increasing their efficiency requires serious developer or programmer knowledge.
Ready-made solutions such as Diffbot and import.io are available on the international market for a few hundred dollars per month. These tools are user-friendly and customizable, but they are not suitable for individual or more demanding needs, and they provide extra services only at higher subscription rates. In addition, these services do not provide related text analytics solutions (e.g. an enterprise search engine for searching the collected text content).
Tools such as Precognox TAS Data Collector offer the perfect solution if you need to collect larger amounts of web data with high data quality, even when access to the data is a major challenge. Complex requirements can only be met by such a specialized data collection solution.
In addition, Data Collector is part of a complex text analytics platform (TAS), which provides unique solutions not only for collecting data but also for working with the collected content. All this comes at a very reasonable price, as it is available from a few tens of dollars per month (in the case of repeated data collection). The IT background, software development and text analytics experience of Precognox behind TAS (Text Analytics System) guarantee the high quality of this unique solution.
Before choosing a tool for downloading data from the Internet, it is essential to determine the amount of data, the features of the source pages, and the criteria the data is expected to meet. The availability of the data source and any further text analysis workflows to be run on the downloaded content should also be considered. Once the requirements are clear, it is easy to choose the right tool, whether it is an open source tool, a ready-made service or a specialized solution.