What is TAS Data Collector?
With TAS Data Collector, users can download unstructured data (textual content) from the Internet and structure it, making it accessible to other information systems and suitable for further processing, analysis, or visualization.
The content collected by TAS Data Collector can be used immediately or can serve as a basis for text analysis workflows implemented with other built-in modules of the TAS Platform.
Data collection workflow
- the service collects the data (textual content) of the webpages (or parts of webpages) specified by the customer
- further steps (data cleaning, data enrichment, validation) are carried out under the supervision of our specialists
- the result is a structured database that can be used for further data processing (analysis, visualization) or serve as a basis for further text analytics solutions
- the collected, properly formatted content is delivered to the customer (if required, through an authenticated, password-protected channel)
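The workflow above can be illustrated with a minimal sketch. It is not the TAS Data Collector implementation: the names `VisibleTextParser`, `Record`, `collect`, and `to_json` are hypothetical, the HTML is passed in as a string instead of being downloaded, and "validation" is reduced to checking that some text was found.

```python
from dataclasses import dataclass, asdict
from html.parser import HTMLParser
import json
import re


class VisibleTextParser(HTMLParser):
    """Collects text nodes and skips the content of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


@dataclass
class Record:
    """One structured row produced from one page (hypothetical schema)."""
    url: str
    text: str
    valid: bool


def collect(url, html):
    """Extract text, clean whitespace, and flag empty results as invalid."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps block elements apart; it may also split
    # a word that is interrupted by an inline tag.
    raw = " ".join(parser.chunks)
    text = re.sub(r"\s+", " ", raw).strip()
    return Record(url=url, text=text, valid=bool(text))


def to_json(records):
    """Serialize records so that another system can consume them."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)
```

A call such as `collect("https://example.com/a", page_html)` returns one `Record`; a list of such records passed to `to_json` gives a structured export. Data enrichment and delivery over an authenticated channel are left out of the sketch.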
Features of the TAS Data Collector
- TAS Data Collector can extract the visible data, metadata (tags, image descriptions), and pagination of a website.
- Sites, subpages, login-required pages, hierarchical sites, pages with a slideshow component, and pages with multilingual content pose no problem for TAS Data Collector.
- When data is recognized as hidden, we offer a screenshot solution that preserves the exact original look of the data.
- In some cases robots.txt forbids data collection. We respect this; however, collecting such data is also possible.
- We can extract text from many document and image formats (PDF, spreadsheet, diagram, or image files).
- We are prepared to produce and deliver any required output format, even ones that require software development.
Important! Please note that we are not responsible for how the collected data is used afterwards.
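As an illustration of the robots.txt point in the feature list, the following sketch checks whether a URL may be collected, using the `urllib.robotparser` module from the Python standard library. It does not show how TAS Data Collector handles robots.txt internally; the helper names, the user agent string `example-collector`, and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for the example only; a real site publishes its own file.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse the text of a robots.txt file into a reusable checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


checker = build_robots_checker(EXAMPLE_ROBOTS_TXT)
private_ok = is_allowed(checker, "example-collector",
                        "https://example.com/private/report.html")
public_ok = is_allowed(checker, "example-collector",
                       "https://example.com/public/index.html")
```

With the sample rules, `private_ok` is `False` and `public_ok` is `True`. To check a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.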
What can the collected content be used for?
- research and development projects
- new content and publications
- service, information, thematic sites, blogs, public interest and open data portals
- analyses, statistics, visualizations
- enterprise processes / operations, data backup
- competitor and media monitoring
- searchable databases
- artificial intelligence, machine learning processes
- data change monitoring
Appearance of TAS Data Collector
The TAS Data Collector GUI lets you monitor the download stream. The appearance of the interface matches the corporate identity of the TAS Platform.
The interface provides information about:
- resource overview: which resources are connected and how many records have been received from each
- the number of valid and broken records
- overview of the total number of records
- the date of the data collection
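The figures listed above could be derived from the collected records roughly as follows. This is a sketch under assumptions, not the data model of the TAS Data Collector GUI: the `CollectedRecord` fields and the `summarize` function are hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class CollectedRecord:
    """Hypothetical record shape: source, validity flag, collection date."""
    resource: str
    valid: bool
    collected_on: date


def summarize(records):
    """Compute the overview figures a monitoring screen might display."""
    per_resource = Counter(r.resource for r in records)
    valid = sum(1 for r in records if r.valid)
    return {
        "records_per_resource": dict(per_resource),
        "valid": valid,
        "broken": len(records) - valid,
        "total": len(records),
        "collection_dates": sorted({r.collected_on.isoformat() for r in records}),
    }
```

Each key of the returned dictionary corresponds to one item of the list: records per resource, valid and broken counts, the total, and the collection dates.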
TAS Data Collector technical description (requirements, integration, open source software used)
Initial resource requirements (on-premise installation):
- x86_64 CPU with at least 4 cores
- at least 16 GB RAM
- 35 GB hard drive (storage needs may increase in some cases)
- 64-bit Linux, Windows, or macOS
- 64-bit JDK 1.8 or higher
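The hardware part of these requirements can be checked before installation with a short script. The sketch below is not shipped with the TAS Platform; the function names are hypothetical, gigabytes are assumed to mean 1024^3 bytes, and the amount of RAM has to be supplied by the caller because the Python standard library offers no portable way to read it. The JDK version is not checked.

```python
import os
import platform
import shutil

MIN_CPU_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 35
# platform.machine() reports "x86_64" on Linux/macOS and "AMD64" on Windows.
SUPPORTED_MACHINES = {"x86_64", "amd64"}


def check_requirements(machine, cpu_cores, ram_gb, free_disk_gb):
    """Return a list of unmet requirements; an empty list means all passed."""
    problems = []
    if machine.lower() not in SUPPORTED_MACHINES:
        problems.append(f"unsupported CPU architecture: {machine}")
    if cpu_cores < MIN_CPU_CORES:
        problems.append(f"only {cpu_cores} CPU cores, need {MIN_CPU_CORES}")
    # ram_gb may be None when the caller could not determine it.
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"only {ram_gb} GB RAM, need {MIN_RAM_GB}")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"only {free_disk_gb:.0f} GB free disk, need {MIN_DISK_GB}")
    return problems


def probe_local_machine(path="."):
    """Gather the values that the standard library can report."""
    return {
        "machine": platform.machine(),
        "cpu_cores": os.cpu_count() or 0,
        "ram_gb": None,  # not available portably from the standard library
        "free_disk_gb": shutil.disk_usage(path).free / 1024 ** 3,
    }
```

Calling `check_requirements(**probe_local_machine())` on the target machine lists anything that falls short of the stated minimums.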
Accessibility and platform support for developers
- Cloud API
- On-premise API
- Java SDK available