How To Scale Your Web Content Harvesting Operation
September 28, 2018
To develop a high-volume web content harvesting operation, businesses have three technical options: cloud-hosted SaaS software, on-premise software, and outsourced or managed services. Each typically uses the same underlying technology.
Let’s look at these three options for implementing content harvesting, along with the key considerations and merits of each for high-volume data extraction projects.
Web Content Harvesting Technology: What to Look For
Whether your organization manages content harvesting operations internally using software or partners with a managed service provider, you want to make sure that the underlying technology is highly capable. Here are a few features to look for:
- Point-and-click interface and dashboards: Intuitive features that simplify agent creation and script writing (obviously not applicable to managed services)
- Robust error handling capabilities: Websites change and release new code all the time. Agents need to be smart enough to handle those changes or pause so that corrections can be made.
- XPath capability: Essential to extracting web data from complex websites
- Ability to capture data from complex and dynamic tables
- Ability to move from page to page automatically
- Automatic data list detection
- Ability to also extract data from formats such as Excel, Word, and PDF
- Capturing name-value pair lists
- Export to commonly used formats
- Customer support: Knowledgeable, readily available support should be a given
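Several of the features above (XPath extraction, moving from page to page, pausing on layout changes, and exporting to a common format) can be illustrated with a minimal sketch. It uses only the Python standard library and is not how any particular product works. The sample pages, the `products` table, the `next` link class, and the names `harvest`, `to_csv` and `LayoutChanged` are all hypothetical, and the in-memory `PAGES` dictionary stands in for real HTTP requests.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample pages standing in for a paginated site.
PAGES = {
    "page1": """<html><body>
      <table id="products">
        <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
        <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
      </table>
      <a class="next" href="page2">Next</a>
    </body></html>""",
    "page2": """<html><body>
      <table id="products">
        <tr><td class="name">Gizmo</td><td class="price">5.00</td></tr>
      </table>
    </body></html>""",
}


class LayoutChanged(Exception):
    """Raised when an expected element is missing, so the run can pause."""


def harvest(start, pages):
    """Follow 'next' links from page to page and collect table rows."""
    rows, key, seen = [], start, set()
    while key is not None and key not in seen:
        seen.add(key)
        root = ET.fromstring(pages[key])
        # ElementTree supports a limited XPath subset, including
        # attribute predicates such as [@id='products'].
        table = root.find(".//table[@id='products']")
        if table is None:
            raise LayoutChanged(f"no products table on {key}")
        for tr in table.findall("./tr"):
            name = tr.find("./td[@class='name']")
            price = tr.find("./td[@class='price']")
            if name is None or price is None:
                raise LayoutChanged(f"unexpected row layout on {key}")
            rows.append({"name": name.text, "price": price.text})
        # Move to the next page if a 'next' link exists, otherwise stop.
        link = root.find(".//a[@class='next']")
        key = link.get("href") if link is not None else None
    return rows


def to_csv(rows):
    """Export harvested rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


records = harvest("page1", PAGES)
print(to_csv(records))
```

Real pages are rarely well-formed XML, so a production agent would typically need a tolerant HTML parser and fuller XPath support than this sketch assumes.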
Let’s look at the three ways these technical features can be delivered.
Cloud-Based SaaS Software
Cloud-based software-as-a-service (SaaS) solutions are best for companies with ongoing projects or high-capacity, enterprise-level needs. To take advantage of cloud software, companies must have one or more employees or contractors who can learn the agent creation software and manage agents over time.
A developer or coding specialist is not always needed to create agents. The technical abilities and expertise required of the individual or team creating the agents depend on the complexity of the target websites. If the sites are comparatively simple, users with little to no technical background can learn the software’s point-and-click interface, then create and manage the agents.
On-Premise Software
Hedge funds, banks, government agencies and healthcare organizations, as well as organizations that need to harvest intranet data or data affected by privacy regulations and restrictions, have one choice. For these organizations, extraction volume is secondary to the necessity of a secure, self-hosted environment, which means On-Premise licensed software.
It’s the best choice for companies with the following resources and characteristics:
- Ability to host hardware and software in-house
- Employees capable of creating agents, managing agents and validating alternative data
- Employees who can take over the management of web content harvesting operations should a key staff member leave or get promoted within the company
- An infrastructure that supports enterprise-scale content harvesting workloads, including alternative data repositories
- Confidence that they can maintain high levels of productivity while their employees manage ongoing content harvesting needs
Even large, sophisticated enterprises may have trouble meeting these criteria. When companies with an on-premise solution and dedicated content harvesting personnel face a project that’s too complex for even experienced staff members to handle on their own, outsourcing is the way to go. Keep reading to learn about managed data services.
Managed Data Services
Outsourcing your web content harvesting to a company like Mozenda means that someone else (in this case, us) is responsible for delivering exactly what you need when you need it. Outsourcing is equally appropriate for a one-time project or an ongoing content harvesting program or data feed.
If your company is concerned with any of the following, a Managed Data Services option is the best choice:
- Employees who believe time-consuming content harvesting tasks prevent them from handling their core job responsibilities
- Current staff members who are unfamiliar with any or all major facets of high-volume web content harvesting: creating and managing many agents across multiple websites, following web harvesting best practices, performing high-volume site scrapes, etc.
- Insufficient budgetary availability to hire specialized staff or train existing employees on web content harvesting tasks
- The inability to organize harvested data in structured formats such as XML, CSV or TSV
- Insufficient technological foundation for enterprise-scale web content harvesting processes that involve the extraction of vast quantities of alternative data multiple times a week or even daily
- The work of regularly updating agents and ensuring the accuracy and completeness of collected data
- Harvesting needs that are infrequent or limited to one-off projects
- Development and staffing of an in-house content harvesting operation
If you’re considering outsourcing your high-volume web data extraction needs to a vendor, ask the vendors under evaluation the following:
- Can the company scrape highly complex sites, especially large numbers of sites utilizing blocking methods?
- Can the vendor complete time-sensitive data harvesting projects to specification and on time?
- Does the vendor support your preferred data format so extracted data is readily shareable with all relevant stakeholders?
If your company is unsure of which Managed Data Services vendor to choose, request a proof of concept. Ask the company to give you a small sample of exactly what you want.
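When a proof-of-concept sample arrives, one way to review it is to check it for completeness before reading it by eye. The sketch below assumes the sample is delivered as CSV. The sample data, the required column names and the `audit` function are hypothetical placeholders to replace with your own specification.

```python
import csv
import io

# Hypothetical sample delivery and required fields; substitute your own.
SAMPLE_CSV = """name,price,url
Widget,9.99,https://example.com/widget
Gadget,,https://example.com/gadget
Widget,9.99,https://example.com/widget
"""
REQUIRED = ["name", "price", "url"]


def audit(text, required):
    """Count rows, rows with blank required fields, and exact duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    missing_columns = [c for c in required if c not in (reader.fieldnames or [])]
    total = blanks = duplicates = 0
    seen = set()
    for row in reader:
        total += 1
        # A row is incomplete if any required field is empty or whitespace.
        if any(not (row.get(c) or "").strip() for c in required):
            blanks += 1
        # Rows with identical required fields are treated as duplicates.
        key = tuple(row.get(c) for c in required)
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {
        "missing_columns": missing_columns,
        "rows": total,
        "rows_with_blanks": blanks,
        "duplicate_rows": duplicates,
    }


report = audit(SAMPLE_CSV, REQUIRED)
print(report)
```

A check like this only measures completeness. Accuracy still has to be verified by comparing a few records against the source pages.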
Moving Forward with High-Volume Web Content Harvesting
Scaling your web content harvesting program isn’t a completely linear exercise. If you choose Cloud-based software or On-Premise software, you’ll likely have several questions as you get started, but that’s what the Mozenda Support team is for. If you choose Managed Data Services, the vendor you choose will be responsible for resolving the issues your personnel would otherwise have faced.
We suggest moving forward with your high-volume content harvesting strategy by speaking with one of our Customer Support representatives below. They will give you an educated answer on which operation is best for your data scraping needs.