How to build a data-scraping agent

I want to build a simple agent that searches Google and finds websites, for example the top ten HVAC service providers in New York.
Then the agent should get the email, phone, name, and website URL of those HVAC service providers.
Then it needs to format the data as JSON and send it to a REST endpoint.

How can I build it?
By the way, the Discord link is not working.
Thank you

Hi, can you try this URL for Discord? Quick QR Art

Example template: Agenticflow AI | AgenticFlow

Here’s a step-by-step breakdown using the best nodes/MCPs:

  1. Search Google:
  • Start your workflow with the Google Search node.

  • Input: Set the search_query to something like “top 10 HVAC service providers New York”.

  • Output: This will give you a list of search results, including URLs.
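
To make the shape of this step concrete, here is a minimal sketch of pulling URLs out of the search results. The `{"title": ..., "url": ...}` field names are an assumption about the node's output, not its documented schema, so check the actual output of your Google Search node.

```python
# Assumed shape of the Google Search node's output: a list of dicts
# with "title" and "url" keys (field names are an assumption).

def extract_urls(search_results):
    """Pull just the URLs out of a list of search-result dicts."""
    return [r["url"] for r in search_results if "url" in r]

sample_results = [
    {"title": "Best HVAC Companies in NYC", "url": "https://example-directory.com/hvac"},
    {"title": "Acme HVAC - New York", "url": "https://acmehvac.example.com"},
]
urls = extract_urls(sample_results)
```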

  2. Identify Business Websites (Optional but Recommended):

  • Sometimes Google results include directories, ads, etc. You might want to add an LLM node here.

  • Input: Feed the search_results from the Google Search node.

  • Prompt: “From this list of search results, identify and return only the URLs that seem to be the official websites of actual HVAC service provider businesses in New York.”

  • Output: A cleaner list of target website URLs.

  3. Scrape Each Website:
  • You’ll likely need to process each URL individually. While our basic Web Scraping node might work for some, scraping 10 different sites reliably often requires a more robust tool.

  • Recommended: Use the Apify MCP (https://agenticflow.ai/mcp/apify). You’d likely loop through the list of URLs from the previous step and use Apify’s “Website Content Crawler” Actor (or similar) via the MCP’s “Run Actor” action to get the text content from key pages (like Homepage, Contact Us) of each site. Note: This requires an Apify account and potentially credits there.
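
Whichever crawler you use, you will usually want the Contact page as well as the homepage. As a sketch (using only the Python standard library, not the Apify SDK), here is one way to find "contact" links in a page's HTML so they can be queued for crawling:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ContactLinkFinder(HTMLParser):
    """Collect hrefs of links whose text or href mentions 'contact'."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # Match on either the anchor text or the href itself
        if self._current_href and "contact" in (data.lower() + self._current_href.lower()):
            self.links.append(self._current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None

def find_contact_pages(base_url, html):
    parser = ContactLinkFinder()
    parser.feed(html)
    # Resolve relative hrefs like "/contact-us" against the site's base URL
    return [urljoin(base_url, href) for href in parser.links]
```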

  4. Extract Contact Info & Format JSON:
  • Add another LLM node.

  • Input: Feed the scraped text content from one website (within the loop from the previous step).

  • Prompt: “From the following website content: {{scraped_content}}, extract the business name, their primary email address, their main phone number, and the website URL. Return this information ONLY as a single JSON object like this: {"name": "…", "email": "…", "phone": "…", "url": "…"}. If any piece of information is not found, use null as the value.”

  • Output: A JSON object for each company. You’ll need to aggregate these JSON objects (often done in the next step or using list manipulation features if processing in a loop). We are improving list handling in workflows.
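
As a sanity check on the LLM's output (or a cheap fallback), emails and US-style phone numbers can also be pulled out with regexes. Real pages are messier than these patterns assume, so treat this as a complement to the LLM node, not a replacement:

```python
import re
import json

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def extract_contact(name, url, text):
    """Build one record in the same shape the LLM prompt asks for."""
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        "name": name,
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
        "url": url,
    }

# Aggregate one record per site into the final payload
records = [extract_contact(
    "Acme HVAC", "https://acmehvac.example.com",
    "Call (212) 555-0123 or email info@acmehvac.example.com",
)]
payload = json.dumps(records)
```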

  5. Send to REST Endpoint:
  • Add the API Call node - search for the specific node if needed.

  • Input: Configure the node with your target REST endpoint URL, set the method to POST (or whatever your endpoint requires), and map the aggregated JSON data (from step 4) to the body input. Configure any necessary headers (like Content-Type: application/json or authentication tokens).

  • Action: This node will send the structured JSON data to your specified endpoint.
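
Under the hood this step is just an HTTP POST with a JSON body and the right headers. A standard-library sketch (the endpoint URL and bearer token are placeholders; substitute your own):

```python
import json
import urllib.request

def build_post_request(endpoint, records, token=None):
    """Build a POST request carrying the aggregated records as JSON."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    data = json.dumps(records).encode("utf-8")
    return urllib.request.Request(endpoint, data=data, headers=headers, method="POST")

req = build_post_request(
    "https://api.example.com/leads",  # placeholder endpoint
    [{"name": "Acme HVAC", "email": None, "phone": None,
      "url": "https://acmehvac.example.com"}],
)
# urllib.request.urlopen(req)  # uncomment to actually send
```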

Building It: You’d assemble these nodes sequentially in the Workflow builder canvas, connecting the output of one step to the input of the next. The looping for scraping/extracting per website might currently require a slightly more advanced setup (e.g., triggering the workflow multiple times via API for each URL) until more sophisticated loop/iteration nodes are released.
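
The per-URL triggering workaround mentioned above can be sketched as a small driver script: one workflow run per website URL. The trigger endpoint and the `website_url` input name are placeholders; check your workflow's actual API URL and input schema:

```python
import json
import urllib.request

def trigger_workflow_per_url(trigger_endpoint, urls):
    """Build one POST request per URL to fire a workflow run for each site."""
    requests_out = []
    for url in urls:
        body = json.dumps({"website_url": url}).encode("utf-8")  # input name is an assumption
        req = urllib.request.Request(
            trigger_endpoint, data=body,
            headers={"Content-Type": "application/json"}, method="POST",
        )
        requests_out.append(req)
        # urllib.request.urlopen(req)  # uncomment to actually fire each run
    return requests_out

reqs = trigger_workflow_per_url(
    "https://api.example.com/workflows/run",  # placeholder trigger endpoint
    ["https://acmehvac.example.com", "https://coolair.example.com"],
)
```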

Let me know if you hit any snags putting this together!