Automating entity schema generation

Blog / Automating entity schema generation

Structured data, particularly in the form of schema markup, has become essential for ensuring that search engines like Google can interpret content correctly and present it in an enriched format. But creating structured data manually can be time-consuming and prone to error. That's where automation comes in.

Our team recently developed a tool designed to automate schema generation by leveraging the power of OpenAI's GPT-4 model. This streamlines the process, making it easier for websites to implement the latest SEO best practices without the manual grind. Let’s dive into how it works and what makes this tool a game-changer for SEO professionals.

How the Schema Generator Works
The core functionality of this tool lies in its ability to automatically generate two key types of schema:

About Schema – This schema is used to describe the primary entities that a webpage is about, based on its title, meta description, and headings.
Mentions Schema – This schema extracts entities mentioned throughout the entire page's content, ensuring thorough coverage of all relevant information.
Here’s a step-by-step breakdown of how our tool works:

1. Extracting Text from Web Pages
The tool first extracts key pieces of text from a webpage, whether that’s the title, meta description, or heading for an “about” schema or the entire page for a “mentions” schema. This ensures that we capture the most contextually relevant content.

2. Feeding Text into the GPT-4o-mini Model
Next, this text is used as a prompt for the GPT-4o-mini model, which is tasked with identifying the necessary entities. Using OpenAI's structured output feature, the model generates a neatly parsed list of entities, each accompanied by a short description.

3. Embedding Entities for Semantic Understanding
The entity names and their descriptions are then passed to an embedding model. This model transforms the entities into vectors—numerical representations of their semantic meaning. The goal here is to ensure that the tool can "understand" the entities in context, rather than merely processing them as strings of text.

4. Matching Entities with Wikipedia Pages
The tool calculates the dot product between the entity vectors and a matrix of vectors, each representing a Wikipedia page. If the highest match exceeds a pre-defined threshold, the tool accepts that Wikipedia page as a valid link for the entity.

5. Formatting into Schema Template
Finally, the validated entities and their Wikipedia links are formatted into a structured schema template and returned, ready to be implemented on your website.

Overcoming Challenges in Automation
Building this tool required overcoming several technical challenges. Here are a few noteworthy hurdles:

Extracting Only Visible Text: To avoid processing irrelevant data, our tool focuses on extracting only the visible text on the webpage. This requires interpreting the CSS of the page to ensure we're not picking up hidden elements, which is done by emulating a browser environment.

Ensuring Consistent Entity Extraction: While GPT-4 is highly capable, it’s not always guaranteed to follow the specific rules outlined in the prompt. This required careful prompt engineering and fine-tuning the model for better reliability.

Handling Wikipedia's Massive Database: Given that storing vectors for every Wikipedia page would demand an enormous amount of memory, we opted to use fewer dimensional vectors (e.g., 768 dimensions) and limit our scope to a subset of Wikipedia, such as Simple English Wikipedia. This helps streamline the process, although it comes with a slight trade-off in accuracy.

Why This Tool Matters for SEO
Structured data is the backbone of modern SEO. By enabling search engines to better understand your content, you improve the chances of your website appearing in rich results like knowledge panels, featured snippets, and more. However, manually creating schema for each page can be resource-intensive.

Our tool not only automates this process but does so with high accuracy, ensuring that your entities are correctly identified and matched with relevant Wikipedia links. With this streamlined approach, SEO teams can save countless hours while boosting their site's search visibility.

Final Thoughts
In a world where digital presence matters more than ever, automating SEO processes like schema generation can provide a competitive edge. Our new tool takes the heavy lifting out of creating structured data, ensuring that your site remains optimized, visible, and ready for the future of search.

Stay tuned as we continue to refine this tool and expand its capabilities. The future of SEO automation is here, and it's smarter than ever!

This blog post outlines the innovation and functionality behind our automated schema generator tool, ensuring the information is digestible for SEO professionals. Let me know if you'd like to tweak any section!

Previous Blog

Macaroni impact

Blink SEO
We have 14 clients with at least 20 actions implemented as of 1st May 2024. 'martins_chocolatier': 456, 'moderntribe': 319, 'erogenos': 216, 'ukmedi': 118, 'wwmake': 75, 'kokoso': 69, 'bigjigs_toys': 43, 'nuts_pick':...

Macaroni impact

Blink SEO
We have 14 clients with at least 20 actions implemented as of 1st May 2024. 'martins_chocolatier': 456, 'moderntribe': 319, 'erogenos': 216, 'ukmedi': 118, 'wwmake': 75, 'kokoso': 69, 'bigjigs_toys': 43, 'nuts_pick':...

3 min read

Back to blog

Blog / Automating entity schema generation

Macaroni impact

Macaroni impact