From Content Bottleneck to Productivity Powerhouse: How atronous Supercharged a Global Distributor

In today’s fast-paced global marketplace, high-quality product content is the lifeblood of online shopping and effective sales. It informs customers, builds trust, and ultimately drives conversions. But what happens when the very source of that vital content becomes a major roadblock?

That’s precisely the challenge a leading global distributor recently faced. Their product catalog boasted thousands of SKUs, each requiring detailed descriptions, compelling visuals, and essential documentation. The problem: much of this information was buried in complex data sheets and PDF manuals on the manufacturer’s website.

Imagine the sheer manual effort involved. The distributor’s initial plan was daunting: hire dedicated resources to painstakingly browse the manufacturer’s site, copy-paste product details, download images and PDFs, and manually input everything into an Excel file. For over 2,000 products! This approach was not only time-consuming and prone to errors, but also a significant drain on resources and a major bottleneck in their go-to-market strategy.

Enter atronous: Turning Obstacles into Opportunities

Recognizing the inefficiency and scalability issues inherent in this manual process, the distributor turned to atronous for a smarter solution. Our team specializes in streamlining product content management, and we were eager to tackle this challenge head-on.

Here’s how atronous helped this global distributor break free from their content bottleneck and unlock significant productivity gains:

1. Intelligent Web Data Extraction and Parsing

Instead of a generic website crawl, we employed targeted web ingestion techniques using libraries and frameworks optimized for navigating and extracting data from complex website structures. This involved:

  • Comprehending the semantic content of the page: We employed computer vision models capable of interpreting the page content to extract pertinent details like UPC codes, manufacturer part numbers, and other attributes.
  • Using content selectors as a fallback: We utilized browser developer tools and techniques like XPath and CSS selectors to pinpoint specific HTML elements containing product titles, descriptions, image URLs, PDF links, and attribute tables.
  • Robust Error Handling: We built mechanisms to gracefully handle website changes, broken links, and inconsistent data structures, minimizing disruptions and ensuring data integrity.
  • Data Parsing and Structuring: Once extracted, the raw HTML data was parsed and structured into a usable format. This involved:
    • Text Extraction and Cleaning: Eliminating unnecessary HTML tags, scripts, and styles to focus on the essential product details. Utilizing text cleansing methods to address formatting inconsistencies, spacing issues, and special characters.
    • Attribute Identification and Extraction: Developing logic to identify and extract key product attributes often presented in tables or structured lists. This involved pattern recognition and data normalization to ensure consistency across different product pages.
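As an illustrative sketch of the selector-based extraction above: given a simplified product page, XPath-style lookups pull out the title, attribute table, and PDF links. The markup and field names here are hypothetical, and a production crawler would use an HTML-tolerant parser such as lxml or BeautifulSoup; the strict stdlib parser below only works because the snippet is well-formed.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical product page; real pages vary per site.
HTML = """
<html><body>
  <h1 class="product-title">Cordless Drill X100</h1>
  <table id="attributes">
    <tr><td>UPC</td><td>012345678905</td></tr>
    <tr><td>MPN</td><td>X100-DRL</td></tr>
  </table>
  <a class="pdf" href="/docs/x100-manual.pdf">Manual</a>
</body></html>
"""

def extract_product(html: str) -> dict:
    root = ET.fromstring(html)
    # XPath-style lookups pinpoint the elements holding product data.
    title = root.find(".//h1[@class='product-title']").text
    attrs = {
        row[0].text: row[1].text
        for row in root.find(".//table[@id='attributes']")
    }
    pdf_links = [a.get("href") for a in root.findall(".//a[@class='pdf']")]
    return {"title": title, "attributes": attrs, "pdf_links": pdf_links}

record = extract_product(HTML)
print(record["title"])              # Cordless Drill X100
print(record["attributes"]["UPC"])  # 012345678905
```

When a site changes its layout, only the selector strings need updating, which is what makes this approach maintainable at scale.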

2. Parsing Data and Images from PDF Files

When the manufacturer’s website hosted product manuals or specifications as PDF files, our ingestion process was extended to incorporate specialized PDF parsing techniques. Here’s a breakdown of the technical approaches involved:

  • Link Identification and Downloading: Our web ingestion engine would identify links to PDF files within the manufacturer’s website, typically within product pages or dedicated resources sections. These links would be extracted, and the corresponding PDF files downloaded and stored securely in dedicated cloud locations.
  • PDF Content Extraction – Choosing the Right Libraries and Tools: Depending on the complexity and structure of the PDFs, atronous utilized various open-source and commercial libraries and tools specifically designed for PDF processing in languages like Python. For more details on this step, check out our technical blog.
  • Table Data Extraction:
    • Heuristic-Based Table Detection: Algorithms analyze the layout of the PDF to identify potential table structures based on the presence of lines, spacing, and repeated patterns.
    • Structured Data Extraction: Once a table is identified, more sophisticated parsing techniques are used to extract the data within the rows and columns, often requiring handling of merged cells and complex table layouts.
    • Conversion to Structured Formats: The extracted table data is then typically converted into structured formats like lists of dictionaries or CSV-like structures for easier mapping and integration.
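The conversion step above can be sketched in a few lines: raw rows from a PDF table parser become a list of dictionaries, with empty cells forward-filled as a simple stand-in for merged-cell handling (the sample rows are illustrative):

```python
def rows_to_records(rows):
    """Convert raw table rows (first row = header) into a list of dicts.
    Empty cells inherit the value above them, a simple way to handle
    merged cells that span multiple rows."""
    header, *body = rows
    records, previous = [], {}
    for row in body:
        record = {
            key: (cell if cell else previous.get(key, ""))
            for key, cell in zip(header, row)
        }
        records.append(record)
        previous = record
    return records

# Rows as a PDF table parser might emit them ("" marks a merged cell).
raw = [
    ["Model", "Voltage", "Weight"],
    ["X100",  "18 V",    "1.2 kg"],
    ["",      "20 V",    "1.3 kg"],  # model cell merged with the row above
]
print(rows_to_records(raw))
```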

3. Automated Data Transformation and Template Mapping

Data Normalization and Cleansing

  • Before mapping to the distributor’s templates, the extracted data underwent a normalization and cleansing process. This included:
    • Data Type Conversion: Ensuring data fields matched the expected data types in the target templates (e.g., converting text-based numbers to numerical formats).
    • Units of Measure Standardization: If necessary, standardizing units of measure (e.g., converting inches to centimeters).
    • Data Validation: Implementing rules to identify and flag potentially incorrect or missing data based on predefined criteria.
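A minimal sketch of this normalization pass; the field names (weight_kg, length_unit) and the validation rules are illustrative, not the production schema:

```python
def normalize(record):
    """Apply type conversion, unit standardization, and validation flags."""
    out, issues = dict(record), []
    # Data type conversion: text-based numbers -> float.
    try:
        out["weight_kg"] = float(out["weight_kg"])
    except (TypeError, ValueError):
        issues.append("weight_kg: not numeric")
    # Units of measure: inches -> centimeters (1 in = 2.54 cm).
    if out.get("length_unit") == "in":
        out["length"] = round(float(out["length"]) * 2.54, 2)
        out["length_unit"] = "cm"
    # Validation: flag missing required fields.
    for field in ("upc", "title"):
        if not out.get(field):
            issues.append(f"{field}: missing")
    return out, issues

rec = {"title": "Armchair", "upc": "", "weight_kg": "4.5",
       "length": "30", "length_unit": "in"}
clean, issues = normalize(rec)
print(clean["length"], clean["length_unit"])  # 76.2 cm
print(issues)                                 # ['upc: missing']
```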

Template Mapping Engine

  • We developed a flexible mapping engine that allowed us to connect the extracted and normalized data fields to the specific columns and structures of the distributor’s upload templates (typically CSV or Excel). This involved:
    • Configurable Mapping Rules: Providing an interface or configuration files to define the relationships between source data fields and target template fields.
    • Data Transformation Logic: Implementing custom logic within the mapping engine to perform more complex data transformations if required (e.g., combining multiple source fields into a single target field).
    • Batch Processing and Automation: Automating the process of applying the mapping rules to the extracted data in batches, generating the correctly formatted output files ready for upload into the distributor’s systems (e.g., PIM, ERP, e-commerce platform).
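A simplified sketch of such a mapping engine: rules map each target column either to a source field or to a callable that combines several fields, and a batch is written out as CSV. All names here are illustrative; the production engine loads its rules from configuration files.

```python
import csv, io

# Mapping rules: target column -> source field name, or a callable
# combining several source fields (names here are illustrative).
MAPPING = {
    "SKU":         "mpn",
    "Title":       "title",
    "Description": lambda r: f"{r['title']} - {r['summary']}",
}

def apply_mapping(records, mapping):
    for r in records:
        yield {col: (rule(r) if callable(rule) else r.get(rule, ""))
               for col, rule in mapping.items()}

def to_csv(records, mapping):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(mapping))
    writer.writeheader()
    writer.writerows(apply_mapping(records, mapping))
    return buf.getvalue()

batch = [{"mpn": "X100-DRL", "title": "Cordless Drill", "summary": "18 V"}]
print(to_csv(batch, MAPPING))
```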

4. Multilingual Translation Integration

  • API-Driven Translation Services: To handle the translation of product titles and descriptions into Spanish and French, we integrated with reputable cloud-based translation APIs (e.g., Google Translate API, Amazon Translate).
  • Automated Translation Workflow: The extracted titles and descriptions were automatically sent to the chosen translation API. The translated text was then received and incorporated back into the processed data.
  • Quality Considerations: While the workflow was fully automated, we often recommended and implemented options for human review of translations, especially for critical marketing copy, to ensure accuracy and cultural relevance.
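The workflow above can be sketched with a pluggable translation backend. The `translate` callable would wrap a cloud translation API in production; here a tiny glossary stands in so the example is self-contained.

```python
def translate_batch(records, translate, targets=("es", "fr")):
    """Send title/description of each record to a translation backend and
    merge the results back in. `translate` is any callable with the shape
    translate(text, target_lang) -> str, e.g. a thin wrapper around a
    cloud translation API."""
    for rec in records:
        for lang in targets:
            rec[f"title_{lang}"] = translate(rec["title"], lang)
            rec[f"description_{lang}"] = translate(rec["description"], lang)
    return records

# Stand-in backend for the sketch; in production this would call the API.
GLOSSARY = {("Armchair", "es"): "Sillón", ("Armchair", "fr"): "Fauteuil"}
fake_translate = lambda text, lang: GLOSSARY.get((text, lang), text)

out = translate_batch(
    [{"title": "Armchair", "description": "Solid oak frame"}], fake_translate)
print(out[0]["title_es"])  # Sillón
```

Because the backend is injected, the same workflow code can route critical marketing copy to a human-review queue instead of the API.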

Technical Benefits and Impact

By leveraging these technical capabilities, atronous provided a solution that was:

  • Scalable: Capable of handling large volumes of product data and adapting to future growth in the product catalog.
  • Efficient: Automating the data acquisition and preparation processes, significantly reducing manual effort and time.  
  • Accurate: Minimizing the risk of human error associated with manual data entry.
  • Integrated: Seamlessly fitting into the distributor’s existing systems and workflows through template-based output.
  • Global-Ready: Our automated translation capabilities set the stage for expansion into new markets.

Conclusion

The 80% productivity boost wasn’t just a headline: it was the tangible outcome of a powerful, automated technical infrastructure. By replacing a manual, resource-heavy process with an efficient, scalable catalog creation system, the distributor’s teams were freed from the grunt work and empowered to focus on high-value tasks like content approval.

How atronous.ai is revolutionizing Product Taxonomy with ML and Embeddings

In the ever-evolving world of e-commerce, organizing and classifying products correctly is essential—not only for seamless navigation and discovery but also for maximizing conversion rates. Recently, atronous.ai deployed a cutting-edge machine learning pipeline that brings automation, intelligence, and precision to product taxonomy assignments, reducing manual effort and inconsistencies.

🧠 The Challenge: From Chaos to Classification

Retailers often receive product data in various formats—images, URLs, spreadsheets, or PDFs—from a wide range of suppliers. These items lack standardized taxonomies or come with misaligned categories that don’t fit the retailer’s existing classification system. This leads to a fragmented shopping experience and lost revenue opportunities.

🔍 Step 1: Identifying the Object from an Image URL

Using advanced computer vision models, the platform ingests product image URLs and identifies the object type in the photo. For example, in a recent engagement with a leading retailer, the platform processed entries such as:

https://cdn.shopify.com/s/files/…/Armchairs-Living-Room.jpg

Using a combination of convolutional neural networks (CNNs) and pretrained models (like CLIP or BLIP), the system outputs a probable object class like Armchair, Sofa, or Barstool, without any manual labeling.

🧬 Step 2: Training on the Retailer’s Current Taxonomy

What makes Atronous truly adaptive is its ability to learn from a retailer’s own classification structure. By training on the existing taxonomy tree—whether pulled from a Shopify store, ERP system, or internal database—the ML engine understands how categories are organized and how products typically flow through them.

For this retailer’s use case, Atronous automatically mapped these detected objects to specific categories such as:
– Furniture → Seating → Armchairs
– Furniture → Living Room → Sofas
– Furniture → Dining → Barstools

🧩 Step 3: Assigning the Best Taxonomy Match

Once trained, the system uses similarity scoring between image embeddings and taxonomy embeddings to assign each product to the most appropriate category.

For instance:
– An image classified as Barstool with mid-century aesthetics was assigned to Furniture > Dining > Barstools with over 95% confidence.
– A decorative stool misclassified in the source data was automatically reclassified as an Accent Piece based on visual and semantic similarity.

This not only improves classification accuracy but allows the taxonomy to adapt dynamically to the retailer’s style and structure.
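A toy sketch of that similarity scoring, with 3-dimensional vectors standing in for the real image and taxonomy embeddings produced by the encoders:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings; real ones come from the vision/text encoders.
TAXONOMY = {
    "Furniture > Dining > Barstools":  [0.9, 0.1, 0.0],
    "Furniture > Seating > Armchairs": [0.1, 0.9, 0.1],
    "Furniture > Living Room > Sofas": [0.0, 0.2, 0.9],
}

def assign(product_embedding, taxonomy=TAXONOMY):
    """Return the taxonomy path with the highest cosine similarity."""
    scores = {path: cosine(product_embedding, emb)
              for path, emb in taxonomy.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

path, confidence = assign([0.85, 0.15, 0.05])
print(path)  # Furniture > Dining > Barstools
```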

🧾 Bonus Step: Taxonomy Auditing & Alternative Suggestions

In a powerful fourth step, atronous audited the current taxonomy assignments and flagged inconsistencies or misclassifications. It also suggested potential improvements.

For example:
– Products categorized under Chairs that visually match Recliners were highlighted.
– Items split across Living Room Decor and Accent Chairs were recommended for consolidation or reclassification.

These suggestions are provided in an easy-to-review spreadsheet or directly synced into PIM systems.
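A minimal sketch of such an audit pass; the field names and the confidence threshold are illustrative:

```python
def audit(items, min_confidence=0.8):
    """Flag items whose model-predicted category disagrees with the
    category currently assigned in the source data."""
    flags = []
    for item in items:
        if (item["predicted"] != item["assigned"]
                and item["confidence"] >= min_confidence):
            flags.append({
                "sku": item["sku"],
                "assigned": item["assigned"],
                "suggested": item["predicted"],
                "confidence": item["confidence"],
            })
    return flags

catalog = [
    {"sku": "A1", "assigned": "Chairs", "predicted": "Recliners", "confidence": 0.93},
    {"sku": "A2", "assigned": "Sofas",  "predicted": "Sofas",     "confidence": 0.99},
]
for f in audit(catalog):
    print(f"{f['sku']}: {f['assigned']} -> {f['suggested']} ({f['confidence']:.0%})")
```

Each flagged row carries the suggested category and its confidence, which is exactly what a reviewer needs in the exported spreadsheet.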

🚀 The Impact

This AI-first approach has transformed how retailers handle product onboarding:
– 90% reduction in manual taxonomy assignment time
– 30% increase in accuracy of product categorization
– Improved shopper UX via cleaner navigation and product grouping

As product catalogs grow and become more complex, automation like this isn’t just helpful; it becomes mission-critical.

Ready to Streamline Your Taxonomy?

atronous.ai is bringing intelligence to the very core of digital merchandising. If your team is struggling with inconsistent categorization, long onboarding cycles, or messy product feeds, we’d love to help.

Let’s talk. 💬

Schematic Extraction v0.1.0: Architecture, Enhancements, and Future Work

Schematics are the backbone of the retail industry, but extracting them from PDFs remains a persistent challenge. Marketers, merchandisers, and product developers often waste valuable time navigating through cluttered documents to find key diagrams such as shelf layouts, packaging designs, or product schematics. This inefficiency not only impacts project deadlines but also drives up costs.

At atronous.ai, we’ve developed the PDF Schematic Extractor, an automated solution that precisely extracts and labels schematics from PDFs, URLs, or HTML files. By streamlining this process, we help teams save time, improve accuracy, and accelerate product development, all while reducing the risk of costly delays.

(Figure: on the left, a complex page from a product document; on the right, the clean schematic we extracted.)

Overview of the PDF Schematic Extractor

The PDF Schematic Extractor is a Python package designed to convert files (URLs, HTML) into PDFs and extract schematics using PyMuPDF, OpenCV, Tesseract, and Gemini for labeling. It supports both command-line and Streamlit web-based interfaces, enabling users to upload files, extract schematics, and download results as ZIP files. Recent enhancements have focused on improving the schematic extraction process, addressing accuracy, efficiency, and usability challenges.

Workflow of the PDF Schematic Extractor

The project follows a structured pipeline to process inputs and extract schematics, as illustrated in the flowchart:

1. Input Handling: The tool accepts three types of inputs – URLs, HTML files/strings, or PDFs. URLs and HTML inputs are processed using the url_to_pdf and html_to_pdf functions in core.py, which leverage wkhtmltopdf to convert them into PDFs. If the input is already a PDF, it is directly used for further processing.

2. PDF to Image Conversion: The PDF (whether input directly or converted from URL/HTML) is converted into images using PyMuPDF in the “Convert PDF into Images” step. Each page of the PDF is rendered as an individual image to facilitate image-based processing.

3. Image Processing and Text Detection: Each image undergoes preprocessing in the “Image Processing” step to enhance quality (e.g., adjusting contrast). Tesseract OCR is then applied in the “OCR Text Extraction” step to detect text, and a “Filter Text-Heavy Areas” step identifies and excludes regions dominated by text, ensuring focus on schematic-like areas.

4. Schematic Detection and Extraction: The “Contour Detection” step uses OpenCV to identify potential schematic regions by detecting contours. These contours are grouped in the “Group Contours” step to form cohesive schematic entities, reducing fragmentation. The “Extract Schematics from Images” step isolates these regions as individual schematic images.

5. Labeling with Gemini: The “Schematic Validation” step uses Gemini to confirm that extracted regions are indeed schematics. Nearby text is extracted in the “Get Nearby Text” step, and Gemini generates specific labels in the “Generate Specific Labels with Gemini” step, providing context-aware labels (e.g., “Circuit Diagram: Power Supply Unit”).

6. Output Generation: The extracted schematics are saved as images in the “Schematic Images” step. A “Debugging with Bounding Boxes” option overlays bounding boxes and labels on the images for user review. The final output, including metadata (e.g., page number, label), is saved as a JSON file and can be downloaded as a ZIP file via the web interface.
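The six steps above can be sketched as one orchestration function. The four callables stand in for the PyMuPDF, OpenCV/Tesseract, and Gemini stages, stubbed here only to show the data flow; this is not the package’s actual API.

```python
def extract_schematics(pdf_path, render, preprocess, find_regions, label):
    """Run the pipeline: render pages to images, preprocess each image,
    detect schematic regions, and label each region."""
    results = []
    for page_no, image in enumerate(render(pdf_path), start=1):
        clean = preprocess(image)
        for region in find_regions(clean):
            results.append({"page": page_no, "bbox": region,
                            "label": label(image, region)})
    return results

# Wire it up with trivial stand-ins to show the data flow.
out = extract_schematics(
    "manual.pdf",
    render=lambda p: ["img1", "img2"],            # one fake image per page
    preprocess=lambda img: img,                    # contrast adjustment, etc.
    find_regions=lambda img: [(0, 0, 100, 100)],   # contour detection + grouping
    label=lambda img, region: "Circuit Diagram",   # Gemini validation + labeling
)
print(out[0])
```

The per-region metadata (page number, bounding box, label) is what ends up in the JSON output described above.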

Enhancements to Schematic Extraction

1. Improved Detection Accuracy with Contour Grouping and Filtering: The initial schematic detection often missed or fragmented schematics in complex PDFs. We introduced a “Group Contours” step (as shown in the flowchart) to cluster nearby contours into cohesive schematic entities based on proximity and size. Additionally, a “Filter Text-Heavy Areas” step using Tesseract OCR excludes text-dominated regions, ensuring focus on schematic-like areas (e.g., technical drawings). This has boosted detection accuracy, especially in mixed-content PDFs.
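The grouping idea can be illustrated in pure Python: bounding boxes whose inflated extents overlap are merged until the clusters stabilize. The real step operates on OpenCV contours; plain (x0, y0, x1, y1) tuples stand in here.

```python
def group_boxes(boxes, gap=20):
    """Greedily merge bounding boxes (x0, y0, x1, y1) whose extents,
    inflated by `gap` pixels, overlap - clustering fragments of one
    schematic into a single region."""
    boxes = [list(b) for b in boxes]
    merged = True
    while merged:
        merged = False
        out = []
        while boxes:
            a = boxes.pop()
            i = 0
            while i < len(boxes):
                b = boxes[i]
                # Boxes are "near" if they overlap once inflated by `gap`.
                if (a[0] - gap <= b[2] and b[0] - gap <= a[2]
                        and a[1] - gap <= b[3] and b[1] - gap <= a[3]):
                    a = [min(a[0], b[0]), min(a[1], b[1]),
                         max(a[2], b[2]), max(a[3], b[3])]
                    boxes.pop(i)
                    merged = True
                else:
                    i += 1
            out.append(a)
        boxes = out
    return [tuple(b) for b in boxes]

# Three fragments of one diagram plus one distant contour.
fragments = [(10, 10, 60, 60), (70, 10, 120, 60),
             (65, 65, 110, 120), (400, 400, 450, 450)]
print(group_boxes(fragments))  # two groups
```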

2. Enhanced Labeling with Gemini Integration: Labeling previously suffered from inconsistent OCR results. By integrating Gemini for “Schematic Validation” and “Generate Specific Labels with Gemini,” we improved label precision. Gemini validates schematic regions and generates context-aware labels by analyzing nearby text and visual content. For example, a circuit diagram might now be labeled as “Circuit Diagram: Power Supply Unit” instead of “Diagram,” enhancing usability for downstream applications like cataloging.

3. Optimized Preprocessing for Faster Extraction: The original preprocessing pipeline was slow for large PDFs due to redundant conversions. We optimized the “Image Processing” and “Convert PDF into Images” stages by implementing parallel processing for multi-page PDFs and reducing intermediate image resolution without quality loss. The “Extract Schematics from Images” step now uses adaptive thresholding (via the --threshold parameter), cutting processing time while maintaining accuracy.

4. User Customization and Debugging Support: To address “No Schematics Extracted” issues, we enhanced configurability with tunable parameters (--min-area, --padding, --threshold) via the CLI. A “Debugging with Bounding Boxes” output option was added, generating schematic images with overlaid bounding boxes and labels. This transparency helps users debug and adjust extraction settings effectively.
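A sketch of that CLI surface with argparse; the flag names follow the post, while the defaults and help strings are illustrative:

```python
import argparse

def build_parser():
    """CLI surface for the extraction tunables described above."""
    p = argparse.ArgumentParser(prog="schematic-extractor")
    p.add_argument("--min-area", type=int, default=5000,
                   help="ignore contours smaller than this many pixels")
    p.add_argument("--padding", type=int, default=10,
                   help="pixels of margin added around each extracted region")
    p.add_argument("--threshold", type=int, default=200,
                   help="binarization threshold used before contour detection")
    p.add_argument("--debug-boxes", action="store_true",
                   help="also emit images with bounding boxes and labels overlaid")
    return p

args = build_parser().parse_args(["--min-area", "3000", "--debug-boxes"])
print(args.min_area, args.threshold, args.debug_boxes)  # 3000 200 True
```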

Future Possible Developments

1. Real-Time Processing for Web Interface

The Streamlit web app processes PDFs in batch mode, which is slow for large files. Implementing real-time processing, extracting and displaying schematics as each page is processed, would improve user experience. This could leverage asynchronous processing and streaming inputs for faster feedback.

2. Enhanced Labeling with Contextual Understanding and Local LLMs/SLMs

While Gemini has improved labeling, it can misinterpret context due to limited document-wide understanding and reliance on an external API. Future work could integrate a natural language processing (NLP) model to analyze the entire PDF’s text, providing contextual cues (e.g., identifying the document’s domain as electrical engineering) to improve labeling accuracy. Additionally, incorporating local large language models (LLMs) or small language models (SLMs) such as LLaMA, Phi-4-multimodal, or DistilBERT would enable on-device processing, reducing dependency on external APIs like Gemini. This would enhance privacy, lower latency, and allow offline functionality. For instance, a local SLM could be fine-tuned on technical terminology to label a schematic as “Transistor Layout” with higher precision, even in resource-constrained environments.

3. Support for 3D Schematics and Interactive Outputs

Modern PDFs often include 3D or interactive schematics. Extending the tool to detect and extract these, potentially rendering them as interactive 3D models in the web app using libraries like Three.js, would add value. This requires algorithms to parse embedded 3D data in PDFs and render it dynamically.

4. Cross-Platform Compatibility and Cloud Deployment

Manual dependency installation (e.g., wkhtmltopdf, Tesseract) can be challenging for users. Containerizing the application with Docker would simplify cross-platform deployment. Additionally, deploying the tool as a cloud service on AWS or Google Cloud would enable users to process PDFs without local setup, increasing accessibility.

Conclusion

The schematic extraction process in the PDF Schematic Extractor has been significantly enhanced through improved detection, labeling, preprocessing, and user support. These changes have made the tool more accurate, efficient, and user-friendly. Looking ahead, integrating machine learning, real-time processing, local LLMs/SLMs, and cloud deployment can further elevate the tool’s capabilities, making it a versatile solution for schematic extraction across diverse use cases.

For questions/feedback feel free to reach out: yagnesh.mangali@atronous.ai