hidekazu-konishi.com

Using Amazon Bedrock for titling, commenting, and OCR (Optical Character Recognition) with Amazon Nova Pro


Previously, I introduced examples of using Amazon Bedrock for image titling, commentary, and OCR (Optical Character Recognition) with Anthropic Claude 3.5 Sonnet (v1).

Using Amazon Bedrock for titling, commenting, and OCR (Optical Character Recognition) with Claude 3.5 Sonnet

This time, I will introduce examples of using Amazon Bedrock for image titling, commentary, and OCR (Optical Character Recognition) with Amazon Nova Pro.

* The source code published in this article and other articles by this author was developed as part of independent research and is provided 'as is' without any warranty of operability or fitness for a particular purpose. Please use it at your own risk. The code may be modified without prior notice.
* This article was written using AWS services on a personally registered AWS account.
* The Amazon Bedrock models used in the writing of this article were executed on 2024-12-24 (JST) and are subject to the following End User License Agreement (EULA) in effect at that time.
Amazon Nova Pro (amazon.nova-pro-v1:0): End user license agreement (EULA) (AWS Customer Agreement and Service Terms)

Overview of Amazon Nova


Overview of Parameters Specified for Amazon Nova Pro

The following example of executing invoke_model of bedrock-runtime with the AWS SDK for Python (Boto3) outlines the parameters specified for Amazon Nova Pro.
import boto3
import json
import os
import sys
import re
import base64
import datetime

region = os.environ.get('AWS_REGION')
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name=region)

def nova_pro_invoke_model(input_prompt, image_media_format=None, image_data_base64=None, model_params={}):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": input_prompt
                }
            ]
        }
    ]

    if image_media_format and image_data_base64:
        messages[0]["content"].insert(0, {
            "image": {
                "format": image_media_format, 
                "source": { 
                    "bytes": image_data_base64 
                }
            }
        })

    body = {
        "messages": messages,
        "inferenceConfig": { # all Optional
            "max_new_tokens": model_params.get('max_tokens', 5120), # greater than 0, equal or less than 5k (default: dynamic*)
            "temperature": model_params.get('temperature', 0.7), # greater then 0 and less than 1.0 (default: 0.7)
            "top_p": model_params.get('top_p', 0.9), # greater than 0, equal or less than 1.0 (default: 0.9)
            "top_k": model_params.get('top_k', 50) # 0 or greater (default: 50)
        }
    }

    response = bedrock_runtime_client.invoke_model(
        modelId='amazon.nova-pro-v1:0',
        contentType='application/json',
        accept='application/json',
        body=json.dumps(body)
    )

    response_body = json.loads(response.get('body').read())
    response_text = response_body["output"]["message"]["content"][0]['text']
    return response_text
For explanations of the general inference parameters for text models, such as temperature, top_p (topP), top_k (topK), and max_tokens_to_sample (maxTokens), please see the following article.

Basic Information about Amazon Bedrock with API Examples - Model Features, Pricing, How to Use, Explanation of Tokens and Inference Parameters

For more details on the parameters used with the Amazon Nova Pro model, please refer to the following AWS Document.
Reference: Complete request schema - Amazon Nova
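As a quick sanity check before invocation, the documented ranges above can be enforced on the inferenceConfig values. The following is a minimal sketch based on the ranges noted in the code comments; the helper name validate_inference_config is my own, not part of any SDK:

```python
def validate_inference_config(config: dict) -> dict:
    """Check inferenceConfig values against the ranges noted above."""
    max_new_tokens = config.get("max_new_tokens", 5120)
    if not 0 < max_new_tokens <= 5120:
        raise ValueError("max_new_tokens must be greater than 0 and at most 5120")
    temperature = config.get("temperature", 0.7)
    if not 0 < temperature < 1.0:
        raise ValueError("temperature must be greater than 0 and less than 1.0")
    top_p = config.get("top_p", 0.9)
    if not 0 < top_p <= 1.0:
        raise ValueError("top_p must be greater than 0 and at most 1.0")
    top_k = config.get("top_k", 50)
    if top_k < 0:
        raise ValueError("top_k must be 0 or greater")
    return config
```

Validating before the invoke_model call turns an out-of-range value into an immediate, descriptive error instead of a service-side validation exception.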

Ownership and Copyright of Content Generated by the Amazon Nova Pro Model

Based on the Amazon Nova Pro license, this section clarifies the ownership and copyright of content output by the model.

According to the "Intellectual Property" section of the Amazon Nova Micro, Lite, and Pro - AWS AI Service Cards, the provider, AWS, offers uncapped intellectual property (IP) indemnity coverage for outputs of generally available Amazon Nova models (users should verify the exact terms of the license):
AWS offers uncapped intellectual property (IP) indemnity coverage for outputs of generally available Amazon Nova models (see Section 50.10 of the AWS Service Terms). This means that customers are protected from third-party claims alleging IP infringement or misappropriation (including copyright claims) by the outputs generated by these Amazon Nova models. In addition, our standard IP indemnity for use of the Services protects customers from third-party claims alleging IP infringement (including copyright claims) by the Services (including Amazon Nova models) and the data used to train them.
Thus, as long as the content of the license is adhered to, the outputs generated by the model can be freely used by the user.

Architecture Diagram

For this trial, the setup was kept simple, since the aim was to observe variations in output through input/output adjustments and parameter tuning when invoking the Amazon Nova Pro model from an AWS Lambda function.
AWS services such as Amazon API Gateway and Amazon EventBridge could be used to input events into the AWS Lambda function, with event parameters adapted to the AWS resources used through mappings or transformers, or by adjusting the format on the AWS Lambda side.
Using Amazon Bedrock for titling, commenting, and OCR (Optical Character Recognition) with Amazon Nova Pro
In this configuration, titling, commentary, and OCR are requested from the Amazon Nova Pro model in three separate invocations.
This is because directing all three tasks in a single request was observed to diminish the output accuracy of each task.

Implementation Example

This time, an AWS Lambda function was implemented to execute invoke_model of bedrock-runtime using the AWS SDK for Python (Boto3).
Additionally, to observe variations in output for each model parameter, the model's main parameters were made adjustable via the event.
import boto3
import json
import os
import sys
import re
import base64
import datetime

region = os.environ.get('AWS_REGION')
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name=region)
s3_client = boto3.client('s3', region_name=region)

def get_format_from_media_type(media_type: str) -> str:
    if not media_type:
        return ''
    
    parts = media_type.split('/')
    if len(parts) < 2:
        return ''
    
    format_str = parts[-1].lower()
    
    return format_str

def nova_pro_invoke_model(input_prompt, image_media_format=None, image_data_base64=None, model_params={}):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "text": input_prompt
                }
            ]
        }
    ]

    if image_media_format and image_data_base64:
        messages[0]["content"].insert(0, {
            "image": {
                "format": image_media_format, 
                "source": { 
                    "bytes": image_data_base64 
                }
            }
        })

    body = {
        "messages": messages,
        "inferenceConfig": { # all Optional
            "max_new_tokens": model_params.get('max_tokens', 5120), # greater than 0, equal or less than 5k (default: dynamic*)
            "temperature": model_params.get('temperature', 0.7), # greater then 0 and less than 1.0 (default: 0.7)
            "top_p": model_params.get('top_p', 0.9), # greater than 0, equal or less than 1.0 (default: 0.9)
            "top_k": model_params.get('top_k', 50) # 0 or greater (default: 50)
        }
    }

    response = bedrock_runtime_client.invoke_model(
        modelId='amazon.nova-pro-v1:0',
        contentType='application/json',
        accept='application/json',
        body=json.dumps(body)
    )

    response_body = json.loads(response.get('body').read())
    response_text = response_body["output"]["message"]["content"][0]['text']
    return response_text

def lambda_handler(event, context):
    # Format of the input event
    #{
    #    "input_s3_bucket_name": "[Target Amazon S3 bucket to retrieve the image]",
    #    "input_s3_object_key": "[Target Amazon S3 object key to retrieve the image]",
    #    "output_s3_bucket_name": "[Amazon S3 bucket to output the result JSON]",
    #    "output_s3_object_key_prefix": "[Amazon S3 object key to output the result JSON]",
    #    "nova_pro_model_id": "[Model ID for Amazon Nova Pro]",
    #    "nova_pro_temperature": 0.7,
    #    "nova_pro_top_p": 0.9,
    #    "nova_pro_top_k": 50,
    #    "nova_pro_max_tokens": 5120,
    #    "image_title_prompt": "[Custom prompt for image_title]",
    #    "image_description_prompt": "[Custom prompt for image_description]",
    #    "image_ocr_prompt": "[Custom prompt for image_ocr]"
    #}

    result = {}
    try: 
        input_s3_bucket_name = event['input_s3_bucket_name']
        input_s3_object_key = event['input_s3_object_key']
        output_s3_bucket_name = event['output_s3_bucket_name']
        output_s3_object_key_prefix = event.get('output_s3_object_key_prefix', input_s3_object_key)
        
        model_params = {
            'model_id': event.get('nova_pro_model_id', 'amazon.nova-pro-v1:0'),
            'temperature': event.get('nova_pro_temperature', 0.7),
            'top_p': event.get('nova_pro_top_p', 0.9),
            'top_k': event.get('nova_pro_top_k', 50),
            'max_tokens': event.get('nova_pro_max_tokens', 5120)
        }

        image_title_prompt = event.get('image_title_prompt', 'Please provide a title for this image.')
        image_description_prompt = event.get('image_description_prompt', 'Please provide a brief description of the title for this image.')
        image_ocr_prompt = event.get('image_ocr_prompt', 'Please extract all text contained in this image.')

        s3_object = s3_client.get_object(Bucket=input_s3_bucket_name, Key=input_s3_object_key)
        image_media_type = s3_object['ContentType']
        image_media_format = get_format_from_media_type(image_media_type)
        image_data = s3_object['Body'].read()
        image_data_base64 = base64.b64encode(image_data).decode('utf-8')

        # Invoke Model for image_title
        input_prompt = f'{image_title_prompt} However, do not include your own commentary in the output; present the results in the following format:\nimage_title: <result>'
        image_title = nova_pro_invoke_model(input_prompt, image_media_format, image_data_base64, model_params).removeprefix('image_title:').removeprefix(' ')

        # Invoke Model for image_description
        input_prompt = f'{image_description_prompt} However, do not include your own commentary in the output; present the results in the following format:\nimage_description: <result>'
        image_description = nova_pro_invoke_model(input_prompt, image_media_format, image_data_base64, model_params).removeprefix('image_description:').removeprefix(' ')

        # Invoke Model for image_ocr
        input_prompt = f'{image_ocr_prompt} However, do not include your own commentary in the output; present the results in the following format:\nimage_ocr: <result>'
        image_ocr = nova_pro_invoke_model(input_prompt, image_media_format, image_data_base64, model_params).removeprefix('image_ocr:').removeprefix(' ')

        response_json = {
            "image_title": image_title,
            "image_description": image_description,
            "image_ocr": image_ocr
        }
        output_json = json.dumps(response_json).encode('utf-8')
        
        output_s3_object_key = f'{output_s3_object_key_prefix.replace(".", "_")}_{datetime.datetime.now().strftime("%y%m%d_%H%M%S")}.json'
        
        s3_client.put_object(Bucket=output_s3_bucket_name, Key=output_s3_object_key, Body=output_json)
        
        result = {
            "status": "SUCCESS",
            "output_s3_bucket_url": f'https://s3.console.aws.amazon.com/s3/buckets/{output_s3_bucket_name}', 
            "output_s3_object_url": f'https://s3.console.aws.amazon.com/s3/object/{output_s3_bucket_name}?region={region}&bucketType=general&prefix={output_s3_object_key}'
        }
        
    except Exception as ex:
        print(f'Exception: {ex}')
        tb = sys.exc_info()[2]
        err_message = f'Exception: {str(ex.with_traceback(tb))}'
        print(err_message)
        result = {
            "status": "FAIL",
            "error": err_message
        }
        
    return result
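The handler above instructs the model to prefix each answer with a label such as image_title: and then strips that label with removeprefix (available in Python 3.9 and later). That normalization can be factored into a small standalone helper; the name strip_label is my own:

```python
def strip_label(text: str, label: str) -> str:
    # Remove the "label:" prefix the prompt asks the model to emit,
    # plus one following space, mirroring the handler's removeprefix calls.
    return text.removeprefix(f"{label}:").removeprefix(" ")
```

Because removeprefix is a no-op when the prefix is absent, the text is returned unchanged if the model happens to omit the label.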

Execution Details

Parameter Settings

I observed changes in the output by passing event parameters in the following format to the implemented AWS Lambda function and varying their values.
{
  "input_s3_bucket_name": "[Target Amazon S3 bucket to retrieve the image]",
  "input_s3_object_key": "[Target Amazon S3 object key to retrieve the image]",
  "output_s3_bucket_name": "[Amazon S3 bucket to output the result JSON]",
  "output_s3_object_key_prefix": "[Amazon S3 object key prefix to output the result JSON]",
  "nova_pro_model_id": "amazon.nova-pro-v1:0",
  "nova_pro_temperature": 0.7,
  "nova_pro_top_p": 0.9,
  "nova_pro_top_k": 50,
  "nova_pro_max_tokens": 5120,
  "image_title_prompt": "Please provide a title for this image.",
  "image_description_prompt": "Please provide a brief description of the title for this image.",
  "image_ocr_prompt": "Please extract all text contained in this image."
}
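For repeated experiments, an event in this format can also be assembled programmatically and passed to the deployed function with boto3. The following is a minimal sketch; the helper build_test_event and the Lambda function name "nova-pro-image-analysis" are my own placeholders, not part of the deployment described above:

```python
import json

def build_test_event(input_bucket: str, input_key: str, output_bucket: str, **overrides) -> dict:
    # Assemble an event payload in the format the AWS Lambda function expects.
    # The defaults mirror the example values above; pass overrides to vary them.
    event = {
        "input_s3_bucket_name": input_bucket,
        "input_s3_object_key": input_key,
        "output_s3_bucket_name": output_bucket,
        "nova_pro_model_id": "amazon.nova-pro-v1:0",
        "nova_pro_temperature": 0.7,
        "nova_pro_top_p": 0.9,
        "nova_pro_top_k": 50,
        "nova_pro_max_tokens": 5120
    }
    event.update(overrides)
    return event

# Invoking the deployed function with boto3 (the function name is a placeholder):
# import boto3
# lambda_client = boto3.client('lambda')
# response = lambda_client.invoke(
#     FunctionName='nova-pro-image-analysis',
#     Payload=json.dumps(build_test_event('my-input-bucket', 'blog_top.png', 'my-output-bucket'))
# )
```

Keeping the defaults in one place makes it easy to sweep a single parameter, such as temperature, while holding the rest constant.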
Below, I introduce the titling, commentary, and OCR output for an image, using Amazon Nova Pro with max_tokens set to its maximum of 5120 and the other parameters left at their defaults.

Input Data

As an example of input data, I used a screenshot image of the top page of my Personal Tech Blog, including the blog description and article titles.
This example aims to assess the ability to accurately recognize standard document images with clear visibility.

Execution Results

Example: Titling, Commentary, and OCR for an English-Language Blog Article Image

* Input Data: Image
Personal Tech Blog | hidekazu-konishi.com
* Output Data: Image Title, Commentary, OCR
{
    "image_title": "Personal Tech Blog | hidekazu-konishi.com",
    "image_description": "Personal Tech Blog | hidekazu-konishi.com - Personal Tech Blog I hidekazu-konishi.com",
    "image_ocr": "hidekazu-konishi.com HOME > Personal Tech Blog Personal Tech Blog | hidekazu-konishi.com Here I plan to share my technical knowledge and experience, as well as my interests in the subject. Please note that this tech blog is a space for sharing my personal views and ideas, and it does not represent the opinions of any company or organization I am affiliated with. The main purpose of this blog is to deepen my own technical skills and knowledge, to create an archive where I can record and reflect on what I have learned and experienced, and to share information. My interests are primarily in Amazon Web Services (AWS), but I may occasionally cover other technical topics as well. The articles are based on my personal learning and practical experience. Of course, I am not perfect, so there may be errors or inadequacies in the articles. I hope you will enjoy this technical blog with that in mind. Thank you in advance. Privacy Policy Personal Tech Blog Entries First Published: 2022-04-30 Last Updated: 2024-03-14 Setting up DKIM, SPF, DMARC with Amazon SES and Amazon Route 53 - An Overview of DMARC Parameters and Configuration Examples Summary of AWS Application Migration Service (AWS MGN) Architecture and Lifecycle Relationships, Usage Notes - Including Differences from AWS Server Migration Service (AWS SMS) Basic Information about Amazon Bedrock with API Examples - Model Features, Pricing, How to Use, Explanation of Tokens and Inference Parameters Summary of Differences and Commonalities in AWS Database Services using the Quorum Model - Comparison Charts of Amazon Aurora, Amazon DocumentDB, and Amazon Neptune AWS Amplify Features Focusing on Static Website Hosting - Relationship and Differences between AWS Amplify Hosting and AWS Amplify CLI Host a Static Website configured with Amazon S3 and Amazon CloudFront using AWS Amplify CLI Host a Static Website using AWS Amplify Hosting in the AWS Amplify Console Reasons for Continually Obtaining All AWS 
Certifications, Study Methods, and Levels of Difficulty Summary of AWS CloudFormation StackSets Focusing on the Relationship between the Management Console and API, Account Filter, and the Role of Parameters AWS History and Timeline regarding AWS Key Management Service - Overview, Functions, Features, Summary of Updates, and Introduction to KMS AWS History and Timeline regarding Amazon EventBridge - Overview, Functions, Features, Summary of Updates, and Introduction AWS History and Timeline regarding Amazon Route 53 - Overview, Functions, Features, Summary of Updates, and Introduction AWS History and Timeline regarding AWS Systems Manager - Overview, Functions, Features, Summary of Updates, and Introduction to SSM AWS History and Timeline regarding Amazon S3 - Focusing on the evolution of features, roles, and prices beyond mere storage How to create a PWA(Progressive Web Apps) compatible website on AWS and use Lighthouse Report Viewer AWS History and Timeline - Almost All AWS Services List, Announcements, General Availability(GA) Written by Hidekazu Konishi HOME > Personal Tech Blog Copyright \u00a9 Hidekazu Konishi ( hidekazu-konishi.com ) All Rights Reserved."
}

Verification of Execution Results

For the OCR process in the execution example above, I verified how accurately the text contained in the image was extracted by comparing the actual blog article's text with the text output to image_ocr.
The upper half of the following image shows the actual text of the blog article, while the lower half displays the text output to image_ocr.
Verification of OCR Processing Results by Amazon Nova Pro
In text extraction from images using Amazon Nova Pro, the model output some non-body text, such as "hidekazu-konishi.com", at the beginning, but it read all of the characters in the main text correctly.
By contrast, in previous trials, Anthropic Claude 3.5 Sonnet would more frequently output Unicode characters as literal escape sequences such as "\u2022", and would render line breaks in the image text as literal "\n" characters unless specifically instructed otherwise.
In this trial with Amazon Nova Pro, such outputs were not present.
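If a model does return literal escape sequences such as "\u2022" or "\n" in its OCR text, they can be decoded in a post-processing step. The following is a minimal sketch under that assumption; the helper name decode_literal_escapes is my own:

```python
import re

def decode_literal_escapes(text: str) -> str:
    # Replace literal "\uXXXX" sequences with the character they denote,
    # then turn literal "\n" sequences into real line breaks.
    text = re.sub(r"\\u([0-9a-fA-F]{4})", lambda m: chr(int(m.group(1), 16)), text)
    return text.replace("\\n", "\n")
```

This handles only the two escape forms observed in the earlier trials; other escapes would need additional rules.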


Summary

In this article, I introduced how to use Amazon Bedrock's Amazon Nova Pro for image titling, commentary, and OCR.

Through this trial, it was demonstrated that Amazon Nova Pro's image recognition capabilities can recognize the text in a standard blog article image with high accuracy, making this a viable use case for practical application. I plan to continue monitoring Amazon Bedrock updates and exploring implementation methods and potential combinations with other services.

Written by Hidekazu Konishi


Copyright © Hidekazu Konishi ( hidekazu-konishi.com ) All Rights Reserved.