Application Data Vectorization (ArkTS)

When to Use

In the pivotal shift from digital transformation to AI advancement, creating intelligent services is essential for boosting product competitiveness.

Currently, ArkData Intelligence Platform (AIP) provides application data vectorization, which leverages embedding models to convert multi-modal data such as unstructured text and images into semantic vectors.

Since API version 15, application data vectorization is supported.

Basic Concepts

To get started, it is helpful to understand the following concepts:

Vectorization

The process of vectorization uses embedding models to convert high-dimensional unstructured data (such as text and images) into low-dimensional continuous vector representations. This approach captures the semantic relationships with the data, translating abstract information into a format that can be analyzed and processed by computers. Embedding technology is widely used in fields such as natural language processing (semantic search), image recognition (feature extraction), and recommendation systems (user/item representation).

Multi-Modal Embedding Model

Embedding models are used to implement application data vectorization. The system supports multimodal embedding models, which can map different data modalities, such as text and images, into a unified vector space. These models support both single-modal semantic representation (text-to-text and image-to-image retrieval) and cross-modal capabilities (text-to-image and image-to-text retrieval).

Text Segmentation

To address length limitations when textual data is vectorized, you can use the APIs provided by the ArkData Intelligence Platform (AIP) to split the input text into smaller sections. This approach ensures efficient and effective data vectorization.

Implementation Mechanism

By leveraging the AIP, you can implement intelligent data construction. All these capabilities operate within the application processes, ensuring that data always remains in the application environment. This ensures data security and safeguards user privacy.

Working Principles

Application data vectorization involves converting raw application data into vector formats and storing them in a vector database (store).

Constraints

  • Considering the significant computing workload and resources of data vectorization processing, the APIs are only available to 2-in-1 device applications.
  • You can use NPUs to accelerate the inference process of embedding models. NPUs are recommended because pure CPU computation falls far behind in latency and energy efficiency.
  • The model can process up to 512 characters of text per inference, supporting both Chinese and English.
  • The model can handle images below 20 MB in size in a single inference.

Available APIs

The following table lists the APIs related to application data vectorization. For more APIs and their usage, see ArkData Intelligence Platform.

API Description
getTextEmbeddingModel(config: ModelConfig): Promise<TextEmbedding> Obtains a text embedding model.
loadModel(): Promise<void> Loads the text embedding model.
splitText(text: string, config: SplitConfig): Promise<Array<string>> Splits text.
getEmbedding(text: string): Promise<Array<number>> Obtains the embedding vector of the given text.
getEmbedding(batchTexts: Array<string>): Promise<Array<Array<number>>> Obtains the embedding vector of a given batch of text.
releaseModel(): Promise<void> Releases the text embedding model.
getImageEmbeddingModel(config: ModelConfig): Promise<ImageEmbedding> Obtains an image embedding model.
loadModel(): Promise<void> Loads the image embedding model.
getEmbedding(image: Image): Promise<Array<number>> Obtains the embedding vector of the given image.
releaseModel(): Promise<void> Releases the image embedding model.

How to Develop Text Vectorization

  1. Import the intelligence module.

    import { intelligence } from '@kit.ArkData';
    import { BusinessError } from '@kit.BasicServicesKit';
    
  2. Obtain a text embedding model.

    Use the getTextEmbeddingModel method to obtain a text embedding model. The sample code is as follows:

    let textConfig:intelligence.ModelConfig = {
      version:intelligence.ModelVersion.BASIC_MODEL,
      isNpuAvailable:false,
      cachePath:"/data"
    }
    let textEmbedding:intelligence.TextEmbedding;
    
    intelligence.getTextEmbeddingModel(textConfig)
      .then((data:intelligence.TextEmbedding) => {
        console.info('Succeeded in getting TextModel');
        textEmbedding = data;
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to get TextModel and code is ' + err.code);
        // ...
      })
    
  3. Load the text embedding model.

    Use the loadModel method to load the text embedding model. The sample code is as follows:

    textEmbedding.loadModel()
      .then(() => {
        console.info('Succeeded in loading Model');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to load Model and code is ' + err.code);
        // ...
      })
    
  4. Split text. If the data length exceeds the limit, call splitText() to split the data into smaller text blocks and then vectorize them.

    The sample code is as follows:

    let splitConfig:intelligence.SplitConfig = {
      size:10,
      overlapRatio:0.1
    }
    let splitText = 'text';
    
    intelligence.splitText(splitText, splitConfig)
      .then((data:Array<string>) => {
        console.info('Succeeded in splitting Text');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to split Text and code is ' + err.code);
        // ...
      })
    
  5. Obtain the embedding vector of the given text using the getEmbedding method. The given text can be a single piece of text or a collection of multiple text entries.

    The sample code is as follows:

    let text = 'text';
    textEmbedding.getEmbedding(text)
      .then((data:Array<number>) => {
        console.info('Succeeded in getting Embedding');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to get Embedding and code is ' + err.code);
        // ...
      })
    
    let batchTexts = ['text1','text2'];
    textEmbedding.getEmbedding(batchTexts)
      .then((data:Array<Array<number>>) => {
        console.info('Succeeded in getting Embedding');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to get Embedding and code is ' + err.code);
        // ...
      })
    
  6. Release the text embedding model.

    Use the releaseModel method to release the text embedding model. The sample code is as follows:

    textEmbedding.releaseModel()
      .then(() => {
        console.info('Succeeded in releasing Model');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to release Model and code is ' + err.code);
        // ...
      })
    

How to Develop Image Vectorization

  1. Import the intelligence module.

    import { intelligence } from '@kit.ArkData';
    import { BusinessError } from '@kit.BasicServicesKit';
    
  2. Obtain an image embedding model.

    Use the getImageEmbeddingModel method to obtain an image embedding model. The sample code is as follows:

    let imageConfig:intelligence.ModelConfig = {
      version:intelligence.ModelVersion.BASIC_MODEL,
      isNpuAvailable:false,
      cachePath:"/data"
    }
    let imageEmbedding:intelligence.ImageEmbedding;
    
    intelligence.getImageEmbeddingModel(imageConfig)
      .then((data:intelligence.ImageEmbedding) => {
        console.info('Succeeded in getting ImageModel');
        imageEmbedding = data;
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to get ImageModel and code is ' + err.code);
        // ...
      })
    
  3. Load the image embedding model.

    Use the loadModel method to load the image embedding model. The sample code is as follows:

    imageEmbedding.loadModel()
      .then(() => {
        console.info('Succeeded in loading Model');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to load Model and code is ' + err.code);
        // ...
      })
    
  4. Obtain the embedding vector of the given image.

    Use the getEmbedding method to obtain the embedding vector of the given image. The sample code is as follows:

    let image = 'file://<packageName>/data/storage/el2/base/haps/entry/files/xxx.jpg';
    imageEmbedding.getEmbedding(image)
      .then((data:Array<number>) => {
        console.info('Succeeded in getting Embedding');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to get Embedding and code is ' + err.code);
        // ...
      })
    
  5. Release the image embedding model.

    Use the releaseModel method to release the image embedding model. The sample code is as follows:

    imageEmbedding.releaseModel()
      .then(() => {
        console.info('Succeeded in releasing Model');
        // ...
      })
      .catch((err:BusinessError) => {
        console.error('Failed to release Model and code is ' + err.code);
        // ...
      })