Craft TODO

TODO : list of open questions + decide where this content should live

  • how shall we map the buckets?
    • 1 bucket for the deployed build objects (JS files)
    • 1 bucket per client to store the artefact content
    • shall we provide access to each front deployment inside the client bucket or in a dedicated one?
    • which bucket configuration shall we propose (archiving, access rights, ...)?
  • clarify the URI rules for the proxy to the buckets
    • client_x.safenai.io/ : maps to client web site

    • client_x.safenai.io/artefacts/ : maps to the artefact tree associated with the client

    • common.safeanai.io/bundle/ : maps to the build version of web apps

    • Front shall request data on demand with a caching strategy

      • Blocks neither too big nor too small
      • Check how we manage cache strategy inside Navigator.
    • Package data that is requested together to prevent fragmentation

      • Each "version" shall have

        • a package of all essential data in one file
        • a package of all traceability / decisions associated with it
          • how to identify data that should not be stored inside the version package
          • integrate a small definition of each artefact in this file
      • Package datasets with the samples that are referenced inside the version

        • TODO : how to reuse the data platform => urgent action
      • All data is stored in S3; some of it is loaded into the database to speed up read access

        • With a tree structure of versions inside the bucket

          • shall represent the following path
            • project_name
            • version_id
            • Artefact_type
            • Artefact_id
        • With access to data outside the project / shared between projects : reuse the data platform ?

          • data samples and dataset
            • data samples :
              • raw data + acquisition metadata (immutable)
              • metadata added to the raw data (including annotations) (changes the version of the data sample)
              • metrics computed for a sample (associated with a model / tools / component version => to manage ?)
            • dataset
              • list of used data samples (version of the dataset)
              • metrics computed on the dataset (versioned with the associated tools + metrics)
            • foundation models

        Questions :

        • How to manage access rights when the bucket is accessed over HTTPS ?
        • Shall we use the S3 API instead of HTTPS from the front ?
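
The URI rules and the bucket tree above could be sketched as follows. This is only an illustration under stated assumptions, not a decided design: the names `resolve`, `artefact_key`, `BucketTarget` and `BUILD_BUCKET` are hypothetical, the client web site is assumed to live under a `site/` prefix of the client bucket (the question of a dedicated bucket is still open), and the client bucket name is assumed to be the subdomain with underscores replaced by hyphens, because S3 bucket names cannot contain underscores.

```python
from typing import NamedTuple

# Hypothetical name of the bucket holding the built web apps (JS files).
BUILD_BUCKET = "build-bundles"


class BucketTarget(NamedTuple):
    bucket: str
    key: str


def artefact_key(project_name: str, version_id: str,
                 artefact_type: str, artefact_id: str) -> str:
    """Build the object key: project_name/version_id/Artefact_type/Artefact_id."""
    return "/".join((project_name, version_id, artefact_type, artefact_id))


def resolve(host: str, path: str) -> BucketTarget:
    """Map a proxied request (host + path) to a bucket and an object key."""
    subdomain = host.split(".")[0]
    path = path.lstrip("/")

    if subdomain == "common":
        # common.<domain>/bundle/... maps to the build version of the web apps.
        if not path.startswith("bundle/"):
            raise ValueError(f"unsupported path on the common host: /{path}")
        return BucketTarget(BUILD_BUCKET, path[len("bundle/"):])

    # Assumption: one bucket per client, named after the subdomain.
    bucket = subdomain.replace("_", "-")
    if path.startswith("artefacts/"):
        # <client>.<domain>/artefacts/... maps to the client's artefact tree.
        return BucketTarget(bucket, path)
    # Anything else is the client web site (assumed "site/" prefix).
    return BucketTarget(bucket, "site/" + (path or "index.html"))


if __name__ == "__main__":
    key = artefact_key("demo_project", "v1", "model", "m42")
    print(resolve("client_x.safenai.io", "/artefacts/" + key))
    print(resolve("client_x.safenai.io", "/"))
```

Keeping the domain out of the rules (only the first host label is inspected) means the same mapping works whichever domain the proxy is finally deployed on.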

TODO for blueprint

  • Model training

    • Training diffusion / GAN models ==> Learn how to generate samples in the distribution of each subsample ==> add context to the generation from a context computed beforehand ==> Train the model to identify the context (whether intrinsic or computed) ==> Split the latent space between

      • Contextual information discovery
      • Robust to perturbation information
      • perturbation evaluation (attack intensity)
    • Training anomaly detection models ==> Add a dedicated head to identify how difficult the sample is to reconstruct (unknown unknowns) ==> dedicated objectives: probability of anomaly + type of anomaly ==> location / position, features that lead to an anomaly ==>

    • Training an OOD model (out-of-distribution detection model) ==> Using specifically generated and/or identified OOD-domain samples ==> Classifier

  • How data are managed

    • Data shall be imported in KS
    • A data processing pipeline shall be used to transform / complete a Data Example ==> Configuration of the pipeline
      • possible trigger (on new data)
      • on data transform change
      • cache storage / traceability policy (keep a reference to the original data or a copy)
      • can rely on a human (annotation) : HOW ?
      • DAG creation (Airflow ? / MLflow ?)
      • Generate a graph of data dependencies
    • Use generic dataset to build auxiliary models
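
The data dependency graph and the "re-run on change" trigger mentioned above could look like the sketch below. It uses only the standard library (`graphlib`, Python 3.9+) rather than Airflow or MLflow, since that choice is still open. The step names in `PIPELINE` and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "raw_import": set(),
    "normalise": {"raw_import"},
    "human_annotation": {"normalise"},
    "compute_metrics": {"normalise", "human_annotation"},
}


def execution_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def steps_to_rerun(pipeline, changed):
    """Return the changed step and everything downstream, in execution order."""
    order = execution_order(pipeline)
    dirty = {changed}
    for step in order:
        # A step is dirty as soon as one of its dependencies is dirty.
        if pipeline.get(step, set()) & dirty:
            dirty.add(step)
    return [step for step in order if step in dirty]


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(steps_to_rerun(PIPELINE, "normalise"))
```

The same dictionary doubles as the traceability record: for any produced Data Example, walking the dependencies backwards gives the origin data it was derived from.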

How to get information to build the data process

- ODD perturbations to the acquisition process
- Do you already know factors that influence anomalies ?
- Do you have a list of anomalies ?
- Is there a description of these types of anomalies ?
- Can you provide a description of several anomalies ?
- Can you provide this data ?
- As much data without anomalies as possible
- At least several anomalies already detected

Project steps

- gather preliminary information from craftsmanship with several samples
- create a preliminary data pipeline
- gather a first batch of samples (valid / invalid samples already identified is optional)
- gather several samples of anomalies, with a description of the anomalies

- using klarity split to try to identify in the valid data (with or without anomalies) 
	| train several models on different subsets and use the other subsets to identify anomalies
- generate a bunch of primary metrics
	- density in the distance evaluation space (t-SNE or equivalent)
	- several generated valid samples to check problem understanding / use them as anomalies / reject them ==> classifier of generated samples
	- Using different classifications during training and/or between models, select samples of interest
	- Anomaly metrics identification
	==> Confusion matrix
	==> Conformal prediction probability
	
	
- proposal of a multi-dimensional data generator for annotation / comment
		- based on the split + t-SNE density of the auto-encoder generator
		- ask the user to try to name / describe these dimensions
		- identify zones of weak density in this distribution and ask the user the following
			- Is it inside the ODD or OOD ?
			- Do you have more samples of this type ?
			- Are these generated samples representative / close ?

- Use a first annotation job (or use the deployed shadow version to gather annotations)
- Using this batch, generate the first realistic metrics associated with the AI component
	- Metrics for expert evaluation of FP / TN
	- Metrics for dataset quality evaluations
	- Metrics for ODD coverage
	- Projection of the expected error rate if put in production
	- Comparison with the operator error rate.
	- Generate a benefit / cost metric based on these parameters provided in scoping
		- Ratio of quality check post production
		- Cost of a FP Recall
		- Cost of a FP undetected
		- Cost of a TN rework generated
		- Cost of operator validation
		- Cost of quality inspection of a sample
	- Identify several possible actions to improve the AI Component
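
Two of the primary metrics named in the steps above, the confusion matrix and the conformal prediction probability, could be computed as in this sketch. It assumes that "conformal prediction probability" means a split-conformal p-value derived from anomaly scores of known-valid calibration samples; the function names, the example scores and the threshold `alpha` are hypothetical.

```python
from collections import Counter


def conformal_p_value(calibration_scores, score):
    """Split-conformal p-value of a new anomaly score.

    calibration_scores: anomaly scores of held-out samples known to be valid.
    A small p-value means the new sample looks unlike the valid samples.
    """
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= score)
    return (at_least_as_extreme + 1) / (len(calibration_scores) + 1)


def is_anomaly(calibration_scores, score, alpha=0.25):
    """Flag the sample when its p-value is at or below alpha (assumed threshold)."""
    return conformal_p_value(calibration_scores, score) <= alpha


def confusion_matrix(y_true, y_pred):
    """Count (true label, predicted label) pairs; True means anomaly."""
    return Counter(zip(y_true, y_pred))


if __name__ == "__main__":
    calibration = [0.1, 0.2, 0.3, 0.4]   # scores of valid samples
    new_scores = [0.05, 0.35, 0.9]       # scores of samples to check
    labels = [False, False, True]        # expert annotations
    predictions = [is_anomaly(calibration, s) for s in new_scores]
    print(predictions)
    print(confusion_matrix(labels, predictions))
```

With such a p-value, the false alarm rate on valid samples is bounded by `alpha` as long as the calibration samples and the new valid samples are exchangeable, which is one way to project an expected error rate before going to production.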

Value proposal :

- We configure / deploy our solution in interaction with your production environment
- You interact with our product by providing samples, several annotations and expert knowledge
- Our product, with the support of our team, generates an AI component
- We provide the component to be deployed in your environment, or we provide a per-sample inference cost.
- We can put a key on the ROI generated (several cost / benefit models), for example :
	- Inference cost is linked to the performance and ROI computed with the client : shared interest in performing actions that improve the performance.

TODO : ==> Add specific metrics that provide a template rationale to gather user information, for example

  • the metric is a list of parameters to validate in scoping
  • The metric is displayed as a form, and an action allows custom creation of a rationale
  • The rationale is reviewed and validated

This metric shall be used to gather

- Operating cost metrics
- Process yes/no verifications (AI Act)
- Risk evaluation (High / Low) ...
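
A template metric of this kind could be modelled as in the sketch below: a list of parameters to fill in (the form), a free-text rationale, and a draft / reviewed / validated life cycle. All class, field and parameter names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Union


class Status(Enum):
    DRAFT = "draft"
    REVIEWED = "reviewed"
    VALIDATED = "validated"


@dataclass
class Parameter:
    """One field of the form: a cost, a yes/no check or a High/Low level."""
    name: str
    value: Optional[Union[float, bool, str]] = None


@dataclass
class RationaleMetric:
    title: str
    parameters: List[Parameter]
    rationale: str = ""
    status: Status = Status.DRAFT

    def missing(self) -> List[str]:
        """Names of the parameters the user has not filled in yet."""
        return [p.name for p in self.parameters if p.value is None]

    def review(self) -> None:
        if self.missing() or not self.rationale:
            raise ValueError("fill every parameter and write the rationale first")
        self.status = Status.REVIEWED

    def validate(self) -> None:
        if self.status is not Status.REVIEWED:
            raise ValueError("the rationale must be reviewed before validation")
        self.status = Status.VALIDATED


if __name__ == "__main__":
    metric = RationaleMetric(
        title="Scoping parameters",
        parameters=[
            Parameter("cost_of_operator_validation"),
            Parameter("process_check_done"),
            Parameter("risk_level"),
        ],
    )
    print(metric.missing())
```

Refusing to review a form with empty parameters keeps the "list of parameters to validate in scoping" and the written rationale in sync.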

QUESTIONS :

  • shall we rename metric as artefact ?