Craft TODO
TODO: list the open questions and decide where to put this content
- how shall we map the buckets?
- 1 bucket for the built and deployed objects (js files)
- 1 bucket per client to store the artefact content.
- shall we provide access to each front-end deployment inside the client bucket or in a dedicated one?
- which bucket configuration shall we propose (archiving, access rights, ...)?
- clarify the URI rules for the proxy to the buckets
client_x.safenai.io/ : maps to the client web site
client_x.safenai.io/artefacts/ : maps to the artefact tree associated with the client
common.safeanai.io/bundle/ : maps to the built versions of the web apps
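A minimal sketch of how the proxy could resolve a request to a bucket and key under the URI rules above. The bucket names (`build-bundles`, `<client>-site`, `<client>-artefacts`) and the choice of a dedicated site bucket are assumptions, since the bucket mapping is still an open question. Only the first host label is inspected, so the domain itself is not hard-coded.

```python
from typing import NamedTuple


class BucketTarget(NamedTuple):
    bucket: str  # hypothetical bucket name, not decided in these notes
    key: str     # object key inside the bucket


def resolve(host: str, path: str) -> BucketTarget:
    """Map a request (host, path) to a bucket and key, following the URI rules."""
    subdomain = host.split(".", 1)[0]  # "client_x" or "common"
    key = path.lstrip("/")

    if subdomain == "common":
        if not key.startswith("bundle/"):
            raise ValueError(f"no rule for {host}{path}")
        return BucketTarget("build-bundles", key[len("bundle/"):])

    # S3 bucket names cannot contain underscores, so normalise the client id.
    client = subdomain.replace("_", "-")
    if key.startswith("artefacts/"):
        return BucketTarget(f"{client}-artefacts", key[len("artefacts/"):])

    # Everything else under the client host belongs to the client web site.
    return BucketTarget(f"{client}-site", key or "index.html")
```

Keeping the rules in one pure function makes them easy to test before they are translated into the actual proxy configuration.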
The front end shall request data on demand with a caching strategy
- Blocks neither too big nor too small
- Check how the cache strategy is managed inside Navigator.
Package together the data that is requested together, to prevent fragmentation
Each "version" shall have
- a package of all essential data in one file
- a package of all the traceability / decisions associated with it
- how to identify the data that must not be stored inside the version package?
- integrate a small definition of each artefact in this file
Package datasets with the samples that are referenced inside the version
- TODO: how to reuse the data platform => urgent action
All data is stored in S3; some of it is loaded into the database to speed up read access
With a tree of versions inside the bucket
- shall follow this path
  - project_name
    - version_id
      - Artefact_type
        - Artefact_id
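The path above can be sketched as a small helper that builds and parses object keys. The class name, the `/` separator and the validation rule are assumptions made for illustration; only the four path levels come from the notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtefactRef:
    """Location of one artefact: project_name/version_id/Artefact_type/Artefact_id."""
    project_name: str
    version_id: str
    artefact_type: str
    artefact_id: str

    def key(self) -> str:
        """Build the object key; reject empty components or ones containing '/'."""
        parts = (self.project_name, self.version_id,
                 self.artefact_type, self.artefact_id)
        for part in parts:
            if not part or "/" in part:
                raise ValueError(f"invalid path component: {part!r}")
        return "/".join(parts)

    @classmethod
    def from_key(cls, key: str) -> "ArtefactRef":
        """Parse a key produced by `key()` back into its four components."""
        parts = key.split("/")
        if len(parts) != 4 or not all(parts):
            raise ValueError(f"expected 4 path components, got: {key!r}")
        return cls(*parts)
```

Listing everything under the prefix `project_name/version_id/` would then return all artefacts of one version, which fits the "package per version" idea.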
With access to data outside the project / shared between projects: reuse the data platform?
- data samples and datasets
  - data samples:
    - raw data + acquisition metadata (immutable)
    - metadata added to the raw data, including annotations (changes the version of the data sample)
    - metrics computed for a sample (associated with a model / tool / component version => to be managed?)
  - dataset:
    - list of the data samples used (version of the dataset)
    - metrics computed on the dataset (versioned with the associated tools + metrics)
  - foundation models
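A minimal sketch of the versioning rules listed above: raw data and acquisition metadata never change, added metadata produces a new sample version, and a dataset version is determined by the exact sample versions it uses. The class, the field names and the hash-based dataset version are assumptions, not a decided design.

```python
import hashlib
from dataclasses import dataclass, field, replace
from typing import Mapping


@dataclass(frozen=True)
class DataSample:
    sample_id: str
    raw_uri: str                          # raw data, never rewritten
    acquisition_meta: Mapping[str, str]   # immutable acquisition metadata
    added_meta: Mapping[str, str] = field(default_factory=dict)
    version: int = 1

    def with_meta(self, **changes: str) -> "DataSample":
        """Return a new sample version; raw data and acquisition metadata are untouched."""
        merged = {**self.added_meta, **changes}
        return replace(self, added_meta=merged, version=self.version + 1)


def dataset_version(samples: list[DataSample]) -> str:
    """Derive a dataset version from the sample versions it uses (assumed scheme)."""
    refs = sorted(f"{s.sample_id}@{s.version}" for s in samples)
    return hashlib.sha256("\n".join(refs).encode()).hexdigest()[:12]
```

With this scheme, re-annotating a single sample changes the dataset version, so metrics computed on the dataset can always be tied to a precise list of sample versions.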
Questions:
- How to manage access rights for HTTPS access to the bucket?
- Shall we use the S3 API instead of HTTPS from the front end?
TODO for blueprint
Model training
Training diffusion / GAN models
==> Learn how to generate samples in the distribution of each subsample
==> Add context to the generation, from a context computed beforehand
==> Train the model to identify the context (whether intrinsic or computed)
==> Split the latent space between
- Contextual information discovery
- Information robust to perturbation
- Perturbation evaluation (attack intensity)
Training anomaly detection models
==> Add a dedicated head to identify how difficult the sample is to reconstruct (unknown unknowns)
==> Dedicated objectives: probability of anomaly + type of anomaly
==> Location / positions, features that lead to an anomaly
==>
Training OOD (out-of-distribution) detection models
==> Using samples specifically generated and/or identified in the OOD domain
==> Classifier
How data is managed
- Data shall be imported into KS
- A data processing pipeline shall be used to transform / complete a Data Example
  ==> Configuration of the pipeline
  - possible triggers: on new data, on data transform change
  - cache storage / traceability policy (keep a reference to the original data, or a copy)
  - can rely on a human (annotation)
  HOW
  - DAG creation (Airflow? / MLflow?)
  - Generate a graph of data dependencies
- Use generic datasets to build auxiliary models
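The "graph of data dependencies" and the "on data transform change" trigger can be illustrated without committing to Airflow or MLflow. This is a minimal sketch using Python's standard `graphlib`; the product names in `DEPENDS_ON` are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each data product lists the products it is derived from.
DEPENDS_ON = {
    "raw": set(),
    "normalized": {"raw"},
    "annotated": {"normalized"},   # may rely on a human annotation step
    "features": {"normalized"},
    "metrics": {"features", "annotated"},
}


def stale_after_change(changed: str, depends_on: dict[str, set[str]]) -> list[str]:
    """Return the products to recompute, in a valid execution order, when `changed` changes."""
    order = list(TopologicalSorter(depends_on).static_order())
    stale = {changed}
    # `order` lists predecessors first, so one pass propagates staleness downstream.
    for node in order:
        if depends_on.get(node, set()) & stale:
            stale.add(node)
    return [node for node in order if node in stale and node != changed]
```

The same dependency dictionary could later be translated into tasks of whichever orchestrator is chosen. The point here is only that cache invalidation follows directly from the dependency graph.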
How to get information to build the data process
- ODD perturbations to the acquisition process
- Do you already know the factors that influence anomalies?
- Do you have a list of anomalies?
- Is there a description of these types of anomalies?
- Can you provide a description of several anomalies?
- Can you provide this data?
  - As much data without anomalies as possible
  - At least several anomalies already detected
Project steps
- gather preliminary information from craftsmanship with several samples
- create a preliminary data pipeline
- gather a first batch of samples (already identified valid / invalid samples are optional)
- gather several samples of anomalies, with a description of the anomalies
- using the klarity split, try to identify anomalies in the valid data (which may or may not contain anomalies)
  - train several models on different subsets and use the other subsets to identify anomalies
- generate a batch of primary metrics
  - density in the distance evaluation space (t-SNE or equivalent)
  - several generated valid samples to check the understanding of the problem / use them as anomalies / reject them ==> classifier of generated samples
- using the differences in classification during training and/or between models, select samples of interest
- Anomaly metrics identification
  ==> Confusion matrix
  ==> Conformal prediction probability
- propose a multi-dimensional data generator for annotation / comments
  - based on the split + t-SNE density of the autoencoder generator
  - ask the user to try to name / describe these dimensions
  - identify zones of low density in this distribution and ask the user the following
    - Is it inside the ODD or OOD?
    - Do you have more samples of this type?
    - Are these generated samples representative / close?
- Use a first annotation job (or use the deployed shadow version to gather annotations)
- Using this batch, generate the first realistic metrics associated with the AI component
  - Metrics for expert evaluation of FP / TN
  - Metrics for dataset quality evaluation
  - Metrics for ODD coverage
  - Projection of the expected error rate if put in production
  - Comparison with the operator error rate.
- Generate a Benefit / Cost metric based on these parameters provided in scoping
  - Ratio of post-production quality checks
  - Cost of an FP recall
  - Cost of an undetected FP
  - Cost of the rework generated by a TN
  - Cost of operator validation
  - Cost of quality inspection of a sample
- Identify several possible actions to improve the AI component
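The "conformal prediction probability" listed among the anomaly metrics above could be computed as a conformal p-value. The sketch below assumes an anomaly score where higher means more anomalous (for example a reconstruction error) and a calibration set of known valid samples that was not used for training. The function names and the threshold `alpha` are illustrative.

```python
def conformal_p_value(calibration_scores: list[float], score: float) -> float:
    """Fraction of calibration scores at least as extreme as `score` (with +1 smoothing)."""
    n_at_least = sum(1 for c in calibration_scores if c >= score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def is_anomaly(calibration_scores: list[float], score: float, alpha: float = 0.05) -> bool:
    """Flag the sample when its p-value is at most `alpha`."""
    return conformal_p_value(calibration_scores, score) <= alpha
```

If the new sample is exchangeable with the calibration samples, the rate of valid samples wrongly flagged stays at or below `alpha`, which gives the expert a directly interpretable setting. The smallest reachable p-value is 1 / (n + 1), so the calibration set has to be large enough for the chosen `alpha`.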
Value proposition:
- We configure / deploy our solution in interaction with your production environment
- You interact with our product by providing samples, several annotations and expert knowledge
- Our product, with the support of our team, generates an AI component
- We provide the component to be deployed in your environment, or we offer a per-sample inference cost.
- We can tie our pricing to the ROI generated (several cost / benefit models), for example:
  - The inference cost is linked to the performance and the ROI computed with the client, giving a shared interest in actions that improve the performance.
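One possible reading of such a cost / benefit model, as a sketch only: compare the expected cost per sample with and without the AI component, and relate the saving to the inference cost. The three unit costs and the rates below are simplified stand-ins, not the scoping parameters themselves, and every name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopingCosts:
    """Unit costs collected during scoping (placeholder fields)."""
    undetected_defect: float    # a defective sample that is not caught
    needless_rework: float      # a good sample sent to rework
    operator_validation: float  # an operator validating one sample


def cost_per_sample(miss_rate: float, false_alarm_rate: float,
                    review_rate: float, costs: ScopingCosts) -> float:
    """Expected cost per produced sample for one inspection setup."""
    return (miss_rate * costs.undetected_defect
            + false_alarm_rate * costs.needless_rework
            + review_rate * costs.operator_validation)


def benefit_cost_ratio(baseline_cost: float, component_cost: float,
                       inference_cost: float) -> float:
    """Saving per sample brought by the component, divided by its inference cost."""
    return (baseline_cost - component_cost) / inference_cost
```

A ratio above 1 means the component saves more than it costs per sample. Linking the inference price to this ratio is one way to express the shared interest in improving performance.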
TODO: ==> Add specific metrics that provide a template rationale to gather user information, for example:
- The metric is a list of parameters to validate in scoping
- The metric is displayed as a form, and an action allows the custom creation of a rationale
- The rationale is reviewed and validated
This metric shall be used to gather
- Operating cost metrics
- Process yes/no verifications (AI Act)
- Risk evaluation (High / Low)
- ...
QUESTIONS:
- shall we rename "metric" to "artefact"?