==========
Quickstart
==========

Install
=======

Clone this repository. Then assuming `conda`_ is available, run::

    make create_environment

to create a conda environment called ``fme`` with dependencies and source code installed.
Alternatively, a Docker image can be built with ``make build_docker_image``.
You may verify installation by running ``pytest fme/``.

.. _conda: https://docs.conda.io/en/latest/

Wandb Integration
=================

For the optional Weights and Biases (wandb) integration, you will need to set the API key::

    export WANDB_API_KEY=wandb-api-key

where ``wandb-api-key`` is created and retrieved from the "API Keys" section of the `Wandb`_ settings page.

.. _Wandb: https://wandb.ai/settings

Commands
========

The following commands are available, and can be run with ``--help`` for more information:

- ``python3 -m fme.ace.validate_config`` - Validate a configuration file
- ``python3 -m fme.ace.train`` - Train a model
- ``python3 -m fme.ace.inference`` - Run a saved model checkpoint
- ``python3 -m fme.ace.evaluator`` - Run a saved model checkpoint and compare to target data

Running a Checkpoint
====================

To run a model checkpoint, you need an initial conditions file containing all model inputs, and a forcing dataset containing all input-only variables.
The files may include more variables, as in the example datasets below, but only the required variables will be used.
The code will run an ensemble of predictions starting from each time specified in the initial conditions file.
The forcing dataset must contain data for the times specified in the initial conditions file, as well as all timesteps required for the prediction period.

An initial conditions file is available via a public `requester pays`_ Google Cloud Storage bucket:

.. code-block:: bash

    gsutil -u YOUR_GCP_PROJECT cp gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/data/repeating-climSST-1deg-netCDFs/initial_condition/ic_0011_2021010100.nc initial_condition.nc

The checkpoint and a 1-year subsample of the validation data are available at this `Zenodo repository`_.
This validation data can be used as forcing data for the checkpoint.

Alternatively, the complete dataset is available via a public `requester pays`_ Google Cloud Storage bucket.
For example, the 10-year validation data (approx. 190GB) can be downloaded with:

.. code-block:: bash

    gsutil -m -u YOUR_GCP_PROJECT cp -r gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/data/repeating-climSST-1deg-netCDFs/validation .

It is possible to download only a portion of the dataset, but the downloaded data must span the desired prediction period.
The checkpoint is also available on GCS at ``gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/checkpoints/ace_ckpt.tar``.

.. _Zenodo repository: https://zenodo.org/doi/10.5281/zenodo.10791086
.. _requester pays: https://cloud.google.com/storage/docs/requester-pays

Save a ``config-inference.yaml`` file based on the :ref:`example config ` with updated paths for the downloaded data.
Then in the ``fme`` conda environment, run inference with:

.. code-block:: bash

    python -m fme.ace.inference config-inference.yaml

See the :ref:`inference-config` section for more information on the configuration.
If you run into configuration issues, you can validate your configuration with

.. code-block:: bash

    python -m fme.ace.validate_config config-inference.yaml --config_type inference

.. tip::

    While inference can be performed without a GPU, it may be very slow. If running
    on a Mac, set the environment variable with ``export FME_USE_MPS=1`` to enable the
    `Metal Performance Shaders`_ framework for GPU acceleration. Note this backend is
    not fully featured and it may not work with all inference features or for training.

.. _Metal Performance Shaders: https://developer.apple.com/metal/pytorch/

Evaluating a Checkpoint
=======================

When target data is available, it is possible to evaluate the model using the ``fme.ace.evaluator`` module.
This requires a dataset including all input and output variables for the prediction period.

The checkpoint and a 1-year subsample of the validation data are available at this `Zenodo repository`_.
Download these to your local filesystem.

Alternatively, the complete dataset is available via a public `requester pays`_ Google Cloud Storage bucket.
For example, the 10-year validation data (approx. 190GB) can be downloaded with:

.. code-block:: bash

    gsutil -m -u YOUR_GCP_PROJECT cp -r gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/data/repeating-climSST-1deg-netCDFs/validation .

Save a ``config-evaluator.yaml`` file based on the :ref:`example config ` with updated paths for the downloaded data.
Then in the ``fme`` conda environment, run evaluation with:

.. code-block:: bash

    python -m fme.ace.evaluator config-evaluator.yaml

If you run into configuration issues, you can validate your configuration with

.. code-block:: bash

    python -m fme.ace.validate_config config-evaluator.yaml --config_type evaluator

Training a Model
================

Like evaluation, training a model requires datasets with all input and output variables.
The complete training dataset is available via a public `requester pays`_ Google Cloud Storage bucket.
Note that the dataset is large, so it may take a long time to download and may incur significant transfer costs.
The 100-year training data (approx. 1.9 TB) can be downloaded with:

.. code-block:: bash

    gsutil -m -u YOUR_GCP_PROJECT cp -r gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/data/repeating-climSST-1deg-netCDFs/train .
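
Because the training data is large, it can help to confirm that the target filesystem has enough free space before starting the download.
The following is a minimal sketch using only the Python standard library.
The helper name ``has_enough_space`` is hypothetical and not part of ``fme``, and the size constant assumes that "1.9 TB" means decimal terabytes.

.. code-block:: python

    import shutil

    # Approximate size of the 100-year training data quoted above,
    # assuming decimal units (1 TB = 10**12 bytes).
    TRAIN_BYTES = 1_900_000_000_000


    def has_enough_space(path: str, required_bytes: int) -> bool:
        """Return True if the filesystem holding ``path`` has at least
        ``required_bytes`` of free space."""
        return shutil.disk_usage(path).free >= required_bytes


    if __name__ == "__main__":
        # Check the current directory, which is where the gsutil command
        # above places the downloaded ``train`` directory.
        if has_enough_space(".", TRAIN_BYTES):
            print("Enough free space for the training data.")
        else:
            print("Not enough free space in the current directory.")
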
It is advisable to use a separate dataset for validation.
The 10-year validation data (approx. 190GB) can be downloaded with:

.. code-block:: bash

    gsutil -m -u YOUR_GCP_PROJECT cp -r gs://ai2cm-public-requester-pays/2023-11-29-ai2-climate-emulator-v1/data/repeating-climSST-1deg-netCDFs/validation .

You will also require scaling files (``centering.nc`` and ``scaling.nc`` in the example training config) containing scalar values for the mean and standard deviation of each input and output variable.
These are generated using the script located at ``scripts/data_process/get_stats.py``.

.. testcode::
    :hide:

    from fme.ace import TrainConfig
    import yaml
    import dacite

    with open('train-config.yaml', 'r') as f:
        config_dict = yaml.safe_load(f)

    config = dacite.from_dict(
        TrainConfig,
        data=config_dict,
        config=dacite.Config(strict=True)
    )
    # These are referenced in the paragraph just above, if they change then
    # update both the docs and this test!
    print(config.stepper.normalization.global_means_path)
    print(config.stepper.normalization.global_stds_path)

.. testoutput::
    :hide:

    centering.nc
    scaling.nc

Save a ``config-train.yaml`` file based on the :ref:`example config ` with updated paths for the downloaded data.
Then in the ``fme`` conda environment, run training with:

.. code-block:: bash

    torchrun --nproc_per_node RANK_COUNT -m fme.ace.train config-train.yaml

where ``RANK_COUNT`` is the number of processes to launch.
This will typically be the number of GPUs you have available.
If running on a single GPU, you can omit the ``torchrun`` command and use ``python -m`` instead.

If you run into configuration issues, you can validate your configuration with

.. code-block:: bash

    python -m fme.ace.validate_config config-train.yaml --config_type train
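
The launch rule described in this section (``torchrun`` for several processes, plain ``python -m`` for a single one) can be sketched as a small helper.
This is an illustration only: the function name ``build_train_command`` is hypothetical and not part of ``fme``, and the sketch merely assembles the command lines shown above without running them.

.. code-block:: python

    import shlex
    from typing import List


    def build_train_command(config_path: str, rank_count: int) -> List[str]:
        """Build the argument list for launching ``fme.ace.train``."""
        if rank_count < 1:
            raise ValueError("rank_count must be at least 1")
        if rank_count == 1:
            # Single process: plain ``python -m`` is sufficient.
            return ["python", "-m", "fme.ace.train", config_path]
        # Several processes: torchrun starts one process per rank.
        return [
            "torchrun",
            "--nproc_per_node",
            str(rank_count),
            "-m",
            "fme.ace.train",
            config_path,
        ]


    # Print the commands for a single-GPU and a four-GPU machine.
    print(shlex.join(build_train_command("config-train.yaml", 1)))
    print(shlex.join(build_train_command("config-train.yaml", 4)))

The returned list can be passed directly to ``subprocess.run`` if you prefer to launch training from a Python script.
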