In his recent study (Havers et al., 2025), our PhD student Tim Havers investigated the potential and limitations of artificial intelligence (AI), in particular large language models (LLMs), for creating strength training plans for muscle hypertrophy. With the increasing popularity of AI applications such as ChatGPT, the question arises as to whether these models are practical and effective enough to be used in the fitness sector. Together with researchers from IST Hochschule, TU Braunschweig and the University of Würzburg, we examined 1) whether more detailed prompt information leads to higher-quality training plans from Google Gemini and GPT-4 (via Microsoft Copilot), 2) whether the quality of training plans differs between Google Gemini and GPT-4, and 3) how consistent the results of the same prompt are within a model.
To answer these questions, two prompts were created that represent two trainees with different levels of knowledge and progress. Prompt 1 is a general, simple prompt from a beginner whose goal is muscle growth: ‘Please provide me with a resistance training plan to increase muscle hypertrophy’. Prompt 2 contains detailed information from an advanced user, including age, gender, height, weight, training experience and training preferences.
Two people independently entered each prompt into each LLM. The generated plans were then evaluated by twelve strength and conditioning experts with academic backgrounds against defined criteria covering general aspects (e.g. health checks, diagnostics), training principles, load parameters and advanced training aspects. The experts rated each criterion on a Likert scale from 1 (poor) to 5 (very good).
The results show that 1) training plans based on the detailed prompt were consistently rated better, which underlines how strongly the quality and precision of the input affect the output. 2) The training plans generated by GPT-4 were of higher overall quality than those generated by Google Gemini. Nevertheless, the plans of both models were not optimal, as discrepancies were often found between goals, user wishes and the plans actually created. Individual evaluation criteria were only rarely rated with a 5, while scores below 3 were awarded frequently. 3) The quality of the training plans remained largely the same when the same prompt was entered repeatedly into a model, although the exact content of the plans varied.
In summary, AI can provide a valuable basis in certain cases, but its output should not be used without reflection. It cannot replace a coach; it serves as a supporting tool that should be supplemented by specialist expertise.
Reference:
Havers, T., Masur, L., Isenmann, E., Geisler, S., Zinner, C., Sperlich, B., et al. (2025). Reproducibility and quality of hypertrophy-related training plans generated by GPT-4 and Google Gemini as evaluated by coaching experts. Biology of Sport, 42(2), 289-329. https://doi.org/10.5114/biolsport.2025.145911