Interaction with the Neural Network Model
Bold text is used to represent user prompts.
Bold italic text is used to represent neural network responses.
As part of the experimental phase of the study, an extended dialogue was conducted with ChatGPT-4o to generate a series of introductory mathematical literacy tasks for 5th-grade students. The goal of this interaction was to test the effectiveness of the developed prompting strategy under conditions of contextual change, while maintaining the structural and cognitive complexity of the benchmark tasks.
The model was initially given a professional role using the instruction:
USER > You are a specialist in mathematics education developing tasks in accordance with the Federal State Educational Standards (FSES).
This technique, known as role prompting, activates generation patterns linked to expert-level content creation. Studies have shown that role framing improves both the internal consistency of responses (by 18–23%) and their alignment with expert reasoning in the field [23]. Aher et al. [24] further demonstrated that assigning roles such as “math teacher” improves a model’s ability to consider age-specific cognitive characteristics and avoid cognitive overload. For educational tasks, combining role prompting with Few-Shot learning is especially effective: the model not only simulates expert reasoning but also replicates the structure of exemplar tasks, as shown in experiments with GPT-4o [13], [25].
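To make this setup concrete, the combination of role prompting and Few-Shot learning can be sketched with the OpenAI Python client. This is a minimal illustration rather than the study's actual tooling; the model name, the file path for the example tasks, and the exact message wording are assumptions.

```python
# Minimal sketch of role prompting combined with Few-Shot learning,
# assuming the official OpenAI Python client. Model name and file
# path are illustrative assumptions, not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Role prompt: assign the model an expert persona (see the instruction above).
system_role = (
    "You are a specialist in mathematics education developing tasks "
    "in accordance with the Federal State Educational Standards (FSES)."
)

# Few-Shot context: benchmark tasks loaded from a file (hypothetical path).
with open("benchmark_tasks.txt", encoding="utf-8") as f:
    example_tasks = f.read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_role},
        {"role": "user", "content": example_tasks},
        {
            "role": "user",
            "content": (
                "Your task is to generate structurally and cognitively "
                "similar tasks using a different context. Ready?"
            ),
        },
    ],
)
print(response.choices[0].message.content)
```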
Following this, the model was provided with a text file containing a set of example tasks, along with a list of the core content elements relevant to the development and assessment of mathematical literacy for Grade 5 students [26].
The user then clarified:
USER > This is a complete task set. The tables describing “Task Characteristics” were based on the following parameters: [parameters described in Section 2.1]. Here's some additional information on what is meant by “assessment target” (subject-specific learning outcome): [...]
Your task is to generate structurally and cognitively similar tasks using a different context. Ready?
ChatGPT > Yes, ready! What context would you like to use? The examples were based on maritime transport, but we could use aviation, cars, space, sports, agriculture, or something else.
Also, please clarify the target age group so that I can maintain an appropriate difficulty level.
USER > Use the context of airports. Age group: 10–11 (5th grade, Russian school). Let’s start with the block of introductory tasks.
The model then generated three tasks, each with a description, reasoning, and a full characteristics table. The first task is shown below as an example:
Task 1
What is the area of the Domodedovo airport terminal? Provide your answer in square meters.
Answer format: Answer: ___ m²
Solution (5th-grade level):
The student looks at Table 1, finds the “Terminal Area (thousand m²)” column, and locates the value for Domodedovo: 350 thousand m².
To convert to square meters: 350 × 1000 = 350,000 m²
Answer: 350,000 m²
Table: Task Metadata (Task 1)

| Parameter | Value |
|---|---|
| Content domain | Uncertainty and data |
| Context | Education / professional activity |
| Cognitive process | Applying procedures |
| Assessment target | Unit conversion (length and area) |
| Difficulty level | 1 |
| Response format | Short answer |
| Scoring criteria | 1 point – correct answer; 0 – incorrect or missing answer |
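For semi-automated generation pipelines, a metadata table of this kind maps naturally onto a structured record, which makes it straightforward to check that every generated task carries complete metadata. The dataclass below is an illustrative sketch; its field names mirror the table, but the class itself is an assumption, not part of the study's materials.

```python
# Illustrative sketch: the "Task Metadata" table as a structured record.
# The class and field names are assumptions introduced for illustration.
from dataclasses import dataclass

@dataclass
class TaskMetadata:
    content_domain: str      # e.g. "Uncertainty and data"
    context: str             # e.g. "Education / professional activity"
    cognitive_process: str   # e.g. "Applying procedures"
    assessment_target: str   # e.g. "Unit conversion (length and area)"
    difficulty_level: int    # 1 = introductory
    response_format: str     # e.g. "Short answer"
    scoring_criteria: str    # rubric text

task1_meta = TaskMetadata(
    content_domain="Uncertainty and data",
    context="Education / professional activity",
    cognitive_process="Applying procedures",
    assessment_target="Unit conversion (length and area)",
    difficulty_level=1,
    response_format="Short answer",
    scoring_criteria="1 point – correct answer; 0 – incorrect or missing answer",
)
```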
After the completion of the introductory task block, the model was asked to move on:
USER > Now that the “introductory tasks” are complete and each task includes its metadata, let’s proceed to the “Did You Know?” instructional tasks.
The dialogue revealed that the model understood the need to generate tasks step by step, preserving the structure and logic of the original benchmark set. According to theoretical principles discussed earlier, tasks in later blocks should increase in complexity, contain more data for analysis, and include more cognitively demanding questions. To initiate this next step, the user issued the following prompt:
USER > Create a “Did You Know?” instructional block with 5 tasks. These tasks should help students identify potential errors they might have made during the introductory tasks and teach them to work with data more attentively. Provide full metadata for each task.
ChatGPT > Great, let's create a new “Did You Know?” block with 5 tasks, following the structure of the previous examples. We will use data on airports, and each question will focus on data analysis and error detection. Tasks will encourage critical thinking and attention to detail.
Context:
Petya is studying data on major Russian airports and analyzing their characteristics. He encounters a list of questions and must determine which ones can be answered using only the data in the table.
Table 2. Major Russian Airports

| Airport | City | Passenger Traffic (million/year) | Terminals | Avg. Delay (min) | Daily Flights |
|---|---|---|---|---|---|
| Sheremetyevo | Moscow | 45.8 | 5 | 18 | 1100 |
| Domodedovo | Moscow | 30.7 | 3 | 25 | 950 |
| Vnukovo | Moscow | 18.1 | 2 | 22 | 650 |
| Pulkovo | St. Petersburg | 19.6 | 1 | 30 | 480 |
| Koltsovo | Yekaterinburg | 7.6 | 2 | 35 | 320 |
| Tolmachevo | Novosibirsk | 8.1 | 2 | 40 | 400 |
Task 1
Which of the following questions can be answered using only the data in the table?
1. Which airport serves the most passengers?
2. Which airport has the most terminals?
3. How many flights per day are operated by Pulkovo?
4. What is the average delay at Vnukovo?
5. Which airport has the lowest passenger traffic?
6. Which airport has delays longer than 30 minutes?
Reasoning:
– Sheremetyevo has 45.8 million passengers – the highest.
– It also has 5 terminals – more than any other.
– Pulkovo has 480 daily flights.
– Vnukovo's average delay is 22 minutes.
– Koltsovo has the lowest traffic: 7.6 million.
– Tolmachevo has an average delay of 40 minutes.
Correct answers: 1, 2, 3, 4, 5, 6
Scoring:
– 2 points – all selected correctly;
– 1 point – at least 4 correct with no incorrect;
– 0 points – fewer than 4 correct or any wrong choice.
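This rubric is mechanical enough to automate. The following sketch shows how the multiple-selection rule could be scored programmatically; the function name and the set-based answer representation are assumptions introduced for illustration.

```python
# Sketch of the multiple-selection scoring rubric above (assumed helper,
# not part of the study's materials).
def score_selection(selected: set[int], correct: set[int]) -> int:
    """2 points: all selected correctly; 1 point: at least 4 correct
    with no incorrect choices; 0 points otherwise."""
    wrong = selected - correct
    hits = selected & correct
    if wrong:
        return 0               # any wrong choice voids the score
    if selected == correct:
        return 2               # all selected correctly
    if len(hits) >= 4:
        return 1               # at least 4 correct, none incorrect
    return 0

# Example: a student who picked five of the six correct options.
print(score_selection({1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}))  # -> 1
```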
Initially, the model attempted to reuse the same data for multiple blocks and failed to provide new question types. This was corrected with the following clarification:
USER > Each task block should start with a new data table. Right now, one table is being reused across two blocks. Each task must also include a metadata table.
The model updated the data but did not yet diversify the question types. Further clarification was needed:
USER > Add tasks for the “Find the Mistake” instructional block. Use the same context. Provide full task metadata and ensure each task reflects a new cognitive challenge.
The model responded with new tasks focusing on identifying logical errors in student reasoning.
Example Task – “Find the Mistake”
Two students analyzed the table and gave the following answers:
– Ivan: “Sheremetyevo serves more passengers than all other Russian airports combined.”
– Darya: “The average flight delay at Vnukovo is less than 30 minutes, and at Koltsovo it is more than 30 minutes.”
Who is correct?
Reasoning:
– Sheremetyevo: 45.8 million passengers.
– Others combined: 30.7 + 18.1 + 19.6 + 7.6 + 8.1 = 84.1 million.
→ Ivan is wrong.
– Vnukovo: 22 minutes < 30; Koltsovo: 35 > 30
→ Darya is correct.
Answer:
Darya is correct. Ivan is incorrect.
Scoring: 2 points – full correct explanation; 1 point – correct answer only; 0 – incorrect or no answer.
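Answer keys of this kind can also be verified directly against the table data, which is useful when validating model-generated solutions. The snippet below reproduces both checks; the dictionary encoding of Table 2 is an assumption made for illustration.

```python
# Sketch: verifying the two students' claims against Table 2.
# The dictionary encoding of the table is an illustrative assumption.
traffic = {   # passenger traffic, million/year
    "Sheremetyevo": 45.8, "Domodedovo": 30.7, "Vnukovo": 18.1,
    "Pulkovo": 19.6, "Koltsovo": 7.6, "Tolmachevo": 8.1,
}
delay = {"Vnukovo": 22, "Koltsovo": 35}  # average delay, minutes

# Ivan: Sheremetyevo serves more passengers than all others combined.
others = sum(v for k, v in traffic.items() if k != "Sheremetyevo")
ivan_correct = traffic["Sheremetyevo"] > others      # 45.8 > 84.1 -> False

# Darya: Vnukovo's delay is under 30 minutes, Koltsovo's is over 30.
darya_correct = delay["Vnukovo"] < 30 and delay["Koltsovo"] > 30  # True

print(f"Ivan: {ivan_correct}, Darya: {darya_correct}")
```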
The model was then prompted to generate a final assessment block that would integrate knowledge from the previous tasks and test students’ ability to reason across multiple dimensions. The resulting tasks included combined analyses of multiple indicators (e.g., terminals + traffic + delays).
The full task set demonstrates the model’s ability to follow structured reasoning chains and replicate task templates, provided it receives well-structured prompts and iterative feedback. This validates the proposed prompting strategy for semi-automated task generation in mathematics education.
The dialogue conducted with ChatGPT-4o demonstrated that the proposed prompting strategy, which combines role prompting, Few-Shot learning, and Chain-of-Thought reasoning, can effectively guide a generative language model in producing pedagogically sound tasks aligned with the cognitive and curricular profile of 5th-grade students. The model was able to replicate structural task templates, generate age-appropriate reasoning chains, and adhere to specified evaluation criteria.
However, the experiment also revealed several limitations, such as the tendency to repeat tabular data across task blocks, insufficient differentiation of question types, and occasional omission of task metadata. These were addressed through iterative refinements, which proved essential for achieving consistency and depth.
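In practice, such iterative refinement amounts to appending corrective user messages to the running conversation and regenerating. The loop below sketches this human-in-the-loop pattern with the OpenAI Python client; the validation function is a placeholder for human review or an automated check, not part of the study's actual workflow.

```python
# Schematic sketch of the human-in-the-loop refinement cycle described
# above, using the OpenAI Python client. check_output is a placeholder
# for a human reviewer or automated validator (an assumption).
from openai import OpenAI

client = OpenAI()
messages = [
    {"role": "system", "content": "You are a specialist in mathematics "
     "education developing tasks in accordance with the FSES."},
    {"role": "user", "content": "Create a 'Did You Know?' block with 5 tasks."},
]

def check_output(text: str) -> str | None:
    """Placeholder: return a corrective instruction, or None if acceptable."""
    if "metadata" not in text.lower():
        return "Each task must also include a metadata table."
    return None

for _ in range(3):  # bounded number of refinement rounds
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    text = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": text})
    correction = check_output(text)
    if correction is None:
        break
    messages.append({"role": "user", "content": correction})  # iterate
```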
Overall, the prompting strategy demonstrated the potential for scalable semi-automated task generation in mathematics education, while highlighting the necessity of human-in-the-loop oversight in ensuring content quality and alignment with learning goals.