Because CaLM is designed for language models, our selection of modes considers three distinct subcategories within the text mode: Natural, Symbolic, and Mathematical.
Natural includes conventional causal tasks that are posed and answered in everyday human language. This mode focuses on the intuitive, everyday use of language, and assesses how effectively language models understand and generate responses that align with typical human communication.
Symbolic refers to causal tasks presented in a symbolic form that carries no specific physical meaning (e.g., a causal graph represented in symbolic mode would be "A causes B, B causes D, C causes D"). The model's responses are given in a mixed format of natural language and symbolic representation. Using symbols to represent variables serves a dual purpose. (1) It aligns with traditional cognitive reasoning[1], where abstract symbols and logical structures enable reasoning beyond specific contexts. This approach leverages the generality and clarity of symbolic representations, facilitating logical inference and conceptual manipulation without the ambiguities of natural language. (2) It prevents the model from relying on biases memorized from the training data, offering a more accurate measure of the model's genuine causal inference capabilities. By abstracting variables into symbols, the focus shifts from content memorization to the application of logical reasoning, providing a clearer evaluation of the model's ability to deduce causality from a causal graph. Symbolic mode thus not only assesses the model's reasoning skills in a controlled environment but also paves the way for models that are more robust and that generalize beyond their training datasets.
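To make the symbolic example concrete, the following minimal sketch parses the statement "A causes B, B causes D, C causes D" into a directed graph and checks whether one variable is a (direct or indirect) cause of another. The function names `parse_symbolic_graph` and `is_cause_of` are hypothetical and introduced only for illustration; they are not part of CaLM.

```python
from collections import defaultdict


def parse_symbolic_graph(statement):
    """Parse 'A causes B, B causes D' into a mapping: cause -> set of direct effects."""
    children = defaultdict(set)
    for clause in statement.split(","):
        cause, effect = (part.strip() for part in clause.split(" causes "))
        children[cause].add(effect)
    return children


def is_cause_of(children, source, target):
    """Return True if a directed path leads from source to target."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child == target:
                return True
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return False


# Illustrative graph taken from the symbolic example in the text.
graph = parse_symbolic_graph("A causes B, B causes D, C causes D")
print(is_cause_of(graph, "A", "D"))  # True: A -> B -> D
print(is_cause_of(graph, "A", "C"))  # False: no directed path from A to C
```

Because the variables are bare symbols, answering such a query depends only on the graph structure, not on any real-world knowledge about what the variables denote.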
Mathematical consists of causal tasks that involve mathematical concepts, requiring the model to carry out mathematical operations and to answer with both probability values and natural language. Numerous studies have evaluated the mathematical abilities of language models, revealing that although these models excel in many natural language processing tasks, they still face significant challenges when solving mathematical problems[2,3,4,5,6,7,8,9,10]. We employ Mathematical mode because mathematical reasoning is fundamental to assessing the cognitive capabilities that underpin human intelligence. This mode tests models beyond mere linguistic fluency, probing their logical structure and capacity for conceptual understanding. Employing Mathematical mode not only highlights the current capabilities and limitations of language models in mimicking sophisticated cognitive functions, but also guides the development of more advanced models capable of complex thought processes akin to human reasoning.
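As a rough illustration of the kind of computation a Mathematical-mode question might require, the sketch below works through a two-variable graph A -> B with no confounders. All probabilities are made-up values, and the choice of quantities (a marginal probability and an average treatment effect) is an assumption for illustration, not a description of CaLM's actual task set.

```python
# Hypothetical parameters for a binary graph A -> B (not taken from CaLM).
P_A = 0.3                       # P(A = 1)
P_B_GIVEN_A = {1: 0.8, 0: 0.1}  # P(B = 1 | A = a)


def marginal_b(p_a, p_b_given_a):
    """P(B = 1) by the law of total probability."""
    return p_a * p_b_given_a[1] + (1 - p_a) * p_b_given_a[0]


def average_treatment_effect(p_b_given_a):
    """P(B = 1 | do(A = 1)) - P(B = 1 | do(A = 0)).

    With A -> B as the only edge, there is no confounding, so the
    interventional probabilities equal the conditional ones.
    """
    return p_b_given_a[1] - p_b_given_a[0]


print(round(marginal_b(P_A, P_B_GIVEN_A), 4))
print(round(average_treatment_effect(P_B_GIVEN_A), 4))
```

A model answering in this mode would be expected to report the numeric value together with a natural-language conclusion, for example whether A raises the probability of B.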
[1] Garcez, A. S., Lamb, L. C., and Gabbay, D. M. Neural-symbolic cognitive reasoning. Springer Science & Business Media, 2008.
[2] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[3] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[4] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023.
[5] Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[6] Dao, X.-Q. and Le, N.-B. Investigating the effectiveness of ChatGPT in mathematical reasoning and problem solving: Evidence from the Vietnamese national high school graduation examination. arXiv preprint arXiv:2306.06331, 2023.
[7] Wei, T., Luan, J., Liu, W., Dong, S., and Wang, B. CMATH: Can your language model pass Chinese elementary school math test? arXiv preprint arXiv:2306.16636, 2023.
[8] Wu, Y., Jia, F., Zhang, S., Wu, Q., Li, H., Zhu, E., Wang, Y., Lee, Y. T., Peng, R., and Wang, C. An empirical study on challenging math problem solving with GPT-4. arXiv preprint arXiv:2306.01337, 2023.
[9] Yuan, Z., Yuan, H., Tan, C., Wang, W., and Huang, S. How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015, 2023.
[10] Yu, L., Jiang, W., Shi, H., Yu, J., Liu, Z., Zhang, Y., Kwok, J., Li, Z., Weller, A., and Liu, W. MetaMath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2023.