Abacha BA, Hasan SA, Datla VV, Demner-Fushman D, Müller H (2019) VQA-Med: overview of the medical visual question answering task at ImageCLEF 2019. Proceedings of the Conference and Labs of the Evaluation Forum. https://ceur-ws.org/Vol-2380/paper_272.pdf
Alayrac JB, Donahue J, Luc P, Miech A, Barr I, Hasson Y, Lenc K, Mensch A, Millican K, Reynolds M, Ring R, Rutherford E, Cabi S, Han T, Gong Z, Samangooei S, Monteiro M, Menick J, Borgeaud S, Brock A, Nematzadeh A, Sharifzadeh S, Bińkowski M, Barreira R, Vinyals O, Zisserman A, Simonyan K (2022) Flamingo: a visual language model for few-shot learning. Adv Neural Inf Process Syst 35: 23716−23736
Chowdhery A, Narang S, Devlin J, Bosma M, Mishra G, Roberts A, Barham P, Chung HW, Sutton C, Gehrmann S, Schuh P, Shi K, Tsvyashchenko S, Maynez J, Rao A, Barnes P, Tay Y, Shazeer N, Prabhakaran V, Reif E, Du N, Hutchinson B, Pope R, Bradbury J, Austin J, Isard M, Gur-Ari G, Yin P, Duke T, Levskaya A, Ghemawat S, Dev S, Michalewski H, Garcia X, Misra V, Robinson K, Fedus L, Zhou D, Ippolito D, Luan D, Lim H, Zoph B, Spiridonov A, Sepassi R, Dohan D, Agrawal S, Omernick M, Dai AM, Pillai TS, Pellat M, Lewkowycz A, Moreira E, Child R, Polozov O, Lee K, Zhou Z, Wang X, Saeta B, Diaz M, Firat O, Catasta M, Wei J, Meier-Hellstern K, Eck D, Dean J, Petrov S, Fiedel N (2023) PaLM: scaling language modeling with pathways. J Mach Learn Res 24(240): 1−113
Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33: 1877−1901
Cai X, Liu S, Han J, Yang L, Liu Z, Liu T (2021) ChestXRayBERT: a pretrained language model for chest radiology report summarization. IEEE Trans Multimed 25: 845−855. https://doi.org/10.1109/TMM.2021.3132724
Chen J, Zhu D, Shen X, Li X, Liu Z, Zhang P, Krishnamoorthi R, Chandra V, Xiong Y, Elhoseiny M (2023a) MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning. arXiv: 2310.09478. https://doi.org/10.48550/arXiv.2310.09478
Chen YC, Li L, Yu L, El Kholy A, Ahmed F, Gan Z, Cheng Y, Liu J (2020) UNITER: universal image-text representation learning. Proceedings of the European Conference on Computer Vision. pp. 104−120
Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, Pagliardini M, Fan S, Köpf A, Mohtashami A, Sallinen A, Sakhaeirad A, Swamy V, Krawczuk I, Bayazit D, Marmet A, Montariol S, Hartley MA, Jaggi M, Bosselut A (2023b) MEDITRON-70B: scaling medical pretraining for large language models. arXiv: 2311.16079. https://doi.org/10.48550/arXiv.2311.16079
Cheng J, Ye J, Deng Z, Chen J, Li T, Wang H, Su Y, Huang Z, Chen J, Jiang L, Sun H, He J, Zhang S, Zhu M, Qiao Y (2023) SAM-Med2D. arXiv: 2308.16184. https://doi.org/10.48550/arXiv.2308.16184
Cui Y, Che W, Liu T, Qin B, Wang S, Hu G (2020) Revisiting pre-trained models for Chinese natural language processing. arXiv: 2004.13922. https://doi.org/10.48550/arXiv.2004.13922
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv: 1810.04805. https://doi.org/10.48550/arXiv.1810.04805
Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, Gao J, Zhou M, Hon HW (2019) Unified language model pre-training for natural language understanding and generation. Proceedings of the 33rd International Conference on Neural Information Processing Systems. pp. 13063−13075
Driess D, Xia F, Sajjadi MS, Lynch C, Chowdhery A, Ichter B, Wahid A, Tompson J, Vuong Q, Yu T, Huang W, Chebotar Y, Sermanet P, Duckworth D, Levine S, Vanhoucke V, Hausman K, Toussaint M, Greff K, Zeng A, Mordatch I, Florence P (2023) PaLM-E: an embodied multimodal language model. arXiv: 2303.03378. https://doi.org/10.48550/arXiv.2303.03378
Du N, Huang Y, Dai AM, Tong S, Lepikhin D, Xu Y, Krikun M, Zhou Y, Yu AW, Firat O, Zoph B, Fedus L, Bosma M, Zhou Z, Wang T, Wang YE, Webster K, Pellat M, Robinson K, Meier-Hellstern K, Duke T, Dixon L, Zhang K, Le QV, Wu Y, Chen Z, Cui C (2022) GLaM: efficient scaling of language models with mixture-of-experts. Proceedings of the 39th International Conference on Machine Learning. pp. 5547−5569
Eslami S, de Melo G, Meinel C (2021) Does CLIP benefit visual question answering in the medical domain as much as it does in the general domain? arXiv: 2112.13906. https://doi.org/10.48550/arXiv.2112.13906
Gardères F, Ziaeefard M, Abeloos B, Lecue F (2020) ConceptBert: concept-aware representation for visual question answering. Findings of the Association for Computational Linguistics: EMNLP 2020. pp. 489−498
Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H (2021) Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc 3(1): 1−23
Hu X, Gu L, Kobayashi K, An Q, Chen Q, Lu Z, Su C, Harada T, Zhu Y (2023) Interpretable medical image visual question answering via multi-modal relationship graph learning. arXiv: 2302.09636. https://doi.org/10.48550/arXiv.2302.09636
Kanakarajan KR, Kundumani B, Sankarasubbu M (2021) BioELECTRA: pretrained biomedical text encoder using discriminators. Proceedings of the 20th Workshop on Biomedical Language Processing. pp. 143−154
Kim S, Joo SJ, Kim D, Jang J, Ye S, Shin J, Seo M (2023) The CoT Collection: improving zero-shot and few-shot learning of language models via chain-of-thought fine-tuning. arXiv: 2305.14045. https://doi.org/10.48550/arXiv.2305.14045
Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo WY, Dollár P, Girshick R (2023) Segment anything. arXiv: 2304.02643. https://doi.org/10.48550/arXiv.2304.02643
Lan Z, Chen M, Goodman S, Gimpel K, Sharma P, Soricut R (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv: 1909.11942. https://doi.org/10.48550/arXiv.1909.11942
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. arXiv: 1901.08746. https://doi.org/10.48550/arXiv.1901.08746
Li C, Wong C, Zhang S, Usuyama N, Liu H, Yang J, Naumann T, Poon H, Gao J (2023a) LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. arXiv: 2306.00890. https://doi.org/10.48550/arXiv.2306.00890
Li P, Liu G, Tan L, Liao J, Zhong S (2023b) Self-supervised vision-language pretraining for medical visual question answering. arXiv: 2211.13594. https://doi.org/10.48550/arXiv.2211.13594
Liévin V, Hother CE, Motzfeldt AG, Winther O (2022) Can large language models reason about medical questions? arXiv: 2207.08143. https://doi.org/10.48550/arXiv.2207.08143
Liu Y, Wang Z, Xu D, Zhou L (2023) Q2ATransformer: improving medical VQA via an answer querying decoder. arXiv: 2304.01611. https://doi.org/10.48550/arXiv.2304.01611
Lu J, Batra D, Parikh D, Lee S (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv: 1908.02265. https://doi.org/10.48550/arXiv.1908.02265
Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu TY (2022) BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 23(6): bbac409. https://doi.org/10.1093/bib/bbac409
Luo Y, Zhang J, Fan S, Yang K, Wu Y, Qiao M, Nie Z (2023) BioMedGPT: open multimodal generative pre-trained transformer for biomedicine. arXiv: 2308.09442. https://doi.org/10.48550/arXiv.2308.09442
Ma L, Han J, Wang Z, Zhang D (2023) CephGPT-4: an interactive multimodal cephalometric measurement and diagnostic system with visual large language model. arXiv: 2307.07518. https://doi.org/10.48550/arXiv.2307.07518
Manmadhan S, Kovoor BC (2023) Parallel multi-head attention and term-weighted question embedding for medical visual question answering. Multimed Tools Appl 82: 34937−34958. https://doi.org/10.1007/s11042-023-14981-2
Moor M, Huang Q, Wu S, Yasunaga M, Zakka C, Dalmia Y, Reis EP, Rajpurkar P, Leskovec J (2023) Med-Flamingo: a multimodal medical few-shot learner. arXiv: 2307.15189. https://doi.org/10.48550/arXiv.2307.15189
Nori H, King N, McKinney SM, Carignan D, Horvitz E (2023) Capabilities of GPT-4 on medical challenge problems. arXiv: 2303.13375. https://doi.org/10.48550/arXiv.2303.13375
OpenAI (2022) Introducing ChatGPT. https://openai.com/blog/chatgpt
Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, Schulman J, Hilton J, Kelton F, Miller L, Simens M, Askell A, Welinder P, Christiano P, Leike J, Lowe R (2022) Training language models to follow instructions with human feedback. Adv Neural Inf Process Syst 35: 27730−27744
Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, Sastry G, Askell A, Mishkin P, Clark J, Krueger G, Sutskever I (2021) Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning. pp. 8748−8763
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8): 9. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 21(1): 5485−5551
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Sutskever I (2021) Zero-shot text-to-image generation. Proceedings of the 38th International Conference on Machine Learning. pp. 8821−8831
Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B (2022) High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684−10695
Scao TL, Fan A, Akiki C, Pavlick E, Ilić S, Hesslow D, Castagné R, Luccioni AS, Yvon F, Gallé M, Tow J, Rush AM, Biderman S, Webson A, Ammanamanchi PS, Wang T, Sagot B, Muennighoff N, Moral AV, Ruwase O, Bawden R, Bekman S, Major AM, Wolf T, Beltagy I, Nguyen H, Saulnier L, Tan S, Suarez PO, Sanh V, Laurençon H, Jernite Y, Launay J, Mitchell M, Raffel C (2022) BLOOM: a 176B-parameter open-access multilingual language model. arXiv: 2211.05100. https://doi.org/10.48550/arXiv.2211.05100
Sharma D, Purushotham S, Reddy CK (2021) MedFuseNet: an attention-based multimodal deep learning model for visual question answering in the medical domain. Sci Rep 11(1): 19826. https://doi.org/10.1038/s41598-021-98390-1
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Schärli N, Chowdhery A, Mansfield P, Agüera y Arcas B, Webster D, Corrado GS, Matias Y, Chou K, Gottweis J, Tomasev N, Liu Y, Rajkomar A, Barral J, Semturs C, Karthikesalingam A, Natarajan V (2022) Large language models encode clinical knowledge. arXiv: 2212.13138. https://doi.org/10.48550/arXiv.2212.13138
Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Hou L, Clark K, Pfohl S, Cole-Lewis H, Neal D, Schaekermann M, Wang A, Amin M, Lachgar S, Mansfield P, Prakash S, Green B, Dominowska E, Agüera y Arcas B, Tomasev N, Liu Y, Wong R, Semturs C, Mahdavi SS, Barral J, Webster D, Corrado GS, Matias Y, Azizi S, Karthikesalingam A, Natarajan V (2023) Towards expert-level medical question answering with large language models. arXiv: 2305.09617. https://doi.org/10.48550/arXiv.2305.09617
Tan H, Bansal M (2019) LXMERT: learning cross-modality encoder representations from transformers. arXiv: 1908.07490. https://doi.org/10.48550/arXiv.1908.07490
Taylor R, Kardas M, Cucurull G, Scialom T, Hartshorn A, Saravia E, Poulton A, Kerkez V, Stojnic R (2022) Galactica: a large language model for science. arXiv: 2211.09085. https://doi.org/10.48550/arXiv.2211.09085
Thawkar O, Shaker A, Mullappilly SS, Cholakkal H, Anwer RM, Khan S, Laaksonen J, Khan FS (2023) XrayGPT: chest radiographs summarization using large medical vision-language models. arXiv: 2306.07971. https://doi.org/10.48550/arXiv.2306.07971
Thoppilan R, De Freitas D, Hall J, Shazeer N, Kulshreshtha A, Cheng HT, Jin A, Bos T, Baker L, Du Y, Li Y, Lee H, Zheng HS, Ghafouri A, Menegali M, Huang Y, Krikun M, Lepikhin D, Qin J, Chen D, Xu Y, Chen Z, Roberts A, Bosma M, Zhao V, Zhou Y, Chang CC, Krivokon I, Rusch W, Pickett M, Srinivasan P, Man L, Meier-Hellstern K, Morris MR, Doshi T, Delos Santos R, Duke T, Soraker J, Zevenbergen B, Prabhakaran V, Diaz M, Hutchinson B, Olson K, Molina A, Hoffman-John E, Lee J, Aroyo L, Rajakumar R, Butryna A, Lamm M, Kuzmina V, Fenton J, Cohen A, Bernstein R, Kurzweil R, Agüera y Arcas B, Cui C, Croak M, Chi E, Le Q (2022) LaMDA: language models for dialog applications. arXiv: 2201.08239. https://doi.org/10.48550/arXiv.2201.08239
Tian Y, Gan R, Song Y, Zhang J, Zhang Y (2023) ChiMed-GPT: a Chinese medical large language model with full training regime and better alignment to human preferences. arXiv: 2311.06025. https://doi.org/10.48550/arXiv.2311.06025
Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, Rozière B, Goyal N, Hambro E, Azhar F, Rodriguez A, Joulin A, Grave E, Lample G (2023) LLaMA: open and efficient foundation language models. arXiv: 2302.13971. https://doi.org/10.48550/arXiv.2302.13971
Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang PC, Carroll A, Lau C, Tanno R, Ktena I, Mustafa B, Chowdhery A, Liu Y, Kornblith S, Fleet D, Mansfield P, Prakash S, Wong R, Virmani S, Semturs C, Mahdavi SS, Green B, Dominowska E, Agüera y Arcas B, Barral J, Webster D, Corrado GS, Matias Y, Singhal K, Florence P, Karthikesalingam A, Natarajan V (2023) Towards generalist biomedical AI. arXiv: 2307.14334. https://doi.org/10.48550/arXiv.2307.14334
Wang G, Yang G, Du Z, Fan L, Li X (2023a) ClinicalGPT: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv: 2306.09968. https://doi.org/10.48550/arXiv.2306.09968
Wang Z, Wu Z, Agarwal D, Sun J (2023b) MedCLIP: contrastive learning from unpaired medical images and text. arXiv: 2210.10163. https://doi.org/10.48550/arXiv.2210.10163
Wei J, Wang X, Schuurmans D, Bosma M, Xia F, Chi E, Le Q, Zhou D (2022) Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inf Process Syst 35: 24824−24837
Wu C, Lin W, Zhang X, Zhang Y, Wang Y, Xie W (2023a) PMC-LLaMA: an open-source language model for medical applications. arXiv: 2304.14454. https://doi.org/10.48550/arXiv.2304.14454
Wu S, Fei H, Qu L, Ji W, Chua TS (2023b) NExT-GPT: any-to-any multimodal LLM. arXiv: 2309.05519. https://doi.org/10.48550/arXiv.2309.05519
Wu Y, Wang S, Yang H, Zheng T, Zhang H, Zhao Y, Qin B (2023c) An early evaluation of GPT-4V(ision). arXiv: 2310.16534. https://doi.org/10.48550/arXiv.2310.16534
Xu H, Ghosh G, Huang PY, Arora P, Aminzadeh M, Feichtenhofer C, Metze F, Zettlemoyer L (2021) VLM: task-agnostic video-language model pre-training for video understanding. arXiv: 2105.09996. https://doi.org/10.48550/arXiv.2105.09996
Xu M (2023) MedicalGPT: training medical GPT models. https://github.com/shibing624/MedicalGPT
Yasunaga M, Bosselut A, Ren H, Zhang X, Manning CD, Liang PS, Leskovec J (2022a) Deep bidirectional language-knowledge graph pretraining. Adv Neural Inf Process Syst 35: 37309−37323
Yasunaga M, Leskovec J, Liang P (2022b) LinkBERT: pretraining language models with document links. arXiv: 2203.15827. https://doi.org/10.48550/arXiv.2203.15827
Ye F, Liu G, Wu X, Wu L (2023) AltDiffusion: a multilingual text-to-image diffusion model. arXiv: 2308.09991. https://doi.org/10.48550/arXiv.2308.09991
Yu Z, Yu J, Cui Y, Tao D, Tian Q (2019) Deep modular co-attention networks for visual question answering. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6281−6290
Zhan LM, Liu B, Fan L, Chen J, Wu XM (2020) Medical visual question answering via conditional reasoning. Proceedings of the 28th ACM International Conference on Multimedia. pp. 2345−2354
Zhang S, Roller S, Goyal N, Artetxe M, Chen M, Chen S, Dewan C, Diab M, Li X, Lin XV, Mihaylov T, Ott M, Shleifer S, Simig D, Koura PS, Sridhar A, Wang T, Zettlemoyer L (2022) OPT: open pre-trained transformer language models. arXiv: 2205.01068. https://doi.org/10.48550/arXiv.2205.01068
Zhang S, Xu Y, Usuyama N, Bagga J, Tinn R, Preston S, Rao R, Wei M, Valluri N, Wong C, Lungren MP, Naumann T, Poon H (2023) Large-scale domain-specific pretraining for biomedical vision-language processing. arXiv: 2303.00915. https://doi.org/10.48550/arXiv.2303.00915
Zhao H, Cai Z, Si S, Ma X, An K, Chen L, Liu Z, Wang S, Han W, Chang B (2023) MMICL: empowering vision-language model with multi-modal in-context learning. arXiv: 2309.07915. https://doi.org/10.48550/arXiv.2309.07915
Zhu D, Chen J, Shen X, Li X, Elhoseiny M (2023) MiniGPT-4: enhancing vision-language understanding with advanced large language models. arXiv: 2304.10592. https://doi.org/10.48550/arXiv.2304.10592