
How to make a multi-language PDF — RTL, font fallbacks

By ScoutMyTool Editorial Team · Last updated: 2026-05-20

A bilingual product manual, an Arabic/English contract, a research paper with Japanese figure captions: each requires the PDF to render multiple scripts correctly in every recipient's viewer, regardless of which fonts the recipient has installed. The mechanics are straightforward once you know the rules: embed fonts that cover the scripts you use, let the Unicode Bidirectional Algorithm handle the direction logic, and verify the result in two different viewers before distributing. This article walks through the script-by-script font choices, the bidirectional gotchas, and the export settings that produce multi-language PDFs that render reliably.

Scripts, direction, and recommended fonts

Script	Text direction	Recommended fonts
Latin (English, French, German, Spanish, etc.)	Left-to-right	Inter, Source Sans, Source Serif, Noto Sans/Serif
Arabic	Right-to-left (bidirectional with digits)	Noto Naskh Arabic, Amiri, Cairo
Hebrew	Right-to-left	Noto Sans/Serif Hebrew, David CLM
CJK (Chinese, Japanese, Korean)	Left-to-right (sometimes top-to-bottom in classical)	Noto Sans/Serif CJK SC/TC/JP/KR; Source Han Sans/Serif
Thai	Left-to-right (complex layout)	Noto Sans/Serif Thai, Sarabun
Devanagari (Hindi, Marathi, Sanskrit)	Left-to-right (complex ligatures)	Noto Sans/Serif Devanagari, Mangal
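
The table above amounts to a per-script font fallback rule. A minimal sketch of that idea in Python follows, assuming the script of a character can be read from the first word of its Unicode name. The mapping, the default font, and the function names are illustrative assumptions, not how Word or any specific exporter implements fallback.

```python
import unicodedata

# Hypothetical mapping: first word of a Unicode character name -> one font
# from the table above. The CJK regional variant (JP here) is an assumption;
# pick SC/TC/JP/KR to match your audience.
FONT_FOR_SCRIPT = {
    "LATIN": "Noto Sans",
    "ARABIC": "Noto Naskh Arabic",
    "HEBREW": "Noto Sans Hebrew",
    "CJK": "Noto Sans CJK JP",
    "HIRAGANA": "Noto Sans CJK JP",
    "KATAKANA": "Noto Sans CJK JP",
    "HANGUL": "Noto Sans CJK KR",
    "THAI": "Noto Sans Thai",
    "DEVANAGARI": "Noto Sans Devanagari",
}
DEFAULT_FONT = "Noto Sans"


def font_for_char(ch):
    """Return a font name, or None for script-neutral characters
    (spaces, digits, punctuation) and anything not in the mapping."""
    name = unicodedata.name(ch, "")
    script = name.split(" ", 1)[0]
    return FONT_FOR_SCRIPT.get(script)


def split_into_font_runs(text):
    """Split text into (font, substring) runs.

    Neutral characters join the run before them, or take the default
    font when they appear at the very start of the text.
    """
    runs = []  # list of [font, substring]
    for ch in text:
        font = font_for_char(ch)
        if font is None:
            font = runs[-1][0] if runs else DEFAULT_FONT
        if runs and runs[-1][0] == font:
            runs[-1][1] += ch
        else:
            runs.append([font, ch])
    return [(font, chunk) for font, chunk in runs]


if __name__ == "__main__":
    # "Hello " followed by the Arabic word for "hello", written with escapes
    # so the source file itself contains no right-to-left text.
    sample = "Hello \u0645\u0631\u062d\u0628\u0627"
    for font, chunk in split_into_font_runs(sample):
        print(font, repr(chunk))
```

Real layout engines use the Unicode Script property and the font's own character map rather than character names, so treat this only as a way to see which scripts a passage contains.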

Step by step — author a bilingual Arabic/English PDF

1. Set the document fonts. Body: Noto Sans for Latin, Noto Naskh Arabic for Arabic. Word handles font fallback per script automatically when both are installed.
2. Author bilingual content. Type Arabic; the cursor and alignment shift right-to-left. Switch back to English; the cursor shifts back. Use bidi markers (Insert → Symbol → U+200E / U+200F) for tricky punctuation cases.
3. Set per-paragraph direction for headings and standalone paragraphs. Word: Home → Paragraph → Direction. Right-to-left for Arabic-primary paragraphs.
4. Export with font embedding enabled. Word: Save As → PDF → Options → "Embed all fonts" (or "Best for printing" preset which embeds).
5. Verify in two readers. Open the exported PDF in Acrobat and Apple Preview (or Chrome). Both should render correctly; differences indicate font-embedding or bidi issues that need fixing in the source.
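
The bidi markers mentioned in the steps are ordinary invisible Unicode characters, so they can also be inserted programmatically when content is generated rather than typed. The sketch below is a hedged illustration in Python: the helper names are hypothetical, and it only decides which mark to use from the first strongly directional character of a fragment.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def first_strong_direction(text):
    """Return "rtl", "ltr", or None, based on the first character that has
    a strong direction. Digits, spaces, and punctuation are skipped."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew-type and Arabic-type letters
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None


def wrap_with_marks(fragment):
    """Surround a fragment with marks of its own direction so that leading
    or trailing punctuation stays attached to it. Fragments with no strong
    character are treated as left-to-right (an assumption of this sketch)."""
    mark = RLM if first_strong_direction(fragment) == "rtl" else LRM
    return f"{mark}{fragment}{mark}"


if __name__ == "__main__":
    # A Latin brand name with trailing symbols, destined for an Arabic
    # paragraph. Without the marks the "++" may be displayed on the wrong
    # side of the "C".
    wrapped = wrap_with_marks("C++")
    print([f"U+{ord(ch):04X}" for ch in wrapped])
```

For longer embedded passages, the directional isolate characters defined in UAX #9 are an alternative to marks; which one a given authoring tool preserves on export is worth checking in the two-reader verification step.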

CJK-specific considerations

Chinese, Japanese, and Korean scripts present three additional challenges beyond Latin or Arabic scripts. First, font size: CJK characters need slightly larger font sizes (12pt minimum, 14pt preferred) than Latin scripts because individual glyphs carry more visual information per character. Second, line height: CJK scripts often look better with line-height 1.6–1.8× rather than the 1.4× standard for Latin. Third, character coverage: even "complete" CJK fonts may miss rare characters (uncommon kanji, traditional vs simplified Chinese variants, Korean archaic forms). Noto Sans/Serif CJK is the most complete open-licensed coverage; for production use, verify the specific characters in your text render correctly before mass distribution.
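
The coverage check described above can be automated before distribution. The following Python sketch assumes you already have the set of code points a font covers (for example, exported from the font's character map with a font tool); the `covered_codepoints` argument and both function names are hypothetical.

```python
import unicodedata


def missing_characters(text, covered_codepoints):
    """Return the characters in `text` that are not in the font's coverage
    set, without duplicates and in first-seen order. Whitespace is ignored."""
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        if ord(ch) not in covered_codepoints and ch not in missing:
            missing.append(ch)
    return missing


def describe(chars):
    """Format characters as 'U+XXXX NAME' lines for a report."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in chars]


if __name__ == "__main__":
    # Toy coverage set containing only three kanji; a real set would hold
    # tens of thousands of code points.
    covered = {ord(ch) for ch in "\u65e5\u672c\u8a9e"}
    text = "\u65e5\u672c\u8a9e \u306e"
    for line in describe(missing_characters(text, covered)):
        print(line)
```

Code-point coverage is a necessary check, not a sufficient one: regional glyph variants can share a code point, so the visual check in a viewer still matters.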

For Japanese specifically, distinguish kanji (Chinese-origin characters), hiragana (phonetic syllabary), katakana (phonetic syllabary for foreign words), and Latin (English/loanwords). A complete Japanese PDF font covers all four. For Chinese, distinguish Simplified (mainland, Singapore) from Traditional (Taiwan, Hong Kong); they share many characters but several thousand differ. Pick the right variant for your audience.

Language tagging for accessibility

PDFs that contain multiple languages should tag each text region with its language, as PDF/UA (ISO 14289) requires. This lets screen readers switch pronunciation engines mid-document: read the English paragraph in English, then switch to Arabic pronunciation for the Arabic paragraph. Without tagging, screen readers attempt to read non-English text with English pronunciation rules, which is unintelligible. Most authoring tools tag languages automatically when the document uses Word/Docs language-specific paragraph styles; verify after export by opening the Tags panel in Acrobat (View → Show/Hide → Navigation Panes → Tags in many versions) and checking that each paragraph tag carries the expected language attribute.

Language tagging may also help search engines index multi-language PDFs correctly. Without tags, a search engine has to infer the document language from the text itself and may settle on the dominant script or language; with tags, each passage declares its language explicitly. For multinational publishers this can matter: tagging every paragraph correctly gives the document the best chance of surfacing in language-specific results for each language it contains.

FAQ

Why does my Arabic text appear as little boxes (□□□) in the exported PDF? The font you used to author the document does not support Arabic glyphs, and the PDF export did not substitute a font that does. The boxes are the "missing glyph" placeholder. Fix: use a font with Arabic coverage (Noto Naskh Arabic, Amiri, Cairo, or any system Arabic font) in the source document before export. Verify after export by opening the PDF and selecting one of the Arabic characters. The selection should highlight properly; if it does not, the glyph is missing.
How do I author bidirectional (LTR + RTL) mixed text? Modern word processors (Word 2016+, Google Docs, Pages) handle bidirectional text automatically once the text contains both scripts. Type Arabic and the cursor and alignment shift right-to-left; switch back to Latin and the cursor shifts left-to-right. The Unicode Bidirectional Algorithm (UAX #9) does the heavy lifting. There are two common pitfalls. First, punctuation at the end of an Arabic sentence that is followed by English: punctuation has no direction of its own, so the algorithm may place it visually at the wrong end. Insert a Unicode bidi mark next to it (U+200E LEFT-TO-RIGHT MARK or U+200F RIGHT-TO-LEFT MARK) to force the correct ordering. Second, embedded English brand names in an Arabic paragraph need surrounding bidi marks to render correctly in all readers.
How do I embed multiple language fonts without bloating the PDF? Use font subsetting. The exporter embeds only the glyphs actually used, not the full font file. A document using 200 unique Arabic glyphs embeds roughly 40 KB of Noto Naskh rather than the roughly 600 KB full font. Most exporters subset by default. Verify in Acrobat under File → Properties → Fonts: each listed font should say "Embedded subset" rather than "Embedded". The size of a multi-script PDF is dominated by content (images, page count) rather than fonts.
Will the PDF render correctly on a viewer that does not have the font installed? Yes, as long as the font is embedded in the PDF. The viewer uses the embedded font to draw glyphs; the local OS font catalogue is irrelevant. This is the whole point of PDF font embedding. The only case where the recipient sees substituted fonts is when fonts were not embedded during export. For non-Latin scripts this means missing-glyph boxes in many readers. Always embed fonts for multi-script documents; never assume the recipient's system has the necessary fonts.
How do I OCR a scanned PDF in a non-English language? Use Tesseract with the matching language pack (Arabic: `ara`; Hebrew: `heb`; Chinese Simplified: `chi_sim`; Japanese: `jpn`; etc.). ScoutMyTool Make PDF Searchable supports 12 common language packs pre-bundled and downloads more on demand. Accuracy is typically 95–98% for clean printed Latin-script text, 92–96% for Arabic and Hebrew (RTL adds complexity), and 90–95% for CJK scripts (large character sets). For mixed-language documents, enable multiple language packs at once: Tesseract accepts several language codes joined with `+` (for example `ara+eng`) and tries all of the loaded languages during recognition.
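
For readers running Tesseract themselves rather than through a tool, the OCR answer above roughly corresponds to the command built below. This is a sketch under assumptions: the file names are placeholders, the helper function is hypothetical, Tesseract and the named language packs must already be installed, and the input is a page image, so a scanned PDF has to be rasterized to images first.

```python
import subprocess


def tesseract_command(image_path, output_base, languages):
    """Build a Tesseract command line that writes a searchable PDF.

    `languages` is a list of language-pack codes such as ["ara", "eng"];
    Tesseract expects them joined with "+" after the -l option.
    The trailing "pdf" selects the built-in PDF output config, so the
    result is written to <output_base>.pdf.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages), "pdf"]


if __name__ == "__main__":
    # Placeholder file names for illustration only.
    cmd = tesseract_command("scan_page1.png", "scan_page1_ocr", ["ara", "eng"])
    print(" ".join(cmd))
    # Uncomment to run; requires the tesseract binary on PATH.
    # subprocess.run(cmd, check=True)
```

Listing the dominant language first is a common habit, though how much the order affects results depends on the document.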

Citations

Unicode Standard Annex #9 — Bidirectional Algorithm (UAX #9).
Google Noto Fonts — open-licensed font family covering 1,000+ languages.
ISO 32000-1:2008 — "Document management — Portable document format" — font embedding mechanics.
Tesseract OCR — open-source OCR with 100+ language packs.

Multi-language OCR in your browser

ScoutMyTool Make PDF Searchable supports 12 pre-bundled language packs and downloads more on demand. Client-side, no upload.

