Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Persian calendar conversion uses wrong algorithm #4713

Closed
roozbehp opened this issue Mar 20, 2024 · 18 comments · Fixed by #4796
Closed

Persian calendar conversion uses wrong algorithm #4713

roozbehp opened this issue Mar 20, 2024 · 18 comments · Fixed by #4796
Labels
C-calendar Component: Calendars

Comments

@roozbehp
Copy link
Contributor

roozbehp commented Mar 20, 2024

Looking at https:/unicode-org/icu4x/blob/main/components/calendar/src/persian.rs, it looks like the “proposed” 2820-year cycle arithmetic calendar has been implemented.

Unfortunately, that algorithm is wrong and fails in exactly a year from now, on March 20, 2025. While it should return 30 Esfand 1403 AP, it mistakenly returns 1 Farvardin 1404 AP.

The correct algorithm, as specified in Calendrical Calculations book, is the astronomical algorithm. That's what is followed on the ground (although with an unspecified location). Since that is too complex to implement, a simple arithmetic algorithm that works at least until ~2090 CE has been implemented in ICU4C and ICU4J. (See the comments at top of https:/unicode-org/icu/blob/main/icu4j/main/core/src/main/java/com/ibm/icu/util/PersianCalendar.java). That simple algorithm should be implemented in ICU4X as soon as possible, since we just entered the year 1403 AP, which is leap on the ground, but ICU4X miscalculates as non-leap.

Here's a quote from an email I sent to @echeran, @sffc, @Manishearth, @mihnita, and others on June 21, 2023:

The Persian calendar doesn't need a complex astronomical implementation for the foreseeable future. I don't recommend implementing an astronomical Persian calendar, except for research purposes. Instead, implement the Persian calendar the same way ICU and ICU4J implement it, which is a simple 33-year cycle. It matches the practice for many years into the future.

@sffc sffc added the C-calendar Component: Calendars label Mar 26, 2024
@sffc
Copy link
Member

sffc commented Mar 26, 2024

Thank you @roozbehp for highlighting this.

As you noted, Calendrical Calculations specifies two Persian calendar methods. The algorithm in section 15.2 uses astronomical calculations:

persian-new-year-on-or-before(date) = solar-longitude (midday-in-tehran (day)) <= spring + 2deg

The other algorithm in 15.3 uses the 2820 cycle.

I don't see one in the book that uses a pure 33-year cycle. Is the ICU implementation the standard for that? What are the other pros and cons of that approximation versus the 2820 approximation?

When we implemented the calendrical calculations in ICU4X, we endeavored to follow the Reingold algorithms in order to not be the source of truth of any approximations.

@sffc
Copy link
Member

sffc commented Mar 26, 2024

My understanding of the situation is: the Reingold text published an approximation for the Persian calendar, which we are calling the 2820 approximation, and that approximation is wrong in 2025 CE. @roozbehp suggests a different approximation, which works between 1925 and 2090 CE.

Here's one possible approach we could take. In the Islamic and Chinese calendars, we implement a table-based fast path, where we ship precomputed year/month data for a certain range (I think +/- 100ish years by default), and fall back to the astronomical calculations for dates outside that range. We could move Persian over to using that framework, and throw out the 2820 approximation.

What do you think @roozbehp @Manishearth @echeran?

@echeran
Copy link
Contributor

echeran commented Mar 26, 2024

I ran a small test and confirmed that ICU4J does indeed give us Persian date 1403-12-30 for Gregorian date 2025-03-20, which matches the book's persian-from-fixed function that implements the astronomical calendar. Meanwhile, we saw that the 2820-based arithmetic algorithm, which is arithmetic-persian-from-fixed in the book, gives Persian date 1404-01-01. So in other words, a Persian date gets skipped when going from consecutive Gregorian days:

offset Gregorian ICU4J astronomical (persian-from-fixed) 2820-based (arithmetic-persian-from-fixed)
-1 2025-03-19 1403-12-29 1403-12-29 1403-12-29
0 2025-03-20 1403-12-30 1403-12-30 1404-01-01
1 2025-03-21 1404-01-01 1404-01-01 1404-01-02
40 2025-04-29 1404-02-09 1404-02-09 1404-02-10
400 2026-04-24 1405-02-04 1405-02-04 1405-02-04

@sffc sffc added this to the ICU4X 2.0 milestone Mar 27, 2024
@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Mar 27, 2024
@roozbehp
Copy link
Contributor Author

roozbehp commented Mar 27, 2024

Trying to answer the questions raised, as well as adding a few points:

  1. "I don't see one in the book that uses a pure 33-year cycle." That is actually briefly mentioned on top of page 265: "this 33-year cycle agrees with the astronomical calendar over an even longer period, 1046–1468 AP (1621–2421 Gregorian) [sic?]." It's unfortunate that the book doesn't spend more pages on this simpler and more accurate arithmetic calendar instead of the 2820-year cycle one, but the 33-year cycle one is very well-known in Iran and is widely implemented in software.

  2. "Is the ICU implementation the standard for that?"
    Unfortunately I don't understand the question here.

  3. "What are the other pros and cons of that approximation versus the 2820 approximation?" The pros are that a) it's accurate for a longer period of time; b) it's simpler and slightly faster; and c) It's the same algorithm implemented in ICU and ICU4J. I don't know of any cons compared to the 2820-year cycle.

  4. "When we implemented the calendrical calculations in ICU4X, we endeavored to follow the Reingold algorithms in order to not be the source of truth of any approximations." Unfortunately we can't close our eyes to the actual usage on the ground. All the paper calendars published in Iran have 1403 AP as a leap year, which doesn't match the calculations of the 2820-year cycle.

  5. "@roozbehp suggests a different approximation, which works between 1925 and 2090 CE." That is correct, with the caveat that it may work for even a longer period of time, more on that below.

  6. "we ship precomputed year/month data for a certain range (I think +/- 100ish years by default), and fall back to the astronomical calculations for dates outside that range. We could move Persian over to using that framework, and throw out the 2820 approximation. What do you think?" That will work for me, with one caveat: the astronomical algorithm from the book will need to be slightly modified. More on that below.

  7. The Calendrical Calculations book, in the footnote on page 259, mentions that someone wrote to them and claimed that the 52.5 East meridian should be used instead of Tehran as the locale for the astronomical calculations, but that the authors' reading of the law doesn't specify a locale. While it's correct that the law doesn't specify a locale, I can confirm from conversations with calendar authorities in Iran as well as several Persian publications that the Iranian calendar authorities use the 52.5 E meridian in their calculations as a "neutral" location. The book mentions that the differences are not too many, but that the next such difference happens in 2091 CE. Indeed, according to the astronomical calculations in the book, 1469 AP would be non-leap and 1470 AP would be leap (this matches the tables published by the same authors in "Calendrical Tabulations 1900–2200"). But according to the Iranian calendar authorities, 1469 AP is leap and 1470 AP is not: https://calendar.ut.ac.ir/Fa/News/Data/Doc/KabiseShamsi1206-1498-new.pdf. If you make that locale change to the astronomical algorithm in the book, it will match the table published by the Iranian calendar authority linked above. (A good test date is 2091-03-20. According to the Iranian calendar authority tables and the modified astronomical algorithm using the 52.5 E meridian, it will be 1469/12/30 AP. According to the original astronomical algorithm and the Calendrical Tabulations book, it would be 1470/1/1 AP.)

  8. I have a partial Python port of the Lisp code from Calendrical Calculations at https:/roozbehp/persiancalendar, which I have modified according to the facts on the ground (which is just changing the locale used for calculating the true noon to 52.5 E meridian). The output of that code matches the table published by the Iranian calendar authority. It also matches the 33-year cycle algorithm implemented in ICU/ICU4J until early 2124 CE (the first day when they don't match is 2124-03-20, which should be 1503/1/1 AP, but the 33-year cycle incorrectly calculates as 1502/12/30 AP.)

  9. I think the quick solution going forward is simply replacing the algorithm used in ICU4X by the one used in ICU/ICU4J. That covers us until early 2124 CE, about a hundred years from now. (For the years before 1925 CE it really doesn't matter what we do, because it's proleptic anyway: this calendar didn't exist in this form until 1925 CE.)

  10. When we find the time, we can add (a modified) astronomical code for dates on or after 2124-03-20. We have about a hundred years to do that! 😉

@Manishearth
Copy link
Member

Is the ICU implementation the standard for that?

IMO this is a misunderstanding of how these calendar algorithms work: For most of them the standard is whatever the calendar authorities1 say (often a proxy for what happens in the actual skies), because that's actually what people use on the ground. Aside from the Hebrew calendar2 most of these things have no "standard" algorithm and rather have a general set of rules. It is rare that the relevant calendar authorities will actually publish an algorithm, it is sometimes even possible that the authorities will use different algorithms at different times.

This is also a reason as to why I'm not too bothered with ICU4X's Chinese calendar implementation diverging from ICU4C in a couple centuries. There's no right answer for those dates.

Basically, whenever there is ground truth to follow, we should follow it.

I think the quick solution going forward is simply replacing the algorithm used in ICU4X by the one used in ICU/ICU4J. That covers us until early 2124 CE, about a hundred years from now. (For the years before 1925 CE it really doesn't matter what we do, because it's proleptic anyway: this calendar didn't exist in this form until 1925 CE.)

This seems good.

Footnotes

  1. I say "authorities" loosely, in some cases it's more just "whoever is in the business of publishing calendars" like कालनिर्णय in India. They probably do some amount of coordination for future years and will notice when there is potential for mismatch.

  2. Which follows a very rigorous algorithm with well-defined approximations, where the approximations are intentional and expected. Put differently, that calendar drifts out of synch with the moon and it's okay with it (though if it gets too bad I'd expect them to tweak it eventually)

@sffc
Copy link
Member

sffc commented Mar 29, 2024

I apologize for sloppily using the word "standard." What I intended to ask was whether the ICU code is the source of truth for this approximation, as opposed to some textbook or wikipedia page or blog post Roozbeh made or something. I.e., what is the source that goes into the documentation.

@sffc
Copy link
Member

sffc commented Mar 29, 2024

In general I prefer not implementing approximations in the ICU4X code and instead using speedups like we've done with the Chinese calendar. I consider it a bug that the Persian code currently uses an approximation, and the bug is worse given that the approximation fails next year, but any approximation is still a bug. We have a system for implementing fast, non-approximations for calendrical calculations, so why not use it.

@sffc
Copy link
Member

sffc commented Apr 4, 2024

ICU4X Working Group discussion:

  • @Manishearth - The source of truth is not the astronomy, it's what people use in practice on the ground. In the case of the Persian calendar, that is provided by the government.
  • @younies - Should we be pulling from a cloud service?
  • @sffc - Even though we should provided what the ground truth is of what people use in practice, that information is based on the moon, which is the astronomy.
  • @Manishearth - That makes the assumption that the astronomy is correct. When the math is tricky, like within the period of floating point errors, humans have to make a call.
  • @sffc - I think the astronomy code is more correct than the approximation provided in the issue and the book approximation.
  • @Manishearth - I disagree because saying it is more correct because the only form of correctness that matters here is for past dates and near future dates. Far future dates are affected by ground truth and space truth concerns.
  • @sffc - If it is okay to make an approximation that works for only the next 100 years, then why can't we do the same thing for the Islamic calendar? Why is the Persian calendar a special case? If we come up with approximations for the Islamic and other such calendars, we can save on code size and code maintanence.
  • @Manishearth - The Persian calendar has more complex logic. Some of the Islamic calendars are based on algorithms solely, like one of the Hebrew calendars, which is fine. Having caching for these calendars is fine.
  • @sffc - It makes me concerned that we relied on the book's algorithms to write all our code, and now I hear strong support for the idea that a simpler approximation would have been preferable. Maybe what we should say is that, going forward, is "We will accept an approximation that works for at least 100 years, so long as it can be shown to better match the ground truth of what people use, than we we have implemented already."
  • @sffc - The other concern I have is about code size. A lot of code size comes from datetime, and in particular, the calendrical calculations, and largely because of the libm dependency.
  • @Manishearth: More strong version of my argument than I hold. "correctness" depends on use case, for the use case of 50 years out you have the "google calendar use case" for actual scheduling. After that it is an academic curiosity; "correctness" is a matter of correctness with some algorithm, and we are picking established algorithms from the Good Book.
  • @Manishearth: ideal path forward: we use the Roozbeh approximation for the next 100 years and then fall back to the book stuff. which may not be correct either, but it won't be known incorrect.
  • @sffc - Maybe we want to have a mode where we run calendrical calculations only in datagen (or look up data from a new cloud source) for +/- 100 years, and outside that range we return a bad approximation that at least doesn't pull in the expensive libm dependencies.
  • @Manishearth - Yeah, we could add something like that when people ask for it.
  • @echeran - This does seem like a change from my mental model. It is more nuanced. Like CLDR, we're not trying to get the perfect linguistic answer. The answer can drift over time. So it sounds reasonable that all we're trying to do is align with what people expect. I agree that it would have been nice to be aligned on this earlier in the development process.
  • @echeran - I agree about Persian feeling like a special case. Are we being nuanced or are we making this a special case?

Conclusion:

Shane's version:

  • ICU4X calendars should be performant for calculations in +/- 100 years.
    • This can be done via loaded data, or hardcoded tables, or approximations.
  • The algorithm employed should align with ground truth as much as possible for dates in that range.
  • For dates outside that range ("distant dates"), we employ some algorithm that is easy to reference. It does not need to be performant, but it should return something plausible.
    • Ground truth cannot be predicted for distant dates; such values are more of an academic concern.
    • For now, we generally use the astronomical calculations for distant dates.
    • In the future, if needed, we could have a flag to remove the libm dependency and use coarser approximations for distant dates.
  • In the narrow case of the Persian calendar, we use the Roozbeh approximation for the dates in which it is valid, and one of the book algorithms for dates outside that range. If using the book astronomical code, we adopt the Tehran reference point Roozbeh suggests.

LGTM: @sffc @Manishearth

Manish's version:

  • For the next 50/100 years we want our calendars to be performant and likely to match ground truth as much as possible. We should pick algorithms that work for those periods, and also produce accurate results for the past for as long as that calendar has been in use. This could in the limit involve hardcoding tables, using approximations, etc.
  • Further out than that, we want to have our calendars match some algorithm that is easy to reference, since "correctness" for these calendars is more of an academic concern: ground truth cannot be predicted that far out, and typically disagreements in approximations occur when there is going to be nitty-gritty decisionmaking on the ground ("do we look at the sun from east tehran or west tehran")
  • For Persian we should switch to the Roozbeh algorithm for the range in which it is valid and and optionally make it so that post 100/200 (whatever, wherever the cutoff is) years it falls back to the book math. We can also introduce tabular components if needed, especially if it's better for codesize to just pack an array of long/short years.

Agreed: @Manishearth, @younies, @echeran

@roozbehp
Copy link
Contributor Author

roozbehp commented Apr 4, 2024

Thanks a lot for the discussion and the detailed notes. I have a question: what is the theoretical date range we care about here? I have been thinking about compressed tables generated using the (slightly modified) astronomical algorithm from the Calendrical Calculations book (Persian calendar being relatively regular, we just need to keep leap year meta patterns). But I don't think the astronomical methods in the book return meaningful results beyond, say, 9999 CE (which is the end of ISO 8601), if that. Creating tables for all of i32 will take a lot of space and is probably noise anyways.

@roozbehp
Copy link
Contributor Author

roozbehp commented Apr 4, 2024

BTW, since the Iranian calendar authority website is not always accessible from outside Iran (and it's in Persian anyways), I just created a data file based on it and put it at:
https:/roozbehp/persiancalendar/blob/main/kabise.txt

@Manishearth
Copy link
Member

I have a question: what is the theoretical date range we care about here?

I think as far in the past as that version of the calendar has been in use (so e.g. even though the Japanese calendar is old, the specific gregorian-based version they use now is relatively new), and my general take is 100 years into the future.

@roozbehp
Copy link
Contributor Author

roozbehp commented Apr 5, 2024

I still don't know Rust, so I did a prototype for the replacement algorithm in Python. The code is very closely based on the Calendrical Calculations code, but with the 2820-year rule replaced by the 33-year rule and an override table. Its results match the (modified) astronomical algorithm from 1178 to 3000 AP (about 1799 to 3621 CE). If you don't need to go that much into the future, you can just drop some rows from the override table.

The code is very short and fast. It just needs a set membership lookup, which I hope Rust can do cheaply.

Here's the code (Apache-licensed, since it's based on the Calendrical Calculations code): https:/roozbehp/persiancalendar/blob/main/persiancalendar_fast.py

@roozbehp
Copy link
Contributor Author

roozbehp commented Apr 5, 2024

Ok, I couldn't sleep so I taught myself some Rust and created a pull request to replace the existing arithmetic rule with the one in ICU and ICU4J. This should fix the immediate concern about March 20, 2025 and will work correctly until the end of 1501 AP (~2122 CE). I can work on getting the override table in after that.

@sffc
Copy link
Member

sffc commented Apr 5, 2024

I have a question. For distant dates, say 1000 or more years in the future, is there a drift between the 33-year and 2820-year cycles relative to what we predict to be the ground truth (the astronomical calculations)? For example, the 33-year cycle has a 100% accuracy rate for the next 100 years, and >90% for at least a few hundred years after that. We know that the 2820-cycle has a ~90% accuracy rate for the next 100 years. But 1000 years in the future, would the 2820 cycle hold its 90% accuracy rate while the 33 cycle drifts away?

What I'm getting at is that we could perhaps keep both models and switch to 2820 over a certain bound.

@roozbehp
Copy link
Contributor Author

roozbehp commented Apr 6, 2024

For distant dates, say 1000 or more years in the future, is there a drift between the 33-year and 2820-year cycles relative to what we predict to be the ground truth (the astronomical calculations)? For example, the 33-year cycle has a 100% accuracy rate for the next 100 years, and >90% for at least a few hundred years after that. We know that the 2820-cycle has a ~90% accuracy rate for the next 100 years. But 1000 years in the future, would the 2820 cycle hold its 90% accuracy rate while the 33 cycle drifts away?

I did a quick comparison for the period 2403 AP (exactly 1000 years into the future) to 9378 AP (9999 CE, end of ISO 8601). The number of leap years that was wrong for the 33-year algorithm was 2511, compared to 2996 for the 2820-year algorithm. I also compared the period 1304 AP (introduction of the caldendar in 1925 CE) to 2403 AP (exactly a thousand years into the future). That came up with 70 wrong leap years for the 33-year rule and 205 wrong leap years for the 2820-year rule.

What I'm getting at is that we could perhaps keep both models and switch to 2820 over a certain bound.

Looks like the 2820-year is just simply worse over any period we may care about, even for years far into the future.

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Apr 9, 2024
@sffc
Copy link
Member

sffc commented Apr 11, 2024

Thanks for the comparison @roozbehp!

The number of leap years that was wrong for the 33-year algorithm was 2511, compared to 2996 for the 2820-year algorithm.

By "wrong" I assume you mean relative to the astronomical calculation?

If the calendar correctly determines a leap year, does it imply that the dates in that year (new year and month lengths) will be correct?

@roozbehp
Copy link
Contributor Author

By "wrong" I assume you mean relative to the astronomical calculation?

Yes. The “modified” astronomical calculation (using the meridian the Iranian calendar authority uses). That's the only source of truth we have outside of the period that the Iranian calendar authority publishes.

If the calendar correctly determines a leap year, does it imply that the dates in that year (new year and month lengths) will be correct?

Almost, but not exactly. If a calendar correctly determines a leap year, it would imply that the last day in that year and all the dates in the next year will be correct. It's a very good test for the Persian calendar, since by knowing all the leap years, you can quickly and unambiguously deduce everything else. In the Persian literature about the Persian calendar, certain years being leap or not is the shorthand for correctly calculating the calendar.

@roozbehp
Copy link
Contributor Author

Thanks everybody. I just filed https://unicode-org.atlassian.net/browse/ICU-22736 to sync ICU4C and ICU4J with this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-calendar Component: Calendars
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants